
Compress LLMs With SparseGPT

This page describes how to perform one-shot quantization of large language models using SparseML. This workflow requires a GPU with at least 16GB VRAM and 64GB of system RAM.

Note on system requirements

Due to inefficiencies in the PyTorch ONNX export, a large amount of system memory is required to export the model for inference. Improvements are expected in the 2.2 release.

How to Clone and Install the Latest SparseML

You'll need the latest version of SparseML to run the one-shot workflow. We recommend installing it from source in a fresh Python environment to avoid dependency issues.

Clone the SparseML repo and install it locally:

git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
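
To confirm the editable install is visible in your environment, you can query the installed version with the Python standard library. This is just a quick sanity check, not part of the workflow itself:

from importlib.metadata import version

# Prints the SparseML version picked up by the current environment.
print(version("sparseml"))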

How to One-Shot TinyLlama

TinyLlama-1.1B-Chat is a small LLM (1.1B parameters), so one-shot quantization completes in a short time.

Perform one-shot quantization with the sparseml.transformers.text_generation.oneshot command, which uses the OBCQ algorithm and takes the following parameters:

positional arguments:

  • model_name a local model path or Hugging Face model stub
  • dataset_name Hugging Face dataset to extract calibration data from. Supported datasets include: {c4,evolcodealpaca,gsm8k,open_platypus,ptb,wikitext2}

options:

  • --dataset_config_name specific configuration to extract from the dataset, e.g. wikitext-2-raw-v1 for wikitext2
  • --nsamples number of calibration samples to extract from the dataset, defaults to 512
  • --seqlen maximum input sequence length to truncate calibration data to, defaults to the model's maximum sequence length
  • --concat_data whether or not to concatenate samples to fill the full seqlen, defaults to False
  • --output_dir directory where the model will be saved, defaults to obcq_deployment
  • --recipe file containing the one-shot hyperparameters
  • --device device to load the model onto, either cpu or a specific CUDA device such as cuda:0
  • --precision precision to load the model as, one of auto (default), half, full, float16, or float32

Example command:

wget https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v0.4-pruned50-quant/raw/main/recipe.yaml # download recipe
sparseml.transformers.text_generation.oneshot --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dataset_name open_platypus --recipe recipe.yaml --output_dir ./obcq_deployment --precision float16

How to Evaluate the One-shot Model

Next, evaluate the model's performance using the lm-evaluation-harness framework.

Clone the forked repository with SparseML support and install it:

git clone https://github.com/neuralmagic/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Evaluate on the hellaswag task:

start=`date +%s`
python main.py \
--model hf-causal-experimental \
--model_args pretrained=obcq_deployment,trust_remote_code=True \
--tasks hellaswag \
--batch_size 64 \
--no_cache \
--write_out \
--output_path "obcq_deployment/hellaswag.json" \
--device "cuda:0" \
--num_fewshot 0
end=`date +%s`
echo Execution time was `expr $end - $start` seconds.

The results obtained in this case are:

Running loglikelihood requests
100%|██████████| 40145/40145 [20:47<00:00, 32.19it/s]
{
  "results": {
    "hellaswag": {
      "acc": 0.40141406094403503,
      "acc_stderr": 0.004891826692722827,
      "acc_norm": 0.5115514837681737,
      "acc_norm_stderr": 0.004988449593007253
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": {
      "pretrained": "/home/mwitiderrick/neuralmagic/sparseml/obcq_deployment",
      "trust_remote_code": true
    },
    "num_fewshot": 0,
    "batch_size": "64",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": true,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal-experimental (pretrained=/home/mwitiderrick/neuralmagic/sparseml/obcq_deployment,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 64
| Task | Version | Metric | Value | | Stderr |
| --------- | ------: | -------- | -----: | --- | -----: |
| hellaswag | 0 | acc | 0.4014 | ± | 0.0049 |
| | | acc_norm | 0.5116 | ± | 0.0050 |

Execution time was 1288 seconds.

Repeat the above on other tasks such as truthfulqa-mc, winogrande, and drop.
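
If you want to script those runs, here is a minimal sketch that repeats the same main.py invocation for each task from within Python. The flags and paths simply reuse the values from the command above and assume you are still inside the lm-evaluation-harness directory:

import subprocess

# Task names taken from the text above; adjust to match the harness's task registry.
tasks = ["truthfulqa-mc", "winogrande", "drop"]

for task in tasks:
    # Re-run the evaluation command once per task, writing each result to its own JSON file.
    subprocess.run(
        [
            "python", "main.py",
            "--model", "hf-causal-experimental",
            "--model_args", "pretrained=obcq_deployment,trust_remote_code=True",
            "--tasks", task,
            "--batch_size", "64",
            "--no_cache",
            "--write_out",
            "--output_path", f"obcq_deployment/{task}.json",
            "--device", "cuda:0",
            "--num_fewshot", "0",
        ],
        check=True,
    )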

How to Export the One-Shot Model

Once you are certain the model is performing as expected, you can export it for inference. The sparseml.export command handles this. Running the command below creates a deployment directory containing all the artifacts needed for inference with DeepSparse. The export also injects the KV cache into the model graph, which reduces computational overhead and speeds up inference by caching the key and value states.

sparseml.export --task text-generation obcq_deployment/
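
The exact set of exported files can vary by version, so a quick way to see what the export produced is to list the deployment directory. This is only an inspection step, not part of the export itself:

import os

# The export writes its artifacts into a deployment/ folder inside the model directory.
print(sorted(os.listdir("obcq_deployment/deployment")))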

Using the Model With DeepSparse

Next, run inference using DeepSparse. Ensure you have the latest version of DeepSparse installed with pip install -U "deepsparse[llm]".

from deepsparse import TextGeneration

prompt = "How to get in a good university?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

model = TextGeneration(model="obcq_deployment/deployment")
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
"""
There are many factors to consider when choosing a university. Here are some tips for getting into a good university:

1. Research your options: Consider the schools in your area and the ones in your desired location. Research their reputation, tuition, and academic programs.

2. Apply to multiple universities: Apply to multiple universities, ensuring that you are applying to the best option for you.

3. Get a job: If you are applying to a university, you will need to find a job to support your studies. This will help you budget and manage your time.

4. Get involved with your community: Your university will likely have a community of students and faculty. Engage with this community by volunteering, participating in clubs, and engaging with others in your community.

5. Get involved with extracurricular activities: Universities often have many extracurricular activities, which can help you meet new people.
"""

Check out the DeepSparse pipeline text generation docs for the full list of supported parameters.

Upload Model to Hugging Face

You may want to upload the one-shot model to Hugging Face for ease of reference or to share it with your colleagues.

Head over to your Hugging Face account and create a model repository named TinyLlama-1.1B-Chat-v0.4-pruned50-quant. Then upload the one-shot model:

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="obcq_deployment/deployment",  # deployment artifacts produced by sparseml.export
    repo_id="YOUR_HF_USERNAME/TinyLlama-1.1B-Chat-v0.4-pruned50-quant",
    repo_type="model",
    token="HF_WRITE_TOKEN",
)
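
Once the upload finishes, a colleague can pull the model back down and run it with DeepSparse. A minimal sketch, assuming the placeholder repo_id below is replaced with your actual repository:

from huggingface_hub import snapshot_download
from deepsparse import TextGeneration

# Download the uploaded artifacts to a local directory and point DeepSparse at them.
local_dir = snapshot_download(repo_id="YOUR_HF_USERNAME/TinyLlama-1.1B-Chat-v0.4-pruned50-quant")
model = TextGeneration(model=local_dir)
print(model("How to get in a good university?", max_new_tokens=100).generations[0].text)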

Explaining the TinyLlama Recipe

A recipe is a set of hyperparameters that provides detailed instructions on how the one-shot quantization should be performed. Because the quantization is applied in one shot, no retraining of the LLM is required.

We will now walk through what the different hyperparameters mean and why they are set to those values.

The SmoothQuantModifier is a technique for dealing with outliers in the weights and activations of the LLM, which matters because quantization is very sensitive to large variations in their values. For TinyLlama, a smoothing_strength value of 0.8 produced a model that repeated itself in its output; lowering the value to 0.5 solved the problem.

The ignore parameter under QuantizationModifier allows us to define operations that either don't make sense to quantize or operations that are too sensitive to quantize. Performing quantization on sensitive operations will affect the final accuracy of the model. We also don't quantize the inputs to the embedding layer.

Under SparseGPTModifier, we set sparsity to 0.5 because we are aiming for a model in which 50% of the weights are pruned to zero. The other parameters are:

  • block_size determines the number of columns to compress in one pass.
  • quantize whether or not to quantize weights during SparseGPT. A default quantization modifier will be applied when quantize is set to True and there is no QuantizationModifier in the recipe.
  • dampening_frac amount of dampening to apply to H, as a fraction of the diagonal norm.
  • sequential_update whether or not to update weights sequentially by layer, True saves on GPU memory.
  • mask_structure string to define the structure of the mask to apply, "0:0" means that it's an unstructured mask. Setting it to "16:32" would mean that 16 out of every 32 weights will be zeroed out (structured sparsity).
  • targets list of layer names to compress during OBCQ, or 'ALL' to compress every layer in the model.
Here is the full recipe for TinyLlama:

test_stage:
  obcq_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
      ]
    QuantizationModifier:
      ignore:
        # These operations don't make sense to quantize
        - LlamaRotaryEmbedding
        - LlamaRMSNorm
        - SiLUActivation
        # Skip quantizing the BMMs
        - QuantizableMatMul
        # Skip quantizing the layers with the most sensitive activations
        - model.layers.21.mlp.down_proj
        - model.layers.7.mlp.down_proj
        - model.layers.2.mlp.down_proj
        - model.layers.20.mlp.down_proj
        - model.layers.19.mlp.down_proj
      post_oneshot_calibration: true
      scheme_overrides:
        Embedding:
          input_activations: null
          weights:
            num_bits: 8
            symmetric: false
    SparseGPTModifier:
      sparsity: 0.5
      block_size: 128
      sequential_update: true
      quantize: true
      percdamp: 0.01
      mask_structure: "0:0"
      targets: ["re:model.layers.\\d*$"]

How to Adapt a Recipe for a New Model

You can modify the above recipe to perform one-shot quantization on other models, for example Mistral.

Make the following modifications to the recipe to one-shot a Mistral model.

  • Define the operations to skip during quantization, that is, the sensitive layers and the operations that don't make sense to quantize.
  • Declare the desired sparsity level, the same as the one used for TinyLlama.
  • State the layers to compress during OBCQ.

Here is what the final recipe looks like:

test_stage:
  obcq_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
      ]
    QuantizationModifier:
      ignore:
        # These operations don't make sense to quantize
        - MistralRotaryEmbedding
        - MistralRMSNorm
        - SiLUActivation
        # Skip quantizing the layers with the most sensitive activations
        - model.layers.1.mlp.down_proj
        - model.layers.31.mlp.down_proj
        - model.layers.30.mlp.down_proj
        - model.layers.30.mlp.gate_proj
        - model.layers.30.mlp.up_proj
      post_oneshot_calibration: true
      scheme_overrides:
        Embedding:
          input_activations: null
          weights:
            num_bits: 8
            symmetric: false
    SparseGPTModifier:
      sparsity: 0.5
      block_size: 128
      sequential_update: true
      quantize: true
      percdamp: 0.01
      mask_structure: "0:0"
      targets: ["re:model.layers.\\d*$"]

Save the recipe to a file named recipe.yaml.
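
Before launching the run, you can quickly confirm that the file parses and contains the three modifiers discussed above. A small sanity check using PyYAML, which is installed with the stack above:

import yaml

# Load the recipe and list the modifiers it defines.
with open("recipe.yaml") as f:
    recipe = yaml.safe_load(f)
print(list(recipe["test_stage"]["obcq_modifiers"].keys()))
# Expected: ['SmoothQuantModifier', 'QuantizationModifier', 'SparseGPTModifier']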

Run one-shot quantization on any Mistral-based model, for example, zephyr-7b-beta:

sparseml.transformers.text_generation.oneshot --model_name HuggingFaceH4/zephyr-7b-beta --dataset_name open_platypus --recipe recipe.yaml --output_dir ./output_oneshot --precision float16

We set precision to float16 because quantization is not supported for the bfloat16 data type as of this writing.

Repeat the evaluation, export, and deployment steps as shown previously.

Conclusion

If you have any questions, submit an issue on GitHub or join other LLM developers in our community.