Version: nightly

Optimizing LLMs

This guide delves into optimizing large language models (LLMs) for efficient text generation using neural network compression techniques like sparsification and quantization. You'll learn how to:

  • Sparsify Models: Apply pruning techniques to eliminate redundant parameters from an LLM, reducing its size and computational requirements.
  • Quantize Models: Lower the numerical precision of model weights and activations for faster inference with minimal impact on accuracy.
  • Evaluate Performance: Measure the impact of sparsification and quantization on model performance and accuracy.


Prerequisites

  • Training Environment: A system that meets the minimum hardware and software requirements as outlined in the Install Guide.
  • SparseML LLM Installation: An environment with SparseML for LLMs installed as outlined in the Install Guide.
  • Background: Familiarity with Generative AI and working with large language models is recommended.

Sparsifying a Llama Model

We'll use a pre-trained, unoptimized Llama 2 7B chat model from the SparseZoo. The model is referenced by the following SparseZoo stub:


For additional models that work with SparseML, consider the following options:

Data Preparation

SparseML requires a dataset to be used for calibration during the sparsification process. For this example, we'll use the Open Platypus dataset, which is available in the Hugging Face dataset hub and can be loaded as follows:

from datasets import load_dataset

dataset = load_dataset("garage-bAInd/Open-Platypus")

For comprehensive data preparation guidelines, including formats like CSV and JSONL, refer to our detailed datasets guide.
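
As a rough illustration of the JSONL format mentioned above, the sketch below writes two made-up instruction/output records to a JSONL file and reads them back using only the standard library. The `instruction` and `output` field names mirror the Open Platypus columns used later in this guide. The sample records, the `write_jsonl` and `read_jsonl` helpers, and the file name are assumptions for illustration, not a schema that SparseML requires.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical calibration records, invented for this example.
sample_records = [
    {"instruction": "Add 2 and 3. ", "output": "5"},
    {"instruction": "Name a prime number. ", "output": "7"},
]

jsonl_path = os.path.join(tempfile.mkdtemp(), "calibration.jsonl")
write_jsonl(jsonl_path, sample_records)
loaded_records = read_jsonl(jsonl_path)

# Concatenate the two fields into a single calibration text.
print(loaded_records[0]["instruction"] + loaded_records[0]["output"])
```

Keeping one record per line lets a large calibration file be streamed rather than loaded whole.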

One Shot

Recipes, the SparseGPT algorithm, and SparseML's compress command together make it possible to prune and quantize an LLM without fine-tuning. This one-shot approach is a quick way to sparsify a model: it reaches moderate compression levels with minimal accuracy loss, which enables efficient inference.
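
To make the idea of one-shot pruning concrete, here is a minimal sketch that zeroes the smallest-magnitude half of a small weight list in a single pass, with no retraining. It uses plain magnitude pruning as a stand-in. SparseGPT chooses and updates weights differently, so `magnitude_prune`, `measure_sparsity`, and the toy weights are hypothetical illustrations rather than SparseML's implementation.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


def measure_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


toy_weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6]
pruned = magnitude_prune(toy_weights, sparsity=0.5)
print(pruned)
print(measure_sparsity(pruned))
```

The large-magnitude weights survive unchanged, which is the intuition behind why a one-shot pass can remove many parameters while retaining most of the model's behavior.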

The code below demonstrates applying one-shot sparsification to the Llama chat model utilizing a recipe. The recipe specifies using the SparseGPTModifier to apply 50% sparsity and quantization (int8 weights and activations) to the targeted layers within the model.

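from sparseml.transformers import (
    SparseAutoModelForCausalLM, SparseAutoTokenizer, load_dataset, compress
)

# Load the model and tokenizer from the SparseZoo stub for the Llama 2 7B chat model.
model = SparseAutoModelForCausalLM.from_pretrained(...)
tokenizer = SparseAutoTokenizer.from_pretrained(...)
dataset = load_dataset("garage-bAInd/Open-Platypus")

def format_data(data):
    # Join each instruction with its response into the single "text" field.
    return {
        "text": data["instruction"] + data["output"]
    }

# Apply the formatting function to every record in the dataset.
dataset = dataset.map(format_data)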

recipe = """
run_type: oneshot
ignore: [LlamaRotaryEmbedding, LlamaRMSNorm, SiLUActivation, MatMulOutput_QK, MatMulOutput_PV]
post_oneshot_calibration: true
num_bits: 8
symmetric: true
strategy: channel
num_bits: 8
symmetric: true
input_activations: null
num_bits: 8
symmetric: false
sparsity: 0.5
quantize: True
targets: [model.layers.0, model.layers.1, model.layers.2, model.layers.3, model.layers.4, model.layers.5, model.layers.6, model.layers.7, model.layers.8, model.layers.9, model.layers.10, model.layers.11, model.layers.12, model.layers.13, model.layers.14, model.layers.15, model.layers.16, model.layers.17, model.layers.18, model.layers.19, model.layers.20, model.layers.21, model.layers.22, model.layers.23, model.layers.24, model.layers.25, model.layers.26, model.layers.27, model.layers.28, model.layers.29, model.layers.30, model.layers.31, lm_head]
"""


After running the above code, the model is pruned to 50% sparsity and quantized, resulting in a smaller model ready for efficient inference.
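
The recipe's `num_bits: 8` and `symmetric: true`/`false` settings can be illustrated with a small, framework-free sketch of int8 quantization. It shows one common way to map floats to 8-bit integers, with and without a zero point, and then map them back. The function names, the toy values, and the min/max calibration rule are assumptions for illustration. SparseML's quantization observers may compute scales differently.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax]; the zero point is fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantize_asymmetric(values, num_bits=8):
    """Map floats to unsigned integers in [0, qmax] using a scale and a zero point."""
    qmax = 2 ** num_bits - 1  # 255 for 8 bits
    low = min(min(values), 0.0)   # widen the range so 0.0 stays representable
    high = max(max(values), 0.0)
    scale = (high - low) / qmax if high > low else 1.0
    zero_point = round(-low / scale)
    quantized = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point=0):
    """Recover approximate floats from quantized integers."""
    return [(q - zero_point) * scale for q in quantized]


toy_values = [-0.62, -0.1, 0.0, 0.27, 0.95]
sym_q, sym_scale = quantize_symmetric(toy_values)
asym_q, asym_scale, asym_zp = quantize_asymmetric(toy_values)

# Worst-case reconstruction error for each scheme.
sym_error = max(abs(a - b) for a, b in zip(toy_values, dequantize(sym_q, sym_scale)))
asym_error = max(
    abs(a - b) for a, b in zip(toy_values, dequantize(asym_q, asym_scale, asym_zp))
)
print(sym_q, sym_error)
print(asym_q, asym_error)
```

The symmetric scheme keeps zero at integer 0, while the asymmetric scheme spends its full integer range on the observed minimum and maximum, which can help when values are not centered on zero.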


Evaluating Accuracy

Evaluating the model's accuracy is important to ensure it meets the desired performance requirements. To do so, we can use the following code to evaluate the model's perplexity on a sample dataset:

from sparseml import evaluate

results = evaluate(
    ...,  # path to the compressed model, plus the dataset and metric settings
    text_column_name=["prompt", "canonical_solution"],
)
print(results)
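
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token, so lower is better. The sketch below computes it from a list of per-token log-probabilities. The `perplexity` helper and the example probabilities are hypothetical and are not part of SparseML's evaluate API.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p(token))) over the evaluated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 choices per token.
uniform_log_probs = [math.log(0.25)] * 8
# A model that assigns 0.9 probability to each correct token.
confident_log_probs = [math.log(0.9)] * 8

print(perplexity(uniform_log_probs))
print(perplexity(confident_log_probs))
```

Comparing the perplexity of the compressed model against the dense baseline on the same text gives a quick signal of how much accuracy the compression cost.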

Once sparsified, the model is ready for evaluation and deployment. To test its generation capabilities, we can use the following code to generate text with PyTorch:

from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer

model_path = "./one-shot-example/stage_compression"
model = SparseAutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = SparseAutoTokenizer.from_pretrained(model_path)
# Tokenize the prompt and move the input tensors to the model's device.
inputs = tokenizer(["Large language models are"], return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

The above code, however, does not leverage the model's sparsity for efficient inference. To do that, export the model to ONNX so it can run efficiently on CPUs with DeepSparse. SparseML provides a simple export command for this:

from sparseml import export


The exported model located at ./exported can now be used for efficient inference with DeepSparse. To do so, substitute the exported model into your desired deployment method in the Getting Started - Deploy guide.

Want to dive deeper into one-shot sparsification with Neural Magic? Here are a few paths to consider: