Neural Magic LogoNeural Magic Logo
DeepSparse EngineSparseMLSparseZoo
Optimize for Inference

Optimize a Model for Inference

The Neural Magic Platform enables you to optimize your models for inference with sparsity.


There are multiple factors to consider when creating a deep learning model. In training, accuracy on the test-set is the primary metric. In deployment, however, the performance (latency/throughput) of the model becomes an important consideration at production scale.

However, as Deep Learning has exploded and state-of-the-art models have grown bigger and bigger, performance and accuracy have been increasingly at odds.

Sparsity: Improve Performance While Maintaining High Accuracy

SparseML and SparseZoo work together to help users create performance-optimized models while mimizing accuracy loss, using sparsification techniques called pruning and quantization.

Importantly, they support training-aware pruning and quantization algorithms (as well as post-training). Training-aware techniques apply the sparsification gradually, allowing the model to adjust by fine-tuning the remaining weights with the training dataset at each step. This technique is critical to maintain high accuracy at the high levels of sparsity needed to reach GPU-class performance.

Conceptual Guide

What is Pruning?

Pruning is the process of removing weights from a trained deep learning model by setting them to zero. Pruning can speed up a model, because inference runtimes implement optimizations that "skip" the multiply-adds by zero, reducing the needed computation.

There are two types of pruning that can be applied to a model:

  1. Structured Pruning - weights are pruned in groups (e.g. entire channels or nodes)
  2. Unstructured Pruning - weights (or small groups of weights) can be pruned in any pattern

With Structured Pruning, it is easy for an inference runtime to include optimizations that speed-up the model and most runtimes will benefit from this type of pruning. However, structured pruning can have large negative impacts on accuracy of the model, especially at the high levels of sparsity needed to see speedups.

With Unstructured Pruning, it is very hard for an inference runtime to include optimizations that speed-up the model (as far as we know, DeepSparse is the only production-grade runtime focused on speed-ups from unstructured pruning). The benefit of unstructured pruning, however, is that sparsity can be pushed to very high levels while maintaining high levels of accuracy. With both CNNs (ResNet-50) and Transformers (BERT-base), Neural Magic has pruned 95% of weights while maintaining 99% of the accuracy as the baseline models.

What is Quantization?

Quantization is a technique to reduce computation and memory usage by converting the parameters and activations of a model from a high precision format like FP32 (which is the default for a deep learning model) to a low precision format like INT8.

By using lower precision, runtimes can reduce memory footprint and perform operations like matrix multiply faster. Additionally, quantization can be combined with unstructured pruning to gain additional speedup, a concept we call Compound Sparsity.

Training-Aware Algorithms

Broadly, there are two ways that pruning and quantization can be applied to a model:

  1. Post-Training - this is where sparsity is applied in one-pass with no training data
  2. Training Aware - this is where sparsity is applied gradually and the non-zero weights are adjusted with training data

Post-Training pruning and quantization optimizations are easier to apply to a model. However, these techniques often create signficant drops in accuracy, as the model does not have a chance to re-adjust to the optimization space.

Training-Aware pruning and quantization, by contrast, require setting up a training pipeline and implementing complex algorithms. However, applying the pruning and quantization gradually and fine-tuning the non-zero weights with training data enables accuracy to recover to 99% of the baseline dense model even as sparsity is pushed to very high levels.

SparseML uses Training-Aware Unstructured Pruning and Training-Aware Quantization to create very sparse models that sacrifice very little accuracy.

How to Create an Inference-Optimized Sparse Model

SparseML and SparseZoo extend PyTorch and TensorFlow with features for creating sparse models trained on custom data.

Together, they enable two workflows:

  1. Sparse Transfer Learning: fine-tune a pre-sparsified foundation model (like ResNet-50 or BERT) from the SparseZoo onto a custom dataset
  2. Sparsification From Scratch: apply training-aware pruning and quantization algorithms to any trained PyTorch, TensorFlow, and Hugging Face model, with fine-grained control of hyperparameters

Sparse Transfer Learning

Sparse Transfer Learning is the easiest path to creating a sparse model trained on custom data and is preferred for any scenario where a pre-sparsified foundation model exists in SparseZoo.

Neural Magic's research team has invested many hours in creating state-of-the-art pruned and quantized verisons of popular foundation models trained on large open datasets. These state-of-the-art models (including the hyperparameters of the sparsification process) are publically available in the SparseZoo.

SparseML enables users to fine-tune the pre-sparsified models in SparseZoo onto custom data while maintaining the same level of sparsity (which we call "Sparse Transfer Learning"). Under the hood, SparseML extends PyTorch and TensorFlow to only update non-zero weights during backprogation. Users, then, can Sparse Transfer Learn with just a single CLI command or five additional lines of code around a custom PyTorch training loop.

This means that any engineer (without deep knowledge of cutting-edge sparsity algorithms) can easily create accurate, inference-optimized sparse models for their specific use cases.

Additional Resources

Sparsification From Scratch

Sparsification From Scratch can be applied to any model, providing power-users a path to create sparse versions of any model.

As described in the conceptual section above, Training-Aware Unstructured Pruning and Training-Aware Quantization are the best techniques for creating models with the highest levels of sparsity without suffering from much accuracy degradation.

Gradual Magnitude Pruning (GMP) is the best algorithm for unstructured pruning:

  • With GMP, pruning occurs gradually over a training run. Over several epochs or training steps the least impactful weights are iteratively removed. The non-zero weights are then fine-tuned to the objective function. This iterative process enables the model to adjust to a new optimization space after pathways are removed before pruning again.

Quantization Aware Training (QAT) is the best algorithm for quantization:

  • With QAT, fake quantization operators are injected into the graph before quantizable nodes for activations, and weights are wrapped with fake quantization operators. The fake quantization operators interpolate the weights and activations down to INT8 on the forward pass but enable a full update of the weights at FP32 on the backward pass. The updates to the weights at FP32 throughout the training process allow the model to adapt to the loss of information from quantization on the forward pass.

Applying these algorithms correctly in an ad-hoc way is challenging. As such, Neural Magic created SparseML, which implements these algorithms on top of PyTorch and TensorFlow.

Using SparseML, users can apply these algorithms to their trained PyTorch and TensorFlow models with just five additional lines of code around a training loop. This enables ML Engineers to shift focus and time from (re)building sparsity algorithms to running experiments and tuning hyperparameters of the pruning and quantization process.

Ultimately, creating a sparse model from scratch is a form of architecture search. This is an inherently “research-like” exercise, which requires tuning hyperparameters of GMP and QAT and running experiments to test accuracy with various changes to the model. SparseML dramatically increases the productivity of developers running these experiements.

Additional Resources

Neural Magic Documentation
Deploy on CPUs