The Neural Magic Platform enables you to optimize your models for inference with sparsity.
There are multiple factors to consider when creating a deep learning model. During training, accuracy on the test set is the primary metric. In deployment, however, the performance (latency/throughput) of the model becomes an important consideration at production scale.
However, as deep learning has exploded and state-of-the-art models have grown bigger and bigger, performance and accuracy have been increasingly at odds.
SparseML and SparseZoo work together to help users create performance-optimized models while minimizing accuracy loss, using sparsification techniques called pruning and quantization.
Importantly, they support training-aware pruning and quantization algorithms (as well as post-training). Training-aware techniques apply the sparsification gradually, allowing the model to adjust by fine-tuning the remaining weights with the training dataset at each step. This technique is critical to maintain high accuracy at the high levels of sparsity needed to reach GPU-class performance.
Pruning is the process of removing weights from a trained deep learning model by setting them to zero. Pruning can speed up a model, because inference runtimes implement optimizations that "skip" the multiply-adds by zero, reducing the needed computation.
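For intuition, here is a minimal sketch of magnitude-based pruning on a single weight tensor (the tensor shape and 90% sparsity target are illustrative, not taken from the text):

```python
import torch

# Illustrative dense weight tensor; in practice this would be a layer of a trained model.
weight = torch.randn(512, 512)

# Prune 90% of the weights: zero out those with the smallest absolute magnitude.
sparsity = 0.9
threshold = torch.quantile(weight.abs().flatten(), sparsity)
mask = (weight.abs() > threshold).float()
pruned_weight = weight * mask

print(f"Sparsity: {(pruned_weight == 0).float().mean():.2%}")  # ~90% zeros
```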
There are two types of pruning that can be applied to a model:
With Structured Pruning, it is easy for an inference runtime to include optimizations that speed up the model, and most runtimes benefit from this type of pruning. However, structured pruning can have a large negative impact on the accuracy of the model, especially at the high levels of sparsity needed to see speedups.
With Unstructured Pruning, it is very hard for an inference runtime to include optimizations that speed up the model (as far as we know, DeepSparse is the only production-grade runtime focused on speed-ups from unstructured pruning). The benefit of unstructured pruning, however, is that sparsity can be pushed to very high levels while maintaining high levels of accuracy. With both CNNs (ResNet-50) and Transformers (BERT-base), Neural Magic has pruned 95% of weights while maintaining 99% of the accuracy of the baseline models.
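To make the distinction concrete, here is a small sketch contrasting the two approaches on a convolutional weight tensor (the shapes and sparsity levels are illustrative):

```python
import torch

# Conv weight: (out_channels, in_channels, kernel_h, kernel_w)
weight = torch.randn(64, 64, 3, 3)

# Structured pruning: drop entire output channels, which shrinks the layer to a
# smaller dense layer that any runtime can exploit directly.
channel_scores = weight.abs().mean(dim=(1, 2, 3))
keep = channel_scores.argsort(descending=True)[:32]   # keep the 32 strongest channels
structured = weight[keep]                              # shape (32, 64, 3, 3)

# Unstructured pruning: zero individual weights anywhere in the tensor, leaving
# the shape intact but making the tensor highly sparse.
threshold = torch.quantile(weight.abs().flatten(), 0.95)
unstructured = weight * (weight.abs() > threshold)

print(structured.shape, (unstructured == 0).float().mean())  # 95% of entries are zero
```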
Quantization is a technique to reduce computation and memory usage by converting the parameters and activations of a model from a high precision format like FP32 (which is the default for a deep learning model) to a low precision format like INT8.
By using lower precision, runtimes can reduce memory footprint and perform operations like matrix multiply faster. Additionally, quantization can be combined with unstructured pruning to gain additional speedup, a concept we call Compound Sparsity.
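A minimal sketch of what INT8 quantization does to a tensor, using symmetric per-tensor scaling (the tensor values are illustrative):

```python
import torch

x = torch.randn(4, 4)  # FP32 weights or activations

# Symmetric per-tensor quantization: map the FP32 range onto int8 [-127, 127].
scale = x.abs().max() / 127.0
x_int8 = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)

# The runtime computes with the int8 values plus the scale;
# dequantizing recovers an approximation of the original tensor.
x_dequant = x_int8.float() * scale
print((x - x_dequant).abs().max())  # small quantization error
```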
Broadly, there are two ways that pruning and quantization can be applied to a model:
Post-Training pruning and quantization optimizations are easier to apply to a model. However, these techniques often create significant drops in accuracy, as the model does not have a chance to re-adjust to the optimization space.
Training-Aware pruning and quantization, by contrast, require setting up a training pipeline and implementing complex algorithms. However, applying the pruning and quantization gradually and fine-tuning the non-zero weights with training data enables accuracy to recover to 99% of the baseline dense model even as sparsity is pushed to very high levels.
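To illustrate the training-aware idea, here is a minimal sketch that keeps pruned weights at zero while the remaining weights are fine-tuned (the layer, data, and 90% sparsity are placeholders):

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 256)

# Mask produced by a pruning step: 1 for kept weights, 0 for pruned weights.
mask = (torch.rand_like(layer.weight) > 0.9).float()
layer.weight.data *= mask

# Zero the gradients of pruned weights so fine-tuning only updates the survivors.
layer.weight.register_hook(lambda grad: grad * mask)

optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)
x, y = torch.randn(32, 256), torch.randn(32, 256)
loss = nn.functional.mse_loss(layer(x), y)
loss.backward()
optimizer.step()  # pruned positions stay exactly zero

print((layer.weight == 0).float().mean())  # sparsity (~90%) is preserved
```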
SparseML uses Training-Aware Unstructured Pruning and Training-Aware Quantization to create very sparse models that sacrifice very little accuracy.
SparseML and SparseZoo extend PyTorch and TensorFlow with features for creating sparse models trained on custom data.
Together, they enable two workflows:
Sparse Transfer Learning is the easiest path to creating a sparse model trained on custom data and is preferred for any scenario where a pre-sparsified foundation model exists in SparseZoo.
Neural Magic's research team has invested many hours in creating state-of-the-art pruned and quantized versions of popular foundation models trained on large open datasets. These state-of-the-art models (including the hyperparameters of the sparsification process) are publicly available in the SparseZoo.
SparseML enables users to fine-tune the pre-sparsified models in SparseZoo onto custom data while maintaining the same level of sparsity (which we call "Sparse Transfer Learning"). Under the hood, SparseML extends PyTorch and TensorFlow to only update non-zero weights during backpropagation. Users can then Sparse Transfer Learn with just a single CLI command or five additional lines of code around a custom PyTorch training loop.
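The pattern looks roughly like the following sketch, built around SparseML's recipe-driven ScheduledModifierManager API. The model, data, and recipe path are placeholders, and exact import paths and arguments may vary between SparseML versions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sparseml.pytorch.optim import ScheduledModifierManager

# Stand-in model and dataset; in practice these are a pre-sparsified
# SparseZoo checkpoint and the user's custom data.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,))), batch_size=32
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# The "few extra lines": load the recipe and wrap the optimizer so SparseML's
# modifiers (pruning masks, quantization, schedules) run during training.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")  # placeholder recipe path
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

for epoch in range(5):  # the epoch count normally comes from the recipe
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

manager.finalize(model)  # remove SparseML hooks once training completes
```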
This means that any engineer (without deep knowledge of cutting-edge sparsity algorithms) can easily create accurate, inference-optimized sparse models for their specific use cases.
Sparsification From Scratch can be applied to any model, providing power users with a path to create their own sparse models.
As described in the conceptual section above, Training-Aware Unstructured Pruning and Training-Aware Quantization are the best techniques for creating models with the highest levels of sparsity without suffering from much accuracy degradation.
Gradual Magnitude Pruning (GMP) is the best algorithm for unstructured pruning. GMP removes the lowest-magnitude weights in small steps over many epochs, gradually ramping sparsity up to the target level while the remaining weights are fine-tuned between pruning steps.
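A commonly used GMP schedule interpolates sparsity cubically from an initial to a final level over the pruning phase; here is a minimal sketch (the epoch range and sparsity targets are illustrative):

```python
def gmp_sparsity(epoch, start_epoch=0, end_epoch=30, init_sparsity=0.05, final_sparsity=0.95):
    """Cubic sparsity schedule commonly used for Gradual Magnitude Pruning."""
    if epoch <= start_epoch:
        return init_sparsity
    if epoch >= end_epoch:
        return final_sparsity
    progress = (epoch - start_epoch) / (end_epoch - start_epoch)
    return final_sparsity + (init_sparsity - final_sparsity) * (1 - progress) ** 3

# Sparsity ramps quickly at first, then slows as it approaches the target,
# giving the remaining weights time to recover accuracy between pruning steps.
print([round(gmp_sparsity(e), 2) for e in (0, 5, 10, 20, 30)])
```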
Quantization-Aware Training (QAT) is the best algorithm for quantization. QAT emulates INT8 arithmetic during training by inserting fake-quantization operations into the network, so the weights learn to compensate for rounding and clamping error before the model is converted for deployment.
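Here is a minimal QAT sketch using PyTorch's built-in eager-mode quantization utilities (the tiny model, the single training step, and the "fbgemm" backend choice are illustrative; SparseML drives an equivalent flow from its recipes):

```python
import torch
import torch.nn as nn

# A small float model wrapped with quant/dequant stubs, as eager-mode QAT expects.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # marks where FP32 -> INT8
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()  # marks where INT8 -> FP32

    def forward(self, x):
        return self.dequant(self.fc2(self.relu(self.fc1(self.quant(x)))))

model = TinyNet().train()

# Insert fake-quantization ops so training "sees" the INT8 rounding error.
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

# ... fine-tune here so the weights adapt to the quantized arithmetic ...
loss = model(torch.randn(8, 128)).sum()
loss.backward()

# Convert to a real INT8 model for deployment.
model.eval()
quantized = torch.quantization.convert(model)
print(quantized(torch.randn(8, 128)).shape)
```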
Applying these algorithms correctly in an ad-hoc way is challenging. As such, Neural Magic created SparseML, which implements these algorithms on top of PyTorch and TensorFlow.
Using SparseML, users can apply these algorithms to their trained PyTorch and TensorFlow models with just five additional lines of code around a training loop. This enables ML Engineers to shift focus and time from (re)building sparsity algorithms to running experiments and tuning hyperparameters of the pruning and quantization process.
Ultimately, creating a sparse model from scratch is a form of architecture search. This is an inherently "research-like" exercise, which requires tuning the hyperparameters of GMP and QAT and running experiments to test accuracy with various changes to the model. SparseML dramatically increases the productivity of developers running these experiments.