DeepSparse 0.3

Neural network inference engine that delivers GPU-class performance for sparsified models on CPUs

Overview

The DeepSparse Engine is a CPU runtime that delivers GPU-class performance by taking advantage of sparsity within neural networks to reduce the compute required and to accelerate memory-bound workloads. It is focused on model deployment and on scaling machine learning pipelines, fitting seamlessly into your existing deployments as an inference backend.

This repository includes package APIs along with examples to quickly get started benchmarking and running inference on sparse models.
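A minimal sketch of what that can look like, assuming a local ONNX file and the deepsparse Python package's compile_model helper (check the package documentation for the exact signatures in your version):

```python
import time

import numpy as np
from deepsparse import compile_model

# Path to a (preferably sparsified) ONNX model; adjust to your own file.
onnx_filepath = "model.onnx"
batch_size = 16

# Compile the model into a DeepSparse engine for the target batch size.
engine = compile_model(onnx_filepath, batch_size=batch_size)

# Random input matching the model's expected shape (example: 224x224 RGB images).
inputs = [np.random.rand(batch_size, 3, 224, 224).astype(np.float32)]

# Warm up once, then time a handful of forward passes.
engine.run(inputs)
start = time.perf_counter()
for _ in range(10):
    outputs = engine.run(inputs)
elapsed = time.perf_counter() - start
print(f"~{10 * batch_size / elapsed:.1f} items/sec")
```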

Sparsification

Sparsification is the process of taking a trained deep learning model and removing redundant information from the over-precise and over-parameterized network, resulting in a faster and smaller model. Sparsification techniques are all-encompassing, ranging from inducing sparsity through pruning and quantization to exploiting naturally occurring sparsity via activation sparsity or Winograd/FFT convolutions. When implemented correctly, these techniques result in significantly smaller and more performant models with limited to no effect on the baseline metrics. For example, pruning plus quantization can give noticeable improvements in performance while recovering to nearly the same baseline accuracy.
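As a toy illustration of the idea behind pruning (not the SparseML API), unstructured magnitude pruning simply zeroes out the smallest weights in a layer, leaving a sparse tensor the engine can exploit:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

weights = np.random.randn(512, 512).astype(np.float32)
pruned = magnitude_prune(weights, sparsity=0.9)
print(f"zeroed fraction: {(pruned == 0).mean():.2f}")  # ~0.90
```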

The Deep Sparse product suite builds on top of sparsification, enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches. Recipes encode the directions for how to sparsify a model into a simple, easily editable format.

- Download a sparsification recipe and sparsified model from the SparseZoo.
- Alternatively, create a recipe for your model using Sparsify.
- Apply your recipe with only a few lines of code using SparseML (see the sketch after this list).
- Finally, for GPU-level performance on CPUs, deploy your sparse-quantized model with the DeepSparse Engine.
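A rough sketch of the recipe-application step, assuming SparseML's ScheduledModifierManager for PyTorch; the toy model, data, and "recipe.yaml" path are stand-ins, and names or signatures may differ across SparseML versions, so consult the SparseML documentation for your release:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sparseml.pytorch.optim import ScheduledModifierManager

# Toy stand-ins; swap in your real model, data loader, and recipe.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,))),
    batch_size=32,
)

# Wrap the optimizer with the recipe so pruning/quantization modifiers
# run on their scheduled epochs ("recipe.yaml" comes from the SparseZoo or Sparsify).
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

for epoch in range(int(manager.max_epochs)):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        optimizer.step()

manager.finalize(model)
```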

Full Deep Sparse product flow:

Compatibility

The DeepSparse Engine ingests models in the ONNX format, allowing for compatibility with PyTorch, TensorFlow, Keras, and many other frameworks that support it. This reduces the work of preparing a trained model for inference to a single export step.
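For example, a PyTorch model can be exported with torch.onnx.export and the resulting file handed straight to the engine; this is a sketch assuming a standard torchvision model and the compile_model helper, and input names and shapes depend on your own model:

```python
import torch
import torchvision
from deepsparse import compile_model

# Export a trained PyTorch model (ideally a sparsified one) to ONNX.
model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["output"],
)

# The exported ONNX file is all the DeepSparse Engine needs to run inference.
engine = compile_model("resnet50.onnx", batch_size=1)
outputs = engine.run([dummy_input.numpy()])
```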