What is Neural Magic?
Neural Magic was founded by a team of award-winning MIT computer scientists and is funded by Amdocs, Andreessen Horowitz, Comcast Ventures, NEA, Pillar VC, Ridgeline Partners, Verizon Ventures, and VMWare. The Neural Magic Platform includes several components, including DeepSparse, SparseML, and SparseZoo. DeepSparse is an inference runtime offering GPU-class performance on CPUs and tooling to integrate ML into your application. SparseML and SparseZoo, are and open-source tooling and model repository combination that enable you to create an inference-optimized sparse-model for deployment with DeepSparse.
Together, these components remove the tradeoff between performance and the simplicity and scalability of software-delivered deployments.
What is DeepSparse?
DeepSparse, created by Neural Magic, is an inference runtime for deep learning models. It delivers state of art, GPU-class performance on commodity CPUs as well as tooling for integrating a model into an application and monitoring models in production.
Why Neural Magic?
Learn more about Neural Magic and DeepSparse (formerly known as the Neural Magic Inference Engine). Watch the Why Neural Magic video
How does Neural Magic make it work?
This is an older webinar (50m) where we went through the process of optimizing and deploying a model; we’ve enhanced our software since the recording went out but this will give you some background: Watch the How does it Work video
Does Neural Magic support training of learning models on CPUs?
Neural Magic does not support training of deep learning models at this time. We do see value in providing a consistent CPU environment for our end users to train and infer on for their deep learning needs, and have added this to our engineering backlog.
Do you have version compatibility on TensorFlow?
Our inference engine supports all versions of TensorFlow <= 2.0; support for the Keras API is through TensorFlow 2.0.
Do you run on AMD hardware?
DeepSparse is validated to work on x86 Intel (Haswell generation and later) and AMD CPUs running Linux, with support for AVX2, AVX-512, and VNNI instruction sets. Specific support details for some algorithms over different microarchitectures is available.
We are open to opportunities to expand our support footprint for different CPU-based processor architectures, based on market adoption and deep learning use cases.
Do you run on ARM architecture?
We are actively working on ARM support and it’s slated for general availability in 2023. We would like to hear your use cases and keep you in the loop! Contact us to continue the conversation.
To what use cases is the Neural Magic Platform best suited?
We focus on the models and use cases related to computer vision and NLP due to cost sensitivity and both real time and throughput constraints. The belief now is GPUs are required for deployment.
What types of models does Neural Magic support?
Today, we offer support for CNN-based computer vision models, specifically classification and object detection model types. NLP models like BERT are also available. We are continuously adding models to our supported model list and SparseZoo. Additionally, we are investigating model architectures beyond computer vision.
Is dynamic shape supported?
Dynamic shape is currently not supported; be sure to use models with fixed inputs and compile the model for a particular batch size. Dynamic shape and dynamic batch sizes are on the Neural Magic roadmap; subscribe for updates.
Can multiple model inferences be executed?
Model inferences are executed as a single stream by default; concurrent execution can be enabled depending on the engine execution strategy.
Do you have benchmarks to compare and contrast?
Yes. Check out our benchmark demo video or contact us to discuss your particular performance requirements. If you’d rather observe performance for yourself, head over to the Neural Magic GitHub repo to check out our tools and generate your own benchmarks in your environment.
Do you publish ML Perf inference benchmarks?
Checkout ZDNet's coverage of our results at ML Perf!
Which instruction sets are supported and do we have to enable certain settings?
AVX2, AVX-512, and VNNI. DeepSparse will automatically utilize the most effective available instructions for the task. Depending on your goals and hardware priorities, optimal performance can be found. Neural Magic is happy to discuss your use cases and offer recommendations.
Are you suitable for edge deployments (i.e., in-store devices, cameras)?
Yes, absolutely. We can run anywhere you have a CPU with x86 instructions, including on bare metal, in the cloud, on-prem, or at the edge. Additionally, our model optimization tools are able to reduce the footprint of models across all architectures. We only guarantee performance in DeepSparse.
We’d love to hear from users highly interested in ML performance. If you want to chat about your use cases or how others are leveraging the Neural Magic Platform, please contact us. Or simply head over to the Neural Magic GitHub repo and check out our tools.
Do you have available solutions or applications on the Microsoft/Azure platform?
We deploy extremely easily. We are completely infrastructure-agnostic. As long as it has the “right” CPUs (e.g., AVX2 or AVX-512) we can run on any cloud platform, including Azure!
Can the inference engine run on Kubernetes? How do you containerize and take advantage of underlying infrastructure?
DeepSparse becomes a component of your model serving solution. As a result, it can simply plug into an existing CI/CD deployment pipeline. How you deploy, where you deploy, and what you deploy on becomes abstracted to DeepSparse so you can tailor your experiences. For example, you can run the DeepSparse on a CPU VM environment, deployed via a Docker file and managed through a Kubernetes environment.
Can you comment on how you do pruning and effects on accuracy?
Neural networks are extremely over-parameterized, allowing most weights to be iteratively removed from the network without effect on accuracy. Eventually, though, pruning will begin affecting the overall capacity of the network, the degree of which varies based on the use case. However, this is something entirely under the data scientist control to choose whether to recover fully or to prune more for even better performance.
For example, Neural Magic has been successful in removing 95% of ResNet-50 weights with no loss in accuracy. For more background on techniques that have informed our methodologies, check out this paper co-written by Neural Magic, WoodFisher: Efficient Second-Order Approximation for Neural Network Compression.
When does sparsification actually happen?
In a scenario in which you want to sparsify and then run your own model with DeepSparse, you would first sparsify your model to achieve the desired level of performance and accuracy using Neural Magic’s SparseML tooling.
What does the sparsification process look like?
Neural Magic’s Sparsify and SparseML tooling, at its core, uses well-established state-of-the-art research principles such as Gradual Magnitude Pruning (GMP) to sparsify models. This is an iterative process in which groups of important weights are pruned away and then the network is allowed to recover. To significantly simplify the process, we offer tools and guidance for you to achieve the best performance possible. To peruse research papers contributed by Neural Magic staff, check them out. Or head over to the Neural Magic GitHub repo to get started!
How does sparsification work in relation to TensorFlow?
Today, we are able to sparsify models trained in popular deep learning libraries like TensorFlow. Our unique approach works with the output supplied by the model library and provides layer sparsification techniques that then can be compiled in the existing library framework, within the user environment.
When using your software to transfer learn, what about other hyperparameters? Are you just freezing other layers?
For transfer learning, our tooling allows you to save the sparse architecture learned from larger datasets. Other hyperparameters are fully under your control and allow you the flexibility to easily freeze layers as well.
Do you support INT8 and INT16 (quantized) operations?
DeepSparse runs at FP32 and has support for INT8. With Intel Cascade Lake generation chips and later, Intel CPUs include VNNI instructions and support both INT8 and INT16 operations. On these machines, performance improvements from quantization will be greater. DeepSparse has INT8 support for the ONNX operators QLinearConv, QuantizeLinear, DequantizeLinear, QLinearMatMul, and MatMulInteger. Our engine also supports 8-bit QLinearAdd, an ONNX Runtime custom operator.
Do you support FP16 (half precision) and BF16 operations?
Neural Magic is looking to include both FP16 and BF16 on our roadmap in the near future.
Do users have to do any model conversion before using DeepSparse?
DeepSparse executes on an ONNX (Open Neural Network Exchange) representation of a deep learning model. Our software allows you to produce an ONNX representation. If working with PyTorch, we use the built-in ONNX export and for TensorFlow, we convert from a standard exported protobuf file to ONNX. Outside of those frameworks, you would need to convert your model to ONNX first before passing it to DeepSparse.
Why is ONNX the file format used by Neural Magic?
ONNX (Open Neural Network Exchange) is emerging as a standard, open-source format for model representation. Based on the breadth of vendors supporting ONNX as well as the health of open-source community contributions, we believe ONNX offers a compelling solution for the market.
Are your users using ONNX runtime already?
End users are using a wide variety of runtimes, both open source and proprietary. Neural Magic is focused on ensuring we are open and flexible, to allow our users to achieve deep learning performance regardless of how they choose to build, deploy, and run their models.
What is the accuracy loss, if any, on the numbers Neural Magic demonstrates?
Results will depend on your use case and specific requirements. We are capable of maintaining 100% baseline accuracy. In cases where accuracy is not as important as performance, you can use our model optimization tools to further speed up the model at the expense of accuracy and weigh the tradeoffs.
If you need sparsification, we provide the tooling for tradeoffs between accuracy and performance based on your specific requirements.
For the runtime engine, is Neural Magic modifying the architecture in any way or just optimizing the instruction set at that level?
Specifically for sparsification, our software keeps the architecture intact and changes the weights. For running dense, we do not change anything about the model.
For a CPU are you using all the cores?
DeepSparse optimizes how the model is run on the infrastructure resources applied to it. But, Neural Magic does not optimize for the number of cores. You are in control to specify how much of the system Neural Magic will use and run on. Depending on your goals (latency, throughput, and cost constraints), you can optimize your pipeline for maximum efficiency.