Neural Magic LogoNeural Magic Logo
Products
menu-icon
Products
DeepSparse EngineSparseMLSparseZoo
Home

Neural Magic Platform Documentation

Neural Magic enables you to deploy deep learning models on commodity CPUs with GPU-class performance.

Why Deploy on CPUs?

CPU-based deep learning deployments on commodity hardware are flexible and scalable.

Because DeepSparse reaches GPU-class performance with commodity CPUs, users no longer need to tether deployments to accelerators to reach the performance needed for production. Free from specialized hardware, deployments can take advantage of the flexibility and scalability of software-defined inference:

  • Deploy the same model and runtime on any hardware from Intel to AMD to ARM and from cloud to data center to edge, including on pre-existing systems
  • Scale vertically from 1 to 192 cores, tailoring the footprint to an app's exact needs
  • Scale horizontally with standard Kubernetes, including using services like EKS/GKE
  • Scale abstractly with serverless instances like GCP Cloud Run and AWS Lambda
  • Integrate easily into "Deploy with code" provisioning systems
  • No wrestling with drivers, operator support, and compatibility issues

Simply put, deep learning deployments no longer need to choose between the performance of GPUs and simplicty of software!

Neural Magic Platform

The Neural Magic Platform enables two major workflows.

1. Optimize a Model for Inference

SparseML and SparseZoo work together to optimize models for inference with techniques like pruning and quantization (which we call "sparsity").

  • SparseML is an open-source library that extends PyTorch and TensorFlow to simplify the process of applying sparsity algorithms. Via simple CLI scripts or five lines of code, users can sparsify any model from scratch or sparse transfer learn from pre-sparsified versions of foundation models like ResNet, YOLOv5, or BERT.

  • SparseZoo is an open-source repository of pre-sparsified models (for example, sparse ResNet-50 has 95% of weights set to 0 while maintaining 99% of the baseline accuracy). SparseZoo is integrated with SparseML, making it trival for users to fine-tune from sparse model (which we call "Sparse Transfer Learning") onto their data.

2. Deploy a Model on CPUs

DeepSparse runs inference-optimized sparse models with GPU-class performance on CPUs.

  • DeepSparse is an inference runtime offering GPU class performance on CPUs and APIs for integrating ML into an application. When running an inference-optimized sparse model, DeepSparse on commodity CPUs achieves better latency than a NVIDIA T4 (the most common GPU for inference) and an order of magnitude more throughput than ONNX Runtime. As a result, it offers the best price-performance for deep learning deployments.

Docs Content

The documentation is organized into several sections:

  • GET STARTED provides install instructions and a tour of major functionality
  • USE CASES walks through detailed examples using SparseML and DeepSparse
  • USER GUIDE shows more advanced functionality with specific tasks
  • PRODUCTS provides API-level docs for all classes and functions
  • DETAILS includes research papers and a glossary of terms

Not Sure Where to Start?

External Resources

✅ Join our community if you need any help or subscribe for regular email updates.

✅ Check out our GitHub repositories and give us a ⭐.

✅ Help us improve this documentation.

Optimize for Inference