The Neural Magic Platform enables you to deploy models performantly on CPUs.
Because DeepSparse reaches GPU-class performance with commodity CPUs, you no longer need to tie deployments to accelerators to reach the performance needed for production.
Free from specialized hardware, deployments can take advantage of the flexibility and scalability of software-defined inference:
Simply put, with DeepSparse on CPUs, you can both simplify your deep learning deployment process and save on infrastructure costs required to support enterprise-scale workloads.
DeepSparse achieves its performance using breakthrough algorithms to accelerate the computation. There are two high level ideas that underpin the system:
First, DeepSparse is "sparsity-aware". This means we have implementations of common neural network operations that take advantage of structured and unstructured sparsity. Because the locations of the 0 weights in a sparse model are known at compile time, DeepSparse can "skip" the multiply-adds by 0. This reduces the number of instructions significantly and the computation becomes memory-bound.
Second, DeepSparse takes advantage of the large caches in CPUs. DeepSparse identifies and breaks down the computational graph into into depth-wise chunks called "tensor-columns" that can be executed in parallel across many CPU-cores. This pattern has much better locality of reference in comparison to traditional layer-by-layer execution. In this way, DeepSparse minimizes data movement in-and-out of the large caches in a CPU, which is the performance bottleneck in a memory-bound system.
These two ideas sum up to GPU-class performance on commodity CPUs! As far as we know, DeepSparse is the only production-grade runtime that focuses on speedups from unstructured sparsity. The unstructured sparsity optimizations are hard to implement but are an important unlock, because, as discussed before, unstructured pruning allows us to reach the high levels of sparsity needed to see the performance gains without sacrificing accuracy.
Beyond all GPU-class performance and benefits of the scalability of CPU-only deployments, DeepSparse also wraps the runtime with APIs and utilites that simplify the process of adding inference to an application and monitoring a model in production.
DeepSparse Pipelines are Python APIs which wrap the runtime with prewritten pre-processing and post-processing utilities that make it easy to call the invoked model from within an application. For NLP, this means you can pass strings to DeepSparse and receive back predictions. For Object Detection, this means you pass a raw image to DeepSparse and get back bounding boxes after NMS has been applied.
DeepSparse supports the following use cases out of the box:
We are continually adding more use cases. Additionally, DeepSparse offers a CustomTaskPipeline which allows users to add custom pre-processing and custom post-processing for unsupported use cases in a consistent way.
Want a new use case? Reach out in our Community Slack.
Built on FastAPI and uvicorn, DeepSparse Server is a wrapper around DeepSparse Pipelines that enable you to invoke inference via REST APIs. This means that you can create a model-serving endpoint running DeepSparse in the cloud and datacenter with just a single command line call. Additionally, because DeepSparse Server is CPU-only, a model dervice with DeepSparse can be easily scaled up and down elastically with Kubernetes, can run on Serverless services like Lambda and Cloud Run, and is intergrated with managed service endpoints like SageMaker and Hugging Face Endpoints.
DeepSparse has multiple modes that allow you to tune a deployment. Examples include:
DeepSparse has utilities that make it easy to handle dynamic inputs. Examples include:
DeepSparse has capabilities to support MLOps related monitoring. Examples include:
All this means that DeepSparse is not only fast and CPU-only, but also easy to add to your application. With DeepSparse, you can spend less time writing scaffolding-code and focus more on building a great system.