Deploy on CPUs

The Neural Magic Platform enables you to deploy models performantly on CPUs.

Benefits of CPU Deployments

Because DeepSparse reaches GPU-class performance with commodity CPUs, you no longer need to tie deployments to accelerators to reach the performance needed for production.

Free from specialized hardware, deployments can take advantage of the flexibility and scalability of software-defined inference:

  • Deploy the same model and runtime on any hardware from Intel to AMD to ARM and from cloud to data center to edge, including on pre-existing systems
  • Scale vertically from 1 to 192 cores, tailoring the footprint to an app's exact needs
  • Scale horizontally with standard Kubernetes, including using services like EKS/GKE
  • Scale abstractly with serverless instances like GCP Cloud Run and AWS Lambda
  • Integrate easily into "Deploy with code" provisioning systems
  • Avoid wrestling with drivers, operator support, and compatibility issues

Simply put, with DeepSparse on CPUs, you can both simplify your deep learning deployment process and save on the infrastructure costs required to support enterprise-scale workloads.

How DeepSparse Works

DeepSparse achieves its performance using breakthrough algorithms to accelerate the computation. Two high-level ideas underpin the system:

  • First, DeepSparse is "sparsity-aware". This means we have implementations of common neural network operations that take advantage of structured and unstructured sparsity. Because the locations of the zero weights in a sparse model are known at compile time, DeepSparse can "skip" the multiply-adds involving those zeros. This significantly reduces the number of instructions, and the computation becomes memory-bound.

  • Second, DeepSparse takes advantage of the large caches in CPUs. DeepSparse identifies and breaks down the computational graph into depth-wise chunks called "tensor columns" that can be executed in parallel across many CPU cores. This pattern has much better locality of reference than traditional layer-by-layer execution. In this way, DeepSparse minimizes data movement in and out of a CPU's large caches, which is the performance bottleneck in a memory-bound system.

Together, these two ideas deliver GPU-class performance on commodity CPUs! As far as we know, DeepSparse is the only production-grade runtime focused on speedups from unstructured sparsity. The unstructured sparsity optimizations are hard to implement, but they are an important unlock because, as discussed before, unstructured pruning allows us to reach the high levels of sparsity needed for performance gains without sacrificing accuracy.
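
To make the first idea concrete, here is a toy, hedged illustration (not DeepSparse's actual kernel code) of why known zero weights let a runtime skip multiply-adds: only the nonzero positions contribute to a dot product, so 90% sparsity removes roughly 90% of the work. Real kernels operate on compressed weight formats with vectorized instructions rather than Python loops.

    import numpy as np

    # Toy illustration only: skip multiply-adds for known-zero weights.
    def dense_dot(weights, activations):
        return sum(w * a for w, a in zip(weights, activations))

    def sparsity_aware_dot(weights, activations):
        # Only nonzero weight positions contribute to the output.
        nonzero_idx = np.flatnonzero(weights)
        return sum(weights[i] * activations[i] for i in nonzero_idx)

    weights = np.array([0.0, 0.5, 0.0, 0.0, -1.2, 0.0, 0.0, 0.0])  # 75% sparse
    activations = np.random.rand(8)
    assert np.isclose(dense_dot(weights, activations),
                      sparsity_aware_dot(weights, activations))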

DeepSparse

Beyond GPU-class performance and the scalability benefits of CPU-only deployments, DeepSparse also wraps the runtime with APIs and utilities that simplify the process of adding inference to an application and monitoring a model in production.

DeepSparse Pipelines

DeepSparse Pipelines are Python APIs that wrap the runtime with prewritten pre-processing and post-processing utilities, making it easy to invoke a model from within an application. For NLP, this means you can pass strings to DeepSparse and receive predictions back. For Object Detection, this means you can pass a raw image to DeepSparse and get back bounding boxes after NMS has been applied.
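
As a minimal sketch, a sentiment-analysis Pipeline looks like the following. With no model_path given, DeepSparse pulls a default sparse model from the SparseZoo (exact defaults can vary by version); you can also point model_path at a SparseZoo stub or a local ONNX file.

    from deepsparse import Pipeline

    # Create the pipeline; pre- and post-processing are handled internally.
    pipeline = Pipeline.create(task="sentiment-analysis")

    # Raw strings in, predictions out.
    print(pipeline(sequences=["DeepSparse made this deployment much simpler."]))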

DeepSparse supports the following use cases out of the box:

  • CV: Image Classification
  • CV: Object Detection
  • CV: Segmentation
  • NLP: Sentiment Analysis
  • NLP: Text Classification
  • NLP: Token Classification
  • NLP: Document Classification
  • NLP: Extractive Question Answering
  • NLP: Haystack Information Retrieval
  • Embedding Extraction

We are continually adding more use cases. Additionally, DeepSparse offers a CustomTaskPipeline, which allows users to add custom pre-processing and post-processing for unsupported use cases in a consistent way.
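
As a rough sketch of the custom route (the import path and argument names are as we recall them and may differ across versions; model.onnx is a placeholder):

    import numpy as np
    from deepsparse.pipelines.custom_pipeline import CustomTaskPipeline

    def my_preprocess(inputs):
        # Turn application-level inputs into the list of numpy arrays
        # expected by the ONNX model (placeholder logic).
        return [np.asarray(inputs, dtype=np.float32)]

    def my_postprocess(outputs):
        # Turn raw engine outputs back into application-level objects.
        return outputs[0]

    custom_pipeline = CustomTaskPipeline(
        model_path="model.onnx",          # placeholder ONNX file
        process_inputs_fn=my_preprocess,  # argument names assumed
        process_outputs_fn=my_postprocess,
    )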

Want a new use case? Reach out in our Community Slack.

DeepSparse Server

Built on FastAPI and Uvicorn, DeepSparse Server is a wrapper around DeepSparse Pipelines that enables you to invoke inference via REST APIs. This means you can create a model-serving endpoint running DeepSparse in the cloud or data center with a single command-line call. Additionally, because DeepSparse Server is CPU-only, a model served with DeepSparse can be scaled up and down elastically with Kubernetes, can run on serverless services like Lambda and Cloud Run, and integrates with managed endpoints like SageMaker and Hugging Face Endpoints.
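
A hedged sketch of the flow: start the Server from the command line, then call it over HTTP from any client. The CLI flags, port 5543, and the /predict route shown below are defaults as we recall them and may differ in your version.

    # In a separate shell (illustrative; see deepsparse.server --help):
    #   deepsparse.server --task sentiment_analysis --model_path <sparsezoo-stub-or-onnx>
    import requests

    response = requests.post(
        "http://localhost:5543/predict",  # assumed default port and route
        json={"sequences": ["The inference server was easy to stand up."]},
    )
    print(response.json())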

Additional Features

DeepSparse has multiple modes that allow you to tune a deployment. Examples include:

  • Synchronous Scheduling: minimize latency by using all cores on a single input
  • Asynchronous Scheduling: control the number of streams that can be executed simultaneously for handling multiple clients
  • Benchmarking: tools to compare performance and tune configurations
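
As a sketch of choosing a scheduler when compiling a model with the engine API (the scheduler string and the benchmark flags below are assumptions from memory; check the engine docs and deepsparse.benchmark --help for your version):

    from deepsparse import compile_model

    # Multi-stream ("async") scheduling partitions cores across concurrent
    # requests; single-stream ("sync") uses all cores for one request.
    engine = compile_model(
        "model.onnx",              # placeholder ONNX path or SparseZoo stub
        batch_size=1,
        scheduler="multi_stream",  # value assumed; consult the docs
    )

    # Benchmarking from the CLI (illustrative):
    #   deepsparse.benchmark model.onnx --batch_size 1 --scenario async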

DeepSparse has utilities that make it easy to handle dynamic inputs. Examples include:

  • Dynamic Batch: handle various batch sizes without needing to recompile the model
  • Bucketing: handle NLP sequences of variable length without padding to max_seq_len

DeepSparse has capabilities to support MLOps related monitoring. Examples include:

  • System Logging: monitor granular request latency data with Prometheus and Grafana
  • Data Logging: log input and output data (and projections thereof) for use with data drift detection or retraining

All this means that DeepSparse is not only fast on commodity CPUs but also easy to add to your application. With DeepSparse, you can spend less time writing scaffolding code and more time building a great system.

We love to hear feature requests in our Community Slack!

Additional Resources

Optimize for Inference
Quick Tour