DeepSparse Enterprise

An inference runtime offering GPU-class performance on CPUs, with APIs to integrate ML into your application.

DeepSparse is a CPU inference runtime that takes advantage of sparsity within neural networks to execute inference quickly. Coupled with SparseML, an open-source optimization library, DeepSparse enables you to achieve GPU-class performance on commodity hardware.


DeepSparse is available in two editions:

  1. DeepSparse Community is free for evaluation, research, and non-production use with our DeepSparse Community License.
  2. DeepSparse Enterprise requires a trial license or can be fully licensed for production, commercial applications.

Install via PyPI

Install DeepSparse Enterprise with pip. We recommend using a virtual environment.

pip install deepsparse-ent

See the DeepSparse Enterprise Installation page for further installation options.

Getting a License

DeepSparse Enterprise requires a valid license to run the engine and can be licensed for production, commercial applications. There are two options available:

90-Day Enterprise Trial License

To try out DeepSparse Enterprise, complete our registration form to get a Neural Magic Trial License. Upon submission, the license will be emailed to you, and your 90-day term begins immediately.

DeepSparse Enterprise License

To learn more about DeepSparse Enterprise pricing, contact our Sales team to discuss your use case further for a custom quote.

Installing a License

Once you have obtained a license, you will need to initialize it to be able to run DeepSparse Enterprise. You can initialize your license by running:

deepsparse.license <license_string> or <path/to/license.txt>

To initialize a license on a machine:

  1. Confirm you have deepsparse-ent installed in a fresh virtual environment.
    • Installing deepsparse and deepsparse-ent on the same virtual environment may result in unsupported behaviors.
  2. Run deepsparse.license with the <license_string> or path/to/license.txt as an argument:
    • deepsparse.license <samplelicensetring>
    • deepsparse.license ./license.txt
  3. If successful, deepsparse.license will write the license file to ~/.config/neuralmagic/license.txt. You may override this path by setting the NM_CONFIG_DIR environment variable (both when initializing the license and when later running DeepSparse) with the following command:
    • export NM_CONFIG_DIR=path/to/license.txt
  4. Once the license is authenticated, you should see a splash message indicating that you are now running DeepSparse Enterprise.
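As a sketch of where step 3 lands the file, the lookup path can be computed like this. The helper below is illustrative only, not part of the DeepSparse API; it assumes the default ~/.config/neuralmagic location and the NM_CONFIG_DIR override described above.

```python
import os

# Illustrative helper (not part of the DeepSparse API): resolve where
# deepsparse.license is expected to have written the license file,
# honoring the NM_CONFIG_DIR override described above.
def resolve_license_path():
    override = os.environ.get("NM_CONFIG_DIR")
    if override:
        # In the example above, NM_CONFIG_DIR points directly at the license file
        return override
    # Default location named in step 3
    return os.path.join(os.path.expanduser("~"), ".config", "neuralmagic", "license.txt")

print(resolve_license_path())
```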

If you encounter issues initializing your DeepSparse Enterprise License, contact [email protected] for help.

Validating a License

Once you have initialized your license, you may want to check that it is still valid before running a workload on DeepSparse Enterprise. To confirm your license is still active, run the command:

deepsparse.validate_license

deepsparse.validate_license can be run with no arguments, in which case it checks the default license location (or the one set via the NM_CONFIG_DIR environment variable), or with a single argument referencing the license file, passed as path/to/license.txt.

To validate a license on a machine:

  1. If you have successfully run deepsparse.license, deepsparse.validate_license can be used to validate that the license file is in the correct location:
    • Run the deepsparse.validate_license with no arguments. If the referenced license is valid, the DeepSparse Enterprise splash screen should display in your terminal window.
    • If the NM_CONFIG_DIR environment variable was set when creating the license, ensure this variable is still set to the same value.
  2. If you want to supply the path/to/license.txt:
    • Run deepsparse.validate_license with path/to/license.txt as an argument as: deepsparse.validate_license --license_path path/to/license.txt
    • If the referenced license is valid, the DeepSparse Enterprise splash screen should display in your terminal window.

Deployment APIs

DeepSparse includes three deployment APIs:

  • Engine is the lowest-level API. With Engine, you pass tensors and receive the raw logits.
  • Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
  • Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.

DeepSparse Engine

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input.

from deepsparse import Engine
from deepsparse.utils import generate_random_inputs, model_to_path

# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
batch_size = 1
compiled_model = Engine(model=zoo_stub, batch_size=batch_size)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = generate_random_inputs(model_to_path(zoo_stub), batch_size)
output = compiled_model(inputs)
print(output)

# > [array([[-0.3380675 ,  0.09602544]], dtype=float32)] << raw scores
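Because Engine returns raw logits, you normalize them yourself if you want class probabilities (Pipeline, described next, does this for you). A minimal sketch, using the scores from the example output comment above (hard-coded here, not a live inference result):

```python
import math

# Raw logits copied from the Engine example's output comment above.
logits = [-0.3380675, 0.09602544]  # [negative, positive]

# Softmax: exponentiate and normalize so the scores sum to 1.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
print(probs)  # the second (positive) class gets the higher probability
```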

DeepSparse Pipelines

Pipeline is the default API for interacting with DeepSparse. Similar to Hugging Face Pipelines, DeepSparse Pipelines wrap Engine with pre- and post-processing (as well as other utilities), enabling you to send raw data to DeepSparse and receive the post-processed prediction.

The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.

from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
sentiment_analysis_pipeline = Pipeline.create(
    task="sentiment-analysis",  # name of the task
    model_path=zoo_stub,        # zoo stub or path to local onnx file
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)

# > labels=['positive'] scores=[0.9954759478569031]


DeepSparse Server

Server wraps Pipelines with a REST API using FastAPI, enabling you to stand up a model serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed prediction.

DeepSparse Server is launched from the command line, configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:

deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

Sending a request:

import requests

url = "http://localhost:5543/predict"  # Server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response =, json=obj)
print(response.text)

# {"labels":["positive"],"scores":[0.9965094327926636]}



ONNX

DeepSparse accepts models in the ONNX format. ONNX models can be passed in one of two ways:

  • SparseZoo Stub: SparseZoo is an open-source repository of sparse models. The examples on this page use SparseZoo stubs to identify models and download them for deployment in DeepSparse.

  • Local ONNX File: Users can provide their own ONNX models, whether dense or sparse. For example:

from deepsparse import Engine
from deepsparse.utils import generate_random_inputs

onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
compiled_model = Engine(model=onnx_filepath, batch_size=batch_size)
outputs = compiled_model(inputs)
print(outputs[0].shape)

# (16, 1000) << batch, num_classes

Inference Modes

DeepSparse offers different inference scenarios based on your use case.

Single-stream scheduling: the latency/synchronous scenario; requests execute serially. [default]


It is highly optimized for minimum per-request latency, using all of the system resources available to it for every request it receives.

Multi-stream scheduling: the throughput/asynchronous scenario; requests execute in parallel.


The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them.

Product Usage Analytics

DeepSparse Community Edition gathers basic usage telemetry including, but not limited to, Invocations, Package, Version, and IP Address for Product Usage Analytics purposes. Review Neural Magic's Products Privacy Policy for further details on how we process this data.

To disable Product Usage Analytics, run the command:

export NM_DISABLE_ANALYTICS=True
Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check." For additional assistance, reach out through the DeepSparse GitHub Issue queue.

Be Part of the Future... And the Future is Sparse!

Contribute with code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.

For user help or questions about DeepSparse, sign up or log in to our Neural Magic Community Slack. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue. You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by subscribing to the Neural Magic community.

For more general questions about Neural Magic, complete this form.


DeepSparse Community is licensed under the Neural Magic DeepSparse Community License. Some source code, example files, and scripts included in the deepsparse GitHub repository or directory are licensed under the Apache License Version 2.0 as noted.

DeepSparse Enterprise requires a Trial License or can be fully licensed for production, commercial applications.


Find this project useful in your research or other communications? Please consider citing:

@inproceedings{pmlr-v119-kurtz20a,
  title     = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
  author    = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {5533--5543},
  year      = {2020},
  editor    = {Hal Daumé III and Aarti Singh},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  address   = {Virtual},
  month     = {13--18 Jul},
  publisher = {PMLR},
}

@article{DBLP:journals/corr/abs-2111-13445,
  author     = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
  title      = {How Well Do Sparse Imagenet Models Transfer?},
  journal    = {CoRR},
  volume     = {abs/2111.13445},
  year       = {2021},
  eprinttype = {arXiv},
  eprint     = {2111.13445},
  timestamp  = {Wed, 01 Dec 2021 15:16:43 +0100},
  bibsource  = {dblp computer science bibliography},
}