DeepSparse Enterprise Edition

Sparsity-aware neural network inference engine for GPU-class performance on CPUs

A CPU runtime that takes advantage of sparsity within neural networks to reduce compute. Read more about sparsification.

Neural Magic's DeepSparse Engine integrates into popular deep learning libraries (e.g., Hugging Face, Ultralytics), allowing you to use DeepSparse to load and deploy sparse models with ONNX. ONNX gives you the flexibility to serve your model in a framework-agnostic environment. Support includes PyTorch, TensorFlow, Keras, and many other frameworks.

The DeepSparse Engine is available in two editions:

  1. The Community Edition is open-source and free for evaluation, research, and non-production use with our Engine Community License.
  2. The Enterprise Edition requires a Trial License or can be fully licensed for production, commercial applications.


🧰 Hardware Support and System Requirements

Review Supported Hardware for the DeepSparse Engine to understand system requirements. The DeepSparse Engine runs natively on Linux. It does not run natively on Mac or Windows; on those operating systems, run Linux in a Docker container or a virtual machine.

The DeepSparse Engine is tested on Python 3.7-3.10, ONNX 1.5.0-1.12.0, and ONNX opset version 11+, and is manylinux compliant. Using a virtual environment is highly recommended.
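
The requirements above can be checked before installing. The following is a minimal sketch, not part of DeepSparse; the helper name check_environment is hypothetical, and it covers only the Python version range and the operating system stated above.

```python
import platform
import sys

# Supported range taken from the text above: Python 3.7 through 3.10, Linux only.
MIN_PYTHON = (3, 7)
MAX_PYTHON = (3, 10)


def check_environment(version=None, system=None):
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper: both arguments default to the running interpreter
    and operating system, and can be overridden for testing.
    """
    major, minor = tuple(version or sys.version_info[:2])[:2]
    system = system or platform.system()
    problems = []
    if not (MIN_PYTHON <= (major, minor) <= MAX_PYTHON):
        problems.append(f"Python {major}.{minor} is outside the tested range 3.7-3.10")
    if system != "Linux":
        problems.append(f"{system} is not supported natively; use Docker or a Linux VM")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the script prints one warning per unmet requirement and prints nothing when both checks pass.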


Install the Enterprise Edition as follows:

pip install deepsparse-ent

See the DeepSparse Enterprise Installation Page for further installation options.

Getting a License

The DeepSparse Enterprise Edition requires a valid license to run the engine and can be licensed for production, commercial applications. There are two options available:

90-Day Enterprise Trial License

To try out the DeepSparse Enterprise Edition and get a Neural Magic Trial License, complete our registration form. Once you submit the form, the license is emailed to you and your 90-day term begins immediately.

Enterprise Edition License

To learn more about DeepSparse Enterprise Edition pricing, contact our Sales team to discuss your use case further for a custom quote.

Installing a License

Once you have obtained a license, you will need to initialize it to be able to run the DeepSparse Enterprise Edition. You can initialize your license by running the command:

deepsparse.license <license_string> or <path/to/license.txt>

To initialize a license on a machine:

  1. Confirm you have deepsparse-ent installed in a fresh virtual environment.
    • Note: Installing deepsparse and deepsparse-ent on the same virtual environment may result in unsupported behaviors.
  2. Run deepsparse.license with the <license_string> or path/to/license.txt as an argument as follows:
    • deepsparse.license <samplelicensestring>
    • deepsparse.license ./license.txt
  3. If successful, deepsparse.license writes the license file to ~/.config/neuralmagic/license.txt. You may override this path by setting the environment variable NM_CONFIG_DIR (it must be set both before and after running the script) with the following command:
    • export NM_CONFIG_DIR=path/to/license.txt
  4. Once the license is authenticated, you should see a splash message indicating that you are now running DeepSparse Enterprise Edition.
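
To see where the license file is expected after the steps above, a small script can resolve the location. This is a sketch under assumptions: the helper expected_license_path is hypothetical, and it uses the value of NM_CONFIG_DIR exactly as it is set in the export command above. How the engine itself interprets that variable is not confirmed here.

```python
import os
from pathlib import Path

# Default location stated in the steps above.
DEFAULT_LICENSE_PATH = Path.home() / ".config" / "neuralmagic" / "license.txt"


def expected_license_path(env=None):
    """Return the location where the license file is expected.

    Hypothetical helper. Assumption: when NM_CONFIG_DIR is set, its value is
    used as given, mirroring the export command in the steps above.
    """
    env = os.environ if env is None else env
    override = env.get("NM_CONFIG_DIR")
    return Path(override).expanduser() if override else DEFAULT_LICENSE_PATH


if __name__ == "__main__":
    path = expected_license_path()
    status = "found" if path.exists() else "missing"
    print(f"License location {path}: {status}")
```

The script only reports whether something exists at that location; it does not check that the license is valid.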

If you encounter issues initializing your DeepSparse Enterprise Edition License, contact [email protected] for help.

Validating a License

Once you have initialized your license, you may want to check that it is still valid before running a workload on DeepSparse Enterprise Edition. To confirm your license is still active, run the command:

deepsparse.validate_license

deepsparse.validate_license can be run with no arguments, in which case it uses the license location from an existing environment variable (if set), or with one argument that gives the path to the license file, shown in this section as path/to/license.txt.

To validate a license on a machine:

  1. If you have successfully run deepsparse.license, deepsparse.validate_license can be used to validate that the license file is in the correct location:
    • Run the deepsparse.validate_license with no arguments. If the referenced license is valid, you should get the DeepSparse Enterprise Edition splash screen printed out in your terminal window.
    • If the NM_CONFIG_DIR environment variable was set when creating the license, ensure this variable is still set to the same value.
  2. If you want to supply the path/to/license.txt:
    • Run the deepsparse.validate_license with path/to/license.txt as an argument as follows:
    • deepsparse.validate_license --license_path path/to/license.txt
    • If the referenced license is valid, you should get the DeepSparse Enterprise Edition splash screen printed out in your terminal window.
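
The validation steps above can also be run from a script, for example as a pre-flight check before a workload starts. The sketch below wraps the CLI with subprocess. The helper names are hypothetical, and treating exit code 0 as "valid" is an assumption, since the text above only describes the splash screen as the success signal.

```python
import shutil
import subprocess


def build_validate_command(license_path=None):
    """Build the argument list for deepsparse.validate_license.

    The --license_path flag is the one shown in the steps above.
    """
    cmd = ["deepsparse.validate_license"]
    if license_path is not None:
        cmd += ["--license_path", str(license_path)]
    return cmd


def license_is_valid(license_path=None):
    """Run the validation command and report success.

    Assumption: an exit code of 0 means the license is valid.
    """
    if shutil.which("deepsparse.validate_license") is None:
        raise RuntimeError("deepsparse.validate_license not found; is deepsparse-ent installed?")
    result = subprocess.run(build_validate_command(license_path), capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print("License valid:", license_is_valid())
```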

If you encounter issues validating your DeepSparse Enterprise Edition License, contact [email protected] for help.


🔌 DeepSparse Server

The DeepSparse Server allows you to serve models and pipelines from the terminal. The server runs on top of the popular FastAPI web framework and Uvicorn web server. Install the server using the following command:

pip install deepsparse-ent[server]

Single Model

Once installed, the following example CLI command is available for running inference with a single BERT model:

deepsparse.server \
    task question_answering \
    --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

To look up arguments run: deepsparse.server --help.

Multiple Models

To serve multiple models in your deployment you can easily build a config.yaml. In the example below, we define two BERT models in our configuration for the question answering task:

num_cores: 1
num_workers: 1
endpoints:
  - task: question_answering
    route: /predict/question_answering/base
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
    batch_size: 1
  - task: question_answering
    route: /predict/question_answering/pruned_quant
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
    batch_size: 1

Finally, after your config.yaml file is built, run the server with the config file path as an argument:

deepsparse.server config config.yaml

See Getting Started with the DeepSparse Server for more info.
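
Once the server is running, a client can send requests to a route over HTTP. The following is a minimal sketch using only the Python standard library. The base URL (http://localhost:5543) and the JSON body, which mirrors the question and context fields of the question answering task, are assumptions; check deepsparse.server --help and the server documentation for the actual host, port, and request schema. The route is one of those defined in the config.yaml example above.

```python
import json
import urllib.request

# Assumption: the server listens on this host and port.
BASE_URL = "http://localhost:5543"


def build_request(route, question, context, base_url=BASE_URL):
    """Build a POST request for a question answering route.

    Assumption: the server accepts a JSON body with "question" and "context".
    """
    payload = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        base_url + route,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(route, question, context):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(route, question, context)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server that serves this route.
    print(ask("/predict/question_answering/base", "What's my name?", "My name is Snorlax"))
```

Pointing the same client at a different route, such as /predict/question_answering/pruned_quant, lets you compare the two models served by one deployment.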

📜 DeepSparse Benchmark

The benchmark tool is available on your CLI to run expressive model benchmarks on the DeepSparse Engine with minimal parameters.

Run deepsparse.benchmark -h to look up arguments:

deepsparse.benchmark [-h] [-b BATCH_SIZE] [-shapes INPUT_SHAPES]
                     [-ncores NUM_CORES] [-s {async,sync}] [-t TIME]
                     [-nstreams NUM_STREAMS] [-pin {none,core,numa}]
                     [-q] [-x EXPORT_PATH]
                     model_path

Getting Started with CLI Benchmarking includes examples of select inference scenarios:

  • Synchronous (Single-stream) Scenario
  • Asynchronous (Multi-stream) Scenario
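
To script the two scenarios above, the benchmark command can be assembled and launched from Python. This sketch uses only flags that appear in the usage text above (-b, -s, -ncores, -t); the helper names are hypothetical and the batch sizes in the example are arbitrary.

```python
import subprocess


def build_benchmark_command(model_path, batch_size=1, scenario="sync", num_cores=None, seconds=None):
    """Build the argument list for deepsparse.benchmark.

    scenario maps to the -s flag and must be "sync" or "async".
    """
    if scenario not in ("sync", "async"):
        raise ValueError("scenario must be 'sync' or 'async'")
    cmd = ["deepsparse.benchmark", model_path, "-b", str(batch_size), "-s", scenario]
    if num_cores is not None:
        cmd += ["-ncores", str(num_cores)]
    if seconds is not None:
        cmd += ["-t", str(seconds)]
    return cmd


def run_benchmark(*args, **kwargs):
    """Run the benchmark; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_benchmark_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    stub = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"
    print(" ".join(build_benchmark_command(stub, batch_size=1, scenario="sync")))
    print(" ".join(build_benchmark_command(stub, batch_size=16, scenario="async")))
```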

πŸ‘©β€πŸ’» NLP Inference Example

from deepsparse import Pipeline

# SparseZoo model stub or path to ONNX file
model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

# Create a question answering pipeline backed by the DeepSparse Engine
qa_pipeline = Pipeline.create(
    task="question-answering",
    model_path=model_path,
)

# Run inference on a question/context pair
my_name = qa_pipeline(question="What's my name?", context="My name is Snorlax")

🦉 SparseZoo ONNX vs. Custom ONNX Models

DeepSparse can accept ONNX models from two sources:

  • SparseZoo ONNX: our open-source collection of sparse models available for download. SparseZoo hosts inference-optimized models, trained on repeatable sparsification recipes using state-of-the-art techniques from SparseML.

  • Custom ONNX: your own ONNX model, which can be dense or sparse. Plug in your model to compare performance with other solutions.

Saving to: ‘mobilenetv2-7.onnx’

Custom ONNX Benchmark example:

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)
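
To turn a single run like the one above into a rough latency measurement, the call can be wrapped in a timing loop. The helper below is a hypothetical sketch that accepts any callable, so it does not depend on DeepSparse itself. For careful measurements, the deepsparse.benchmark CLI is the better tool.

```python
import statistics
import time


def time_inference(run_fn, inputs, iterations=20, warmup=3):
    """Call run_fn(inputs) repeatedly and return latency statistics in milliseconds.

    Hypothetical helper: warmup calls are executed first and not measured.
    """
    for _ in range(warmup):
        run_fn(inputs)
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_fn(inputs)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
    }


if __name__ == "__main__":
    # Stand-in workload so the script runs without an engine.
    print(time_inference(lambda data: sum(data), list(range(100_000))))
```

Assuming the compiled engine exposes a run method that takes the generated inputs, the call would be time_inference(engine.run, inputs).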

The GitHub repository includes package APIs along with examples to quickly get started benchmarking and inferencing sparse models.

Scheduling Single-Stream, Multi-Stream, and Elastic Inference

The DeepSparse Engine offers three types of inference scheduling, chosen based on your use case. Read more details here: Inference Types.

1 ⚡ Single-stream scheduling: the latency/synchronous scenario, requests execute serially. [default]


Use Case: It is optimized for minimum per-request latency and applies all of the system resources available to it to each request.

2 ⚡ Multi-stream scheduling: the throughput/asynchronous scenario, requests execute in parallel.


PRO TIP: The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them.

3 ⚡ Elastic scheduling: requests execute in parallel, but not multiplexed on individual NUMA nodes.

Use Case: A workload that might benefit from the elastic scheduler is one in which multiple requests need to be handled simultaneously, but where performance is hindered when those requests have to share an L3 cache.
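
The difference between serial and parallel request execution can be illustrated with a toy example. The sketch below does not call the DeepSparse Engine: handle_request is a hypothetical stand-in that sleeps instead of running inference, and a thread pool plays the role of multiple streams. It does not model the fact that the single-stream scheduler gives each request all available cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id, work_seconds=0.05):
    """Stand-in for one inference call; sleeps instead of computing."""
    time.sleep(work_seconds)
    return request_id


def run_serial(request_ids):
    """Single-stream style: each request finishes before the next starts."""
    return [handle_request(r) for r in request_ids]


def run_parallel(request_ids, streams=4):
    """Multi-stream style: up to `streams` requests are in flight at once."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(handle_request, request_ids))


if __name__ == "__main__":
    for name, runner in (("serial", run_serial), ("parallel", run_parallel)):
        start = time.perf_counter()
        runner(range(8))
        print(f"{name}: {time.perf_counter() - start:.2f} s for 8 requests")
```

Both runners return results in request order; only the total wall-clock time for the batch of requests differs.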






The Community Edition of the project's binary containing the DeepSparse Engine is licensed under the Neural Magic Engine License. Example files and scripts included in this repository are licensed under the Apache License Version 2.0 as noted.

The Enterprise Edition requires a Trial License or can be fully licensed for production, commercial applications.
