Quick Tour

The Neural Magic Platform enables you to (1) Optimize a Model for Inference and (2) Deploy a Model on CPUs with GPU-class performance.

This page walks through the major functionality and provides pointers to more details.

Optimize a Model for Inference With SparseML

SparseML and SparseZoo enable users to create models that are optimized for inference. With an inference-optimized model, users can reach GPU-class performance when deploying with DeepSparse on CPUs.

There are two workflows that allow users to accomplish this goal:

  1. Sparse Transfer Learning: fine-tune pre-sparsified models onto custom data
  2. Sparsification From Scratch: apply pruning and quantization to any model

Sparse Transfer Learning is recommended for use cases with pre-sparsified models in SparseZoo. Sparsification From Scratch can be used to optimize any model but requires experimenting with hyperparameters to reach high levels of sparsity with high accuracy.

Each workflow can be applied via CLI scripts or Python code.

CLI Scripts

For supported use cases, SparseML provides CLI scripts that enable users to kick off Sparse Transfer Learning or Sparsification From Scratch runs with a single command.

Each use case takes slightly different arguments that follow the conventions of its integration (the Transformer scripts adhere to the Hugging Face format while the YOLOv5 scripts adhere to the Ultralytics format), but they generally look something like the following:

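[SPARSEML_TRAINING_SCRIPT] \
  --model [SPARSEZOO_STUB or LOCAL_PATH] \
  --dataset [LOCAL_PATH] \
  --recipe [SPARSEZOO_STUB or LOCAL_PATH] \
  --other_configs [e.g. BATCH_SIZE, EPOCHS, etc.]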

Let's break down each argument:

  • --model points SparseML to a trained model which is the starting point for the training process. In Sparse Transfer Learning, this is usually a SparseZoo stub that points to the pre-sparsified model of choice. In Sparsification From Scratch, this is usually a path to a trained PyTorch or TensorFlow model in a local filesystem.

  • --dataset points SparseML to the dataset to be used (both Sparse Transfer Learning and Sparsification From Scratch require training data). Datasets must be provided in the form expected by the underlying integration. For instance, if training YOLOv5, data must be provided in the YOLOv5 format, and if training Transformers, data must be provided in the Hugging Face format.

  • --recipe points SparseML to a YAML file called a recipe. Recipes encode sparsity-related hyperparameters used by SparseML. For instance, a recipe for Sparsification From Scratch encodes the target sparsity level for each layer while a recipe for Sparse Transfer Learning instructs SparseML to maintain sparsity as the fine-tuning occurs.
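
To make the idea of a target sparsity level concrete, the following is a minimal sketch of magnitude pruning in plain Python. It is not SparseML's implementation: the helper names `prune_to_sparsity` and `sparsity_of`, the toy weights, and the three-step schedule are all assumptions made for illustration. In a real run the weights would also be updated by training between pruning steps.

```python
def prune_to_sparsity(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to set to zero (0.0 to 1.0).
    """
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def sparsity_of(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical weights for a single small layer.
layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6]

# Gradual schedule: raise the target step by step instead of pruning all at once.
for target in (0.25, 0.5, 0.75):
    layer = prune_to_sparsity(layer, target)
    print(f"target={target:.2f} actual={sparsity_of(layer):.2f}")
```

A recipe plays the role of the schedule in this sketch: it declares the targets and timing, so the training code itself does not need to change.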

You can now see why SparseML makes Sparse Transfer Learning so easy. All you have to do is point SparseML to a pre-sparsified model and pre-made transfer learning recipe in SparseZoo and to your own dataset and you are off!

There are also pre-made Sparsification From Scratch recipes available in SparseZoo. For models not yet in SparseZoo, SparseML's declarative recipes make it easy to specify hyperparameters, allowing you to focus on running experiments rather than writing code.

Python API

For users needing flexibility for an unsupported use case or a custom training setup, SparseML provides Python APIs that let you integrate SparseML into any PyTorch or TensorFlow pipeline.

Because of the declarative nature of recipes, users can apply Sparse Transfer Learning and Sparsification From Scratch with just three additional lines of code around a training pipeline.

The following code illustrates all that is needed:

from sparseml.pytorch.optim import ScheduledModifierManager

model = Model(...)  # typical torch model
optimizer = Optimizer(...)  # typical torch optimizer
manager = ScheduledModifierManager.from_yaml(recipe_path)
optimizer = manager.modify(model, optimizer, steps_per_epoch)

# ...your typical training loop, using model/optimizer as usual

Let's break down this example step-by-step:

  • model and optimizer are the typical PyTorch objects used in every training loop.
  • ScheduledModifierManager.from_yaml(recipe_path) accepts a recipe_path, which points to the location of a YAML file called a Recipe. The Recipes encode the hyperparameters of the Sparse Transfer Learning or Sparsification From Scratch workflows.
  • manager.modify(...) edits the model and optimizer objects to run the Sparse Transfer Learning or Sparsification From Scratch algorithms specified in the recipe.
  • The model and optimizer are then used as usual in a training loop. If a Sparsification From Scratch recipe was given to the manager, then the optimizer will gradually prune weights according to the recipe. If a Sparse Transfer Learning recipe was passed, then pruned weights will remain at zero during gradient updates.
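
The idea that pruned weights stay at zero during gradient updates can be sketched with a simple mask. This is an illustration only and makes no claim about how SparseML implements it internally; `masked_sgd_step`, the learning rate, and the toy values are assumptions.

```python
def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step that keeps masked-out (pruned) weights at exactly zero."""
    return [
        (w - lr * g) if keep else 0.0  # pruned positions ignore their gradient
        for w, g, keep in zip(weights, grads, mask)
    ]


# Hypothetical pre-sparsified weights: zeros mark pruned positions.
weights = [0.9, 0.0, 0.0, -0.7]
mask = [w != 0.0 for w in weights]

# Gradients are nonzero everywhere, including at pruned positions.
grads = [0.5, -0.3, 0.2, -0.1]

updated = masked_sgd_step(weights, grads, mask)
print(updated)
```

Only the two surviving weights move. The sparsity pattern of the starting model is preserved, which is what allows a pre-sparsified model to be fine-tuned onto new data.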

Want to Learn More?

Check out our conceptual guide on optimizing for inference with sparsity.

Deploy on CPUs With DeepSparse

DeepSparse is a CPU-only deep learning deployment platform. It wraps a sparsity-aware runtime that reaches GPU-class performance on inference-optimized sparse models with APIs that simplify the process of integrating a model into an application.

There are three primary interfaces for interacting with DeepSparse:

  1. Pipeline - Python APIs that wrap the runtime with pre-processing and post-processing
  2. Server - REST APIs that allow you to create a model service around a Pipeline
  3. Engine - Python APIs that provide direct access to the runtime

Pipeline and Server are the preferred pathways for interacting with DeepSparse.


DeepSparse Pipelines make it easy to integrate DeepSparse into an application by wrapping pre- and post-processing around the inference runtime. For example, a DeepSparse Pipeline in the NLP domain handles tokenization, meaning you can pass raw strings and receive answers. A DeepSparse Pipeline in the Object Detection domain handles input normalization (mean and std transform) as well as non-max suppression of the output, meaning you can pass raw images and receive bounding boxes.

For supported use cases, Pipelines are pre-made. For unsupported use cases, you can create a custom Pipeline by specifying a pre and post-processing function, creating a consistent interface for interacting with DeepSparse.
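
The pattern of wrapping pre- and post-processing around a model call can be sketched in a few lines of plain Python. Everything here is hypothetical and stands in for the real thing: `SketchPipeline` is not a DeepSparse class, and `fake_model` is a placeholder for the inference runtime.

```python
class SketchPipeline:
    """Hypothetical stand-in for the pre-process -> run -> post-process pattern."""

    def __init__(self, preprocess, run_model, postprocess):
        self.preprocess = preprocess
        self.run_model = run_model
        self.postprocess = postprocess

    def __call__(self, raw_inputs):
        model_inputs = self.preprocess(raw_inputs)
        model_outputs = self.run_model(model_inputs)
        return self.postprocess(model_outputs)


def tokenize(texts):
    # Toy pre-processing: lowercase and split on whitespace.
    return [text.lower().split() for text in texts]


def fake_model(token_lists):
    # Placeholder for the runtime: "scores" each input by its token count.
    return [len(tokens) for tokens in token_lists]


def label(scores):
    # Toy post-processing: turn raw scores into readable labels.
    return ["long" if score > 5 else "short" for score in scores]


pipeline = SketchPipeline(tokenize, fake_model, label)
result = pipeline(["The quick brown fox jumps over the lazy dog", "Hello world"])
print(result)
```

The caller only ever sees raw inputs and final outputs, so the interface stays the same whichever model or processing functions sit behind it.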

Pipeline Usage - Python API

For a supported use case, the Pipeline class is the workhorse that you will use. Simply specify a use case via the task argument and a model in ONNX format via the model_path argument and you are off!

from deepsparse import Pipeline

example_pipeline = Pipeline.create(
    task="example_task",  # e.g. image_classification or question_answering
    model_path="model.onnx",  # local model or SparseZoo stub
)

# pass raw, unprocessed input
pipeline_inputs = ["The quick brown fox jumps over the lazy dog"]

# get back post-processed outputs
pipeline_outputs = example_pipeline(pipeline_inputs)

For an unsupported use case, you will use CustomTaskPipeline to create a Pipeline. Simply specify a pre-processing and post-processing function and a model in ONNX format.

from deepsparse.pipelines.custom_pipeline import CustomTaskPipeline

def preprocess(inputs):
    pass  # define your function

def postprocess(outputs):
    pass  # define your function

custom_pipeline = CustomTaskPipeline(
    model_path="custom_model.onnx",
    process_inputs_fn=preprocess,
    process_outputs_fn=postprocess,
)

# pass raw, unprocessed input
pipeline_inputs = ["The quick brown fox jumps over the lazy dog"]

# get back post-processed outputs
pipeline_outputs = custom_pipeline(pipeline_inputs)

Additional Resources

Beyond pre-processing and post-processing, Pipelines also have other useful utilities like Data Logging, Multi-Stream Inference, and Dynamic Batch. Check out the documentation on the Pipeline Class or the ad-hoc user guides:

  • Multi-Stream Scheduling Overview
  • Example Using Multi-Stream in Pipelines [Docs Coming Soon]
  • Data Logging in Pipelines [Docs Coming Soon]
  • Dynamic Batch [Docs Coming Soon]


DeepSparse Server wraps Pipelines with a REST API, using the FastAPI web framework and the uvicorn web server. This enables you to spin up a model service around DeepSparse with no code.

Since Server is a wrapper around Pipelines, the Server inherits all of the functionality of Pipelines (including the pre- and post-processing phases), meaning you can pass raw unprocessed inputs to the Server and receive post-processed predictions.

Server Usage - Launch From CLI

DeepSparse Server is launched from the CLI, with configuration via either command line arguments or a configuration file.

With the command line argument path, users specify a use case via the task argument (e.g., image_classification or question_answering) as well as a model (either a local ONNX file or a SparseZoo stub) via the model_path argument:

deepsparse.server --task [use_case_name] --model_path [model_path]

With the config file path, users create a YAML file that specifies the server configuration. A YAML file looks like the following:

endpoints:
  - task: task_name  # specify use case (e.g., image_classification, question_answering)
    route: /predict  # specify the route of the endpoint
    model: model_path  # specify SparseZoo stub or path to local ONNX file
    name: any_name_you_want

  # - ... add as many endpoints as needed

The Server is then launched with the following:

deepsparse.server --config_file config.yaml

Clients interact with the Server via HTTP. Because the Server uses Pipelines internally, users can simply pass raw data to the Server and receive back post-processed predictions.

For example, a user would do the following to query a Question Answering endpoint:

import requests

url = "http://localhost:5543/predict"

obj = {
    "question": "Who is Mark?",
    "context": "Mark is batman.",
}

response = requests.post(url, json=obj)
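
For clients that cannot install third-party packages, roughly the same request can be built with the Python standard library alone. This is a sketch under assumptions: it reuses the URL and payload from the example above, the `send` helper is hypothetical, and the shape of the reply depends on the task being served, so it is left undecoded beyond JSON parsing.

```python
import json
import urllib.request

url = "http://localhost:5543/predict"
obj = {"question": "Who is Mark?", "context": "Mark is batman."}

# Build a POST request whose body is the JSON-encoded payload.
request = urllib.request.Request(
    url,
    data=json.dumps(obj).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)


def send(req, timeout=10):
    """Send the request and parse the JSON reply (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# prediction = send(request)  # uncomment once a server is listening
```

Building the request separately from sending it makes it easy to inspect the exact method, headers, and body before anything goes over the network.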

Additional Resources

The Server also has other useful utilities like Data Logging, Multi-Stream Inference, Multiple Model Inference, and Dynamic Batch. Check out the documentation on the Server Class or the ad-hoc user guides:

  • Multi-Stream Scheduling Overview
  • Example Using Multi-Stream in Pipelines [Docs Coming Soon]
  • Data Logging in Pipelines [Docs Coming Soon]
  • Dynamic Batch [Docs Coming Soon]


Engine is the lowest supported level of interaction available with the runtime.

This pathway is for users who want more control over the runtime or who want to run pre-processing and post-processing manually.
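
As a rough illustration of what running these steps by hand involves, the sketch below shows a mean/std input transform and a softmax-based post-processing step in plain Python. The function names, the mean and std values, and the sample logits are all assumptions for illustration; a real model defines its own expected transform, and the model call itself would sit between the two steps.

```python
import math


def normalize(pixels, mean, std):
    """Scale 0-255 pixel values to 0-1, then apply a mean/std transform."""
    return [((p / 255.0) - mean) / std for p in pixels]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_class(logits):
    """Return the index and probability of the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Pre-processing: hypothetical mean/std of 0.5 maps pixels into [-1, 1].
model_input = normalize([0, 128, 255], mean=0.5, std=0.5)

# Post-processing: hypothetical raw scores as a model might return them.
class_id, confidence = top_class([1.0, 3.0, 0.5])
print(model_input, class_id, round(confidence, 3))
```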

Engine Usage - Python API

The Engine class is the workhorse for this pathway. Simply call the constructor with your desired parameters to create an instance with the runtime. Once the Engine is initialized, just pass a list of NumPy arrays (a batch of input tensors, the same as would be passed to a PyTorch model) and the Engine will return a list of outputs.

For example:

from deepsparse import Engine
from deepsparse.utils import generate_random_inputs

onnx_filepath = "path/to/onnx/model.onnx"
batch_size = 64

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = Engine(onnx_filepath, batch_size)
outputs = engine.run(inputs)

Additional Resources

There is also a MultiModelEngine available for users who want to interact directly with an Engine running multiple models (note: if you want to run multiple models on the same CPU, this pathway is strongly preferred.)

We also have a lower-level C++ API. Stay tuned for new documentation on this pathway or reach out in Community Slack for details.
