The Neural Magic Platform enables you to (1) Optimize a Model for Inference and (2) Deploy a Model on CPUs with GPU-class performance.
This page walks through the major functionality and provides pointers to more details.
SparseML and SparseZoo enable users to create models that are optimized for inference. With an inference-optimized model, users can reach GPU-class performance when deploying with DeepSparse on CPUs.
There are two workflows that allow users to accomplish this goal:
Sparse Transfer Learning is recommended for use cases with pre-sparsified models in SparseZoo. Sparsification From Scratch can be used to optimize any model but requires experimenting with hyperparameters to reach high levels of sparsity with high accuracy.
Each workflow can be applied via CLI scripts or Python code.
For supported use cases, SparseML provides CLI scripts that enable users to kick off Sparse Transfer Learning or Sparsification From Scratch runs with a single command.
Each use case has slightly different arguments that align to their integrations (the Transformer scripts adhere to Hugging Face format while the YOLOv5 scripts adhere to Ultralytics format), but they generally look something like the following:
sparseml.[use_case].train \
  --model [LOCAL_PATH / SPARSEZOO_STUB] \
  --dataset [LOCAL_PATH] \
  --recipe [LOCAL_PATH / SPARSEZOO_RECIPE_STUB] \
  --other_configs [e.g. BATCH_SIZE, EPOCHS, etc.]
Let's break down each argument:
--model: points SparseML to a trained model which is the starting point for the training process. In Sparse Transfer Learning, this is usually a SparseZoo stub that points to the pre-sparsified model of choice. In Sparsification From Scratch, this is usually a path to a trained PyTorch or TensorFlow model in a local filesystem.
--dataset: points SparseML to the dataset to be used (both Sparse Transfer Learning and Sparsification From Scratch require training data). Datasets must be provided in the form expected by the underlying integration. For instance, if training YOLOv5, data must be provided in the YOLOv5 format, and if training Transformers, data must be provided in the Hugging Face format.
--recipe: points SparseML to a YAML file called a recipe. Recipes encode the sparsity-related hyperparameters used by SparseML. For instance, a recipe for Sparsification From Scratch encodes the target sparsity level for each layer, while a recipe for Sparse Transfer Learning instructs SparseML to maintain sparsity as the fine-tuning occurs.
You can now see why SparseML makes Sparse Transfer Learning so easy. All you have to do is point SparseML to a pre-sparsified model and pre-made transfer learning recipe in SparseZoo and to your own dataset and you are off!
There are also pre-made Sparsification From Scratch recipes available in SparseZoo. For models not yet in SparseZoo, SparseML's declarative recipes make it easy to specify hyperparameters, allowing you to focus on running experiments rather than writing code.
For users needing flexibility for an unsupported use case or a custom training setup, SparseML provides Python APIs that let you integrate SparseML into any PyTorch or TensorFlow pipeline.
Because of the declarative nature of recipes, users can apply Sparse Transfer Learning and Sparsification From Scratch with just three additional lines of code around a training pipeline.
The following code illustrates all that is needed:
from sparseml.pytorch.optim import ScheduledModifierManager

model = Model(...)  # typical torch model
optimizer = Optimizer(...)  # typical torch optimizer
manager = ScheduledModifierManager.from_yaml(recipe_path)
optimizer = manager.modify(model, optimizer, steps_per_epoch)

# ...your typical training loop, using model/optimizer as usual

manager.finalize(model)
Let's break down this example step-by-step:
model and optimizer are the typical PyTorch objects used in every training loop.
ScheduledModifierManager.from_yaml(recipe_path) accepts a recipe_path, which points to the location of a YAML file called a Recipe. Recipes encode the hyperparameters of the Sparse Transfer Learning or Sparsification From Scratch workflows.
manager.modify(...) edits the model and optimizer objects to run the Sparse Transfer Learning or Sparsification From Scratch algorithms specified in the recipe.
model and optimizer are then used as usual in a training loop. If a Sparsification From Scratch recipe was given to the manager, the optimizer will gradually prune weights according to the recipe. If a Sparse Transfer Learning recipe was passed, pruned weights will remain at zero during gradient updates.
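To make the flow concrete, here is a minimal sketch of how those three SparseML lines wrap an ordinary PyTorch loop. The toy model, synthetic dataset, loss, epoch count, and the recipe.yaml path below are illustrative assumptions, not part of the SparseML API; substitute your own objects and recipe.

import torch
from sparseml.pytorch.optim import ScheduledModifierManager

# stand-ins for your own model, data, and optimizer (illustrative only)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
dataset = torch.utils.data.TensorDataset(
    torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,))
)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

# wrap the optimizer so the recipe's sparsity logic runs during training
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

num_epochs = 10  # in practice, align this with the epochs defined in the recipe
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()  # the modified optimizer applies recipe updates here

manager.finalize(model)  # remove SparseML hooks once training completes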
Additional Resources
Want to Learn More?
Check out our conceptual guide on optimizing for inference with sparsity.
DeepSparse is a CPU-only deep learning deployment platform. It wraps a sparsity-aware runtime that reaches GPU-class performance on inference-optimized sparse models with APIs that simplify the process of integrating a model into an application.
There are three primary interfaces for interacting with DeepSparse: Pipeline, Server, and Engine.
Pipeline and Server are the preferred pathways for interacting with DeepSparse.
DeepSparse Pipelines make it easy to integrate DeepSparse into an application by wrapping pre- and post-processing around the inference runtime. For example, a DeepSparse Pipeline in the NLP domain handles tokenization, meaning you can pass raw strings and receive answers. A DeepSparse Pipeline in the Object Detection domain handles input normalization (mean and std transform) as well as non-max suppression of the output, meaning you can pass raw images and receive bounding boxes.
For supported use cases, Pipelines are pre-made. For unsupported use cases, you can create a custom Pipeline by specifying a pre- and post-processing function, creating a consistent interface for interacting with DeepSparse.
Pipeline Usage - Python API
For a supported use case, the Pipeline class is the workhorse you will use. Simply specify a use case via the task argument and a model in ONNX format via the model_path argument and you are off!
from deepsparse import Pipeline

example_pipeline = Pipeline.create(
    task="example_task",      # e.g. image_classification or question_answering
    model_path="model.onnx",  # local model or SparseZoo stub
)

# pass raw, unprocessed input
pipeline_inputs = ["The quick brown fox jumps over the lazy dog"]

# get back post-processed outputs
pipeline_outputs = example_pipeline(pipeline_inputs)
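As a concrete illustration of a supported use case, the snippet below sketches a question-answering Pipeline. The local qa_model.onnx path is a placeholder for your own model or a SparseZoo stub, and the exact shape of the returned object depends on the task.

from deepsparse import Pipeline

# "qa_model.onnx" is a placeholder -- use your own model or a SparseZoo stub
qa_pipeline = Pipeline.create(
    task="question_answering",
    model_path="qa_model.onnx",
)

# the QA Pipeline handles tokenization, so raw strings go straight in
output = qa_pipeline(question="Who is Mark?", context="Mark is batman.")
print(output)  # post-processed prediction (the extracted answer span)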
For an unsupported use case, you will use CustomTaskPipeline to create a Pipeline. Simply specify a pre-processing and post-processing function and a model in ONNX format.
from deepsparse.pipelines.custom_pipeline import CustomTaskPipeline

def preprocess(inputs):
    pass  # define your function

def postprocess(outputs):
    pass  # define your function

custom_pipeline = CustomTaskPipeline(
    model_path="custom_model.onnx",
    process_inputs_fn=preprocess,
    process_outputs_fn=postprocess,
)

# pass raw, unprocessed input
pipeline_inputs = ["The quick brown fox jumps over the lazy dog"]

# get back post-processed outputs
pipeline_outputs = custom_pipeline(pipeline_inputs)
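To show what those functions might contain, here is a hedged sketch for a hypothetical image classifier exported as custom_model.onnx. The input shape, normalization, and argmax post-processing are assumptions about that model, not requirements of CustomTaskPipeline; the contract is simply that pre-processing returns the list of numpy arrays the engine expects and post-processing consumes the list of numpy arrays it returns.

import numpy as np
from deepsparse.pipelines.custom_pipeline import CustomTaskPipeline

def preprocess(inputs):
    # assumption: raw inputs are HWC uint8 images; the hypothetical model
    # expects a single NCHW float32 tensor normalized to [0, 1]
    batch = np.stack([img.astype(np.float32) / 255.0 for img in inputs])
    batch = batch.transpose(0, 3, 1, 2)  # HWC -> CHW
    return [np.ascontiguousarray(batch)]

def postprocess(outputs):
    # the engine hands back a list of numpy arrays; take the argmax of the logits
    logits = outputs[0]
    return logits.argmax(axis=1).tolist()

custom_pipeline = CustomTaskPipeline(
    model_path="custom_model.onnx",
    process_inputs_fn=preprocess,
    process_outputs_fn=postprocess,
)

# hypothetical raw input: one 224x224 RGB image
images = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)]
predicted_classes = custom_pipeline(images)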
Additional Resources
Beyond pre-processing and post-processing, Pipelines also have other useful utilities like Data Logging, Multi-Stream Inference, and Dynamic Batch. Check out the documentation on the Pipeline Class or the ad-hoc user guides.
DeepSparse Server wraps Pipelines with a REST API using the FastAPI web framework and Uvicorn web server. This enables you to spin up a model service around DeepSparse with no code.
Since Server is a wrapper around Pipelines, the Server inherits all of the functionality of Pipelines (including the pre- and post-processing phases), meaning you can pass raw unprocessed inputs to the Server and receive post-processed predictions.
Server Usage - Launch From CLI
DeepSparse Server is launched from the CLI, with configuration via either command line arguments or a configuration file.
With the command line argument path, users specify a use case via the task argument (e.g., image_classification or question_answering) as well as a model (either a local ONNX file or a SparseZoo stub) via the model_path argument:
deepsparse.server --task [use_case_name] --model_path [model_path]
With the config file path, users create a YAML file that specifies the server configuration. A YAML file looks like the following:
endpoints:
  - task: task_name         # specify use case (e.g., image_classification, question_answering)
    route: /predict         # specify the route of the endpoint
    model: model_path       # specify SparseZoo stub or path to local ONNX file
    name: any_name_you_want

# - ... add as many endpoints as needed
The Server is then launched with the following:
deepsparse.server --config_file config.yaml
Clients interact with the Server via HTTP. Because the Server uses Pipelines internally, users can simply pass raw data to the Server and receive back post-processed predictions.
For example, a user would do the following to query a Question Answering endpoint:
import requests

url = "http://localhost:5543/predict"

obj = {
    "question": "Who is Mark?",
    "context": "Mark is batman."
}

response = requests.post(url, json=obj)
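Continuing the example above, the response body is the Pipeline's post-processed prediction serialized as JSON; the exact fields depend on the endpoint's task (for question answering, an answer span and score).

print(response.status_code)  # 200 on success
print(response.json())       # post-processed prediction from the Pipeline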
Additional Resources
The Server also has other useful utilities like Data Logging, Multi-Stream Inference, Multiple Model Inference, and Dynamic Batch. Check out the documentation on the Server Class or the ad-hoc user guides.
Engine is the lowest supported level of interaction with the runtime.
This pathway is for users who want more control over the runtime or who want to run pre-processing and post-processing manually.
The Engine class is the workhorse for this pathway. Simply call the constructor with your desired parameters to create an instance of the runtime. Once the Engine is initialized, just pass a list of numpy arrays (a batch of input tensors, the same as would be passed to a PyTorch model) and the Engine will return a list of outputs.
For example:
from deepsparse import Engine
from deepsparse.utils import generate_random_inputs

onnx_filepath = "path/to/onnx/model.onnx"
batch_size = 64

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = Engine(onnx_filepath, batch_size)
outputs = engine.run(inputs)
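When moving past random sample data, the same call pattern applies to real inputs: build numpy arrays with the shapes your ONNX model expects and pass them as a list. The shape below (a batch of 64 3x224x224 float32 images) is an assumption about the example model, not something DeepSparse requires.

import numpy as np
from deepsparse import Engine

onnx_filepath = "path/to/onnx/model.onnx"
batch_size = 64

# Compile the model once
engine = Engine(onnx_filepath, batch_size)

# assumption: the model takes a single [batch, 3, 224, 224] float32 input
batch = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
outputs = engine.run([batch])  # list of input arrays in, list of output arrays back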
Additional Resources
There is also a MultiModelEngine available for users who want to interact directly with an Engine running multiple models. (Note: if you want to run multiple models on the same CPU, this pathway is strongly preferred.)
We also have a lower-level C++ API. Stay tuned for new documentation on this pathway, or reach out in the Community Slack for details.