Quick Tour

To speed up inference and benchmarking on real models, we include the sparsezoo package. SparseZoo hosts inference-optimized models trained with repeatable sparsification recipes that use state-of-the-art techniques from SparseML.

Quickstart with SparseZoo ONNX Models

ResNet-50 Dense

Here is how to quickly perform inference with the DeepSparse Engine on a pre-trained dense ResNet-50 from SparseZoo.

from deepsparse import compile_model
from sparsezoo.models import classification

batch_size = 64

# Download model and compile as optimized executable for your machine
model = classification.resnet_50()
engine = compile_model(model, batch_size=batch_size)

# Fetch sample input and predict output using engine
inputs = model.data_inputs.sample_batch(batch_size=batch_size)
outputs, inference_time = engine.timed_run(inputs)
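
The elapsed time returned alongside the outputs can be turned into simple throughput figures. The following is a minimal sketch, assuming the time is reported in seconds (an assumption worth checking against the DeepSparse documentation); `summarize_run` is a hypothetical helper and not part of DeepSparse.

```python
def summarize_run(batch_size: int, inference_time: float) -> dict:
    """Derive throughput figures from one timed batch.

    Assumes inference_time is in seconds (not confirmed by the text above).
    """
    if batch_size <= 0 or inference_time <= 0:
        raise ValueError("batch_size and inference_time must be positive")
    return {
        "items_per_second": batch_size / inference_time,
        "ms_per_item": 1000.0 * inference_time / batch_size,
    }


if __name__ == "__main__":
    # Example values only: a batch of 64 that took 0.5 seconds
    print(summarize_run(64, 0.5))
```

With the example above, this could be called as `summarize_run(batch_size, inference_time)`. A single timed run is noisy, so averaging over several runs gives a steadier number.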

ResNet-50 Sparsified

When exploring available optimized models, you can use the Zoo.search_optimized_models utility to find models that share a base.

Try this on the dense ResNet-50 to see what is available:

from sparsezoo import Zoo
from sparsezoo.models import classification

model = classification.resnet_50()
print(Zoo.search_optimized_models(model))

We can see there are two pruned versions targeting FP32 and two pruned, quantized versions targeting INT8. The conservative, moderate, and aggressive tags recover to 100%, >=99%, and <99% of baseline accuracy respectively.
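
The tag thresholds can be read as a simple rule on the ratio of sparse to baseline accuracy. The sketch below encodes that rule as stated above; `recovery_tag` is a hypothetical helper and the accuracy values are made-up examples, not measured results.

```python
def recovery_tag(sparse_accuracy: float, baseline_accuracy: float) -> str:
    """Map an accuracy recovery ratio to the tag names used in the text.

    conservative: recovers to 100% of baseline
    moderate:     recovers to >= 99% of baseline
    aggressive:   recovers to < 99% of baseline
    """
    if baseline_accuracy <= 0:
        raise ValueError("baseline_accuracy must be positive")
    recovery = sparse_accuracy / baseline_accuracy
    if recovery >= 1.0:
        return "conservative"
    if recovery >= 0.99:
        return "moderate"
    return "aggressive"


if __name__ == "__main__":
    # Illustrative numbers only
    for sparse in (80.0, 79.5, 78.0):
        print(sparse, recovery_tag(sparse, baseline_accuracy=80.0))
```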

For a version of ResNet-50 that recovers close to the baseline and is very performant, choose the pruned_quant-moderate model. This model will run nearly 7x faster than the baseline model on a compatible CPU (with the VNNI instruction set enabled). For hardware compatibility, see the Hardware Support section.

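from deepsparse import compile_model
from sparsezoo.models import classification
import numpy

batch_size = 64
sample_inputs = [numpy.random.randn(batch_size, 3, 224, 224).astype(numpy.float32)]

# NOTE: the model arguments to both compile_model calls are cut off in the source.
# The dense baseline here follows the ResNet-50 Dense example above; the sparse
# model is left as a placeholder for the pruned_quant-moderate ResNet-50.
base_model = classification.resnet_50()
sparse_model = "<pruned_quant-moderate ResNet-50 from SparseZoo>"  # placeholder, not a real stub

# run baseline benchmarking
engine_base = compile_model(base_model, batch_size=batch_size)
benchmarks_base = engine_base.benchmark(sample_inputs)

# run sparse benchmarking
engine_sparse = compile_model(sparse_model, batch_size=batch_size)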
if not engine_sparse.cpu_vnni:
    print("WARNING: VNNI instructions not detected, quantization speedup not well supported")
benchmarks_sparse = engine_sparse.benchmark(sample_inputs)

print(f"Speedup: {benchmarks_sparse.items_per_second / benchmarks_base.items_per_second:.2f}x")
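
VNNI support can also be checked roughly before compiling anything. The sketch below is Linux-only and makes assumptions: it looks for the kernel's `avx512_vnni` and `avx_vnni` flag names in `/proc/cpuinfo`, which may not match exactly what `engine_sparse.cpu_vnni` reports. `has_vnni` is a hypothetical helper, not part of DeepSparse.

```python
from pathlib import Path


def has_vnni(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in cpuinfo-style text lists a VNNI flag."""
    vnni_flags = {"avx512_vnni", "avx_vnni"}  # Linux kernel flag names
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and vnni_flags & set(value.split()):
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("VNNI detected:", has_vnni(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo is not available on this platform")
```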

Quickstart with Custom ONNX Models

We accept ONNX files for custom models, too. Simply plug in your model to compare performance with other solutions.

> wget https://github.com/onnx/models/raw/master/vision/classification/mobilenet/model/mobilenetv2-7.onnx
Saving to: ‘mobilenetv2-7.onnx’

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 16

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)
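
To compare performance with other solutions on equal terms, each runtime can be timed with the same harness. The following is a minimal sketch using only the standard library; `items_per_second` is a hypothetical helper, and the warmup and iteration counts are arbitrary defaults rather than recommended settings.

```python
import time
from typing import Any, Callable


def items_per_second(
    run: Callable[[Any], Any],
    inputs: Any,
    batch_size: int,
    warmup: int = 5,
    iterations: int = 20,
) -> float:
    """Time repeated calls to run(inputs) and return throughput in items/second."""
    if batch_size <= 0 or iterations <= 0:
        raise ValueError("batch_size and iterations must be positive")
    # Warmup calls are excluded from timing so one-off setup costs do not skew results
    for _ in range(warmup):
        run(inputs)
    start = time.perf_counter()
    for _ in range(iterations):
        run(inputs)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return batch_size * iterations / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with the inference call of the runtime under test
    print(items_per_second(lambda x: sum(x), list(range(10_000)), batch_size=16))
```

With the engine compiled above, this could be called as `items_per_second(engine.run, inputs, batch_size)`, and the same function can wrap the inference call of any other runtime.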