🚨 Note: The current Docs site is outdated. Neural Magic's 1.7 release, slated for January 2024, will include a Docs refresh. Meanwhile, please consult our GitHub repositories for the content: DeepSparse, SparseML, SparseZoo.

Deploying an Object Detection Model With Ultralytics YOLOv5 and DeepSparse

This page explains how to deploy an object detection model with DeepSparse.

DeepSparse allows accelerated inference, serving, and benchmarking of sparsified Ultralytics YOLOv5 models. The Ultralytics integration enables you to easily deploy sparsified YOLOv5 models with DeepSparse for GPU-class performance directly on the CPU.

This integration currently supports the original YOLOv5 and updated V6.1 architectures.

Installation Requirements

This use case requires the installation of DeepSparse Server+YOLO.

Getting Started

Before you start using DeepSparse, confirm your machine is compatible with our hardware requirements.

Model Format

To deploy an object detection model using DeepSparse, pass the model in the ONNX format. This gives DeepSparse the flexibility to serve any model in a framework-agnostic environment.

Below we describe two ways to obtain the required ONNX model.

Exporting the ONNX File From SparseML

This pathway is relevant if you intend to deploy a model created using the SparseML library.

After training your model with SparseML, locate the .pt file for the model you'd like to export and run the ONNX export script below.

sparseml.yolov5.export_onnx \
  --weights path/to/your/model \
  --dynamic  # allows for dynamic input shape

This creates a model.onnx file on the local filesystem, in the same directory as your weights.

The examples below use SparseZoo stubs, but simply pass the path to model.onnx in place of the stubs to use the local model.
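If you switch between a local export and a SparseZoo stub often, the choice can be wrapped in a small helper. The sketch below is hypothetical and not part of DeepSparse: `choose_model_path` and `is_sparsezoo_stub` are illustrative names, and the code assumes only what this page describes, namely that the export writes model.onnx next to the .pt weights and that SparseZoo stubs start with the `zoo:` prefix.

```python
from pathlib import Path
from typing import Optional

# Stub from the Python API example on this page (pruned YOLOv5-l).
DEFAULT_STUB = "zoo:cv/detection/yolov5-l/pytorch/ultralytics/coco/pruned-aggressive_98"


def is_sparsezoo_stub(model_path: str) -> bool:
    """Return True if the string looks like a SparseZoo stub (zoo: prefix)."""
    return model_path.startswith("zoo:")


def choose_model_path(weights_path: Optional[str], stub: str = DEFAULT_STUB) -> str:
    """Hypothetical helper: prefer a locally exported model.onnx, else use a stub.

    Assumes the SparseML export placed model.onnx in the same directory
    as the .pt weights file.
    """
    if weights_path is not None:
        onnx_path = Path(weights_path).parent / "model.onnx"
        if onnx_path.is_file():
            return str(onnx_path)
    return stub
```

The returned string can be passed directly as `model_path` in the examples below, since both stubs and local ONNX paths are accepted there.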

Using the ONNX File in the SparseZoo

This pathway is relevant if you plan to use an off-the-shelf model from the SparseZoo.

When a SparseZoo stub is passed in place of a local model path, DeepSparse downloads the appropriate ONNX and other configuration files from the SparseZoo repository. For example, the SparseZoo stub for the pruned (not quantized) YOLOv5 is:


The deployment API examples use SparseZoo stubs to highlight this pathway.
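The stubs on this page share a seven-segment layout after the `zoo:` prefix, which makes them easy to read programmatically. The sketch below splits a stub into parts; the field names (`domain`, `sub_domain`, `architecture`, and so on) are our own labels inferred from the stubs shown here, not an official SparseZoo schema.

```python
from typing import NamedTuple


class StubParts(NamedTuple):
    # Illustrative labels inferred from the stubs on this page.
    domain: str        # e.g. "cv"
    sub_domain: str    # e.g. "detection"
    architecture: str  # e.g. "yolov5-l"
    framework: str     # e.g. "pytorch"
    repo: str          # e.g. "ultralytics"
    dataset: str       # e.g. "coco"
    sparse_tag: str    # e.g. "pruned-aggressive_98"


def parse_stub(stub: str) -> StubParts:
    """Split a seven-segment SparseZoo stub into labeled parts."""
    prefix = "zoo:"
    if not stub.startswith(prefix):
        raise ValueError(f"not a SparseZoo stub: {stub!r}")
    segments = stub[len(prefix):].split("/")
    if len(segments) != 7:
        raise ValueError(f"expected 7 path segments, got {len(segments)}")
    return StubParts(*segments)


parts = parse_stub("zoo:cv/detection/yolov5-l/pytorch/ultralytics/coco/pruned-aggressive_98")
print(parts.architecture, parts.sparse_tag)  # yolov5-l pruned-aggressive_98
```

Reading the last segment is a quick way to tell pruned models (`pruned-...`) from pruned and quantized ones (`pruned_quant-...`), such as the yolov5-s stub used in the server and benchmark examples.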

Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model server that can be used for end-to-end inference, either within Python workflows or as an HTTP endpoint. Both options accept similar configuration parameters and support annotation serving for all YOLOv5 models.

Python API

Pipelines are the default interface for running inference with DeepSparse.

Once a model is obtained, either through SparseML training or directly from the SparseZoo, a Pipeline can be used to run end-to-end inference and deploy the sparsified neural network.

If no model is passed to the Pipeline for a given task, the Pipeline automatically selects a pruned and quantized model for that task from the SparseZoo for accelerated inference. Note that other models in the SparseZoo offer different tradeoffs between speed, size, and accuracy.

HTTP Server

As an alternative to the Python API, DeepSparse Server allows you to serve ONNX models and pipelines over HTTP. The server is configured with the same parameters and schemas as Pipelines, which keeps deployment simple. Once launched, the server exposes a /docs endpoint with full endpoint descriptions and support for making sample requests.

An example of starting a DeepSparse Server for YOLOv5 and sending it a request is given below.

Deployment Examples

The following section includes example usage of the Pipeline and server APIs for various object detection models. Each example uses a SparseZoo stub to pull down the model, but a local path to an ONNX file can also be passed as the model_path.

Python API

If you don't have an image ready, pull a sample image down with:

wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
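If wget is not available, the same download can be done from Python with only the standard library. This is a minimal sketch; `download_sample` is a hypothetical helper name, and it skips the download when the destination file already exists.

```python
import urllib.request
from pathlib import Path

SAMPLE_URL = (
    "https://raw.githubusercontent.com/neuralmagic/deepsparse/main/"
    "src/deepsparse/yolo/sample_images/basilica.jpg"
)


def download_sample(url: str = SAMPLE_URL, dest: str = "basilica.jpg") -> Path:
    """Download url to dest (like `wget -O dest url`) unless dest already exists."""
    target = Path(dest)
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target
```

Calling `download_sample()` once at the top of your script leaves basilica.jpg in the working directory, ready for the examples that follow.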

Create a Pipeline and run inference with the following.

from deepsparse import Pipeline

# SparseZoo stub for a pruned YOLOv5-l model (a local model.onnx path also works)
model_stub = "zoo:cv/detection/yolov5-l/pytorch/ultralytics/coco/pruned-aggressive_98"
images = ["basilica.jpg"]

yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path=model_stub,
)

# Run inference, passing IoU and confidence thresholds for post-processing
pipeline_outputs = yolo_pipeline(images=images, iou_thres=0.6, conf_thres=0.001)
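The `iou_thres` and `conf_thres` arguments follow the naming of YOLOv5's post-processing thresholds: detections scoring below the confidence threshold are dropped, and a box that overlaps a higher-scoring box of the same class with an intersection-over-union (IoU) above the IoU threshold is suppressed. The sketch below is a generic, pure-Python greedy non-maximum suppression (NMS) for illustration only. It is not DeepSparse's implementation, and it assumes boxes given as [x1, y1, x2, y2] pixel coordinates.

```python
from typing import List, Sequence

Box = Sequence[float]  # [x1, y1, x2, y2] in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], labels: List[str],
        conf_thres: float, iou_thres: float) -> List[int]:
    """Greedy per-class NMS; returns indices of kept detections, best score first."""
    # Drop detections below conf_thres, then visit the rest from highest score down.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thres),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i unless it overlaps an already-kept box of the same label
        # by more than iou_thres.
        if all(labels[k] != labels[i] or iou(boxes[i], boxes[k]) <= iou_thres
               for k in keep):
            keep.append(i)
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
labels = ["person", "person", "person"]
print(nms(boxes, scores, labels, conf_thres=0.001, iou_thres=0.6))  # [0, 2]
```

The first two boxes overlap with an IoU of 0.81, so the lower-scoring one is suppressed at `iou_thres=0.6`. A confidence threshold as low as 0.001, as in the Pipeline call above, keeps nearly every candidate box, leaving the IoU step to do most of the filtering.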

Annotate CLI

You can also use the annotate command to have the Engine save an annotated photo on disk.

deepsparse.object_detection.annotate --source basilica.jpg  # try --source 0 to annotate your live webcam feed

Running the above command will create an annotation-results folder and save the annotated image inside.

If a --model_filepath argument is not provided, zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned-aggressive_96 will be used by default.
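The exact file layout inside annotation-results is not described here, so the hypothetical helper below simply searches that folder recursively and returns the annotated images newest first. The folder name comes from this page; the set of image suffixes is an assumption.

```python
from pathlib import Path
from typing import List

# Assumed image extensions; adjust if your annotated outputs use another format.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_annotated_images(root: str = "annotation-results") -> List[Path]:
    """Return image files under root (including subfolders), newest first."""
    base = Path(root)
    if not base.is_dir():
        return []
    images = [p for p in base.rglob("*")
              if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES]
    # Sort by modification time so the latest annotation run comes first.
    return sorted(images, key=lambda p: p.stat().st_mtime, reverse=True)
```

After running the annotate command, `find_annotated_images()[0]` points at the most recently written image.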

HTTP Server

Spinning up:

deepsparse.server \
  task yolo \
  --model_path "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94"

Making a request:

import requests
import json

url = ''  # set to the endpoint URL of your running DeepSparse Server
path = ['basilica.jpg']  # list of images for inference
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
annotations = json.loads(resp.text)  # dictionary of annotation results
bounding_boxes = annotations["boxes"]
labels = annotations["labels"]
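To get a quick per-image summary of a response, you can count the detected labels. This sketch assumes, based on the keys used above, that `annotations["boxes"]` and `annotations["labels"]` each hold one list per uploaded image, in upload order; `count_labels_per_image` is a hypothetical helper, and the sample dictionary in the tests is made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def count_labels_per_image(annotations: Dict[str, Any]) -> List[Counter]:
    """Count detected labels for each image in a server response.

    Assumes annotations["boxes"] and annotations["labels"] each hold one
    list per input image, in the same order as the uploaded files.
    """
    boxes, labels = annotations["boxes"], annotations["labels"]
    if len(boxes) != len(labels):
        raise ValueError("boxes and labels disagree on the number of images")
    counts = []
    for image_boxes, image_labels in zip(boxes, labels):
        # Every box should have exactly one label.
        if len(image_boxes) != len(image_labels):
            raise ValueError("each box needs exactly one label")
        counts.append(Counter(image_labels))
    return counts
```

For example, `for img, counts in zip(path, count_labels_per_image(annotations)): print(img, dict(counts))` prints one summary line per uploaded image.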


Benchmarking

The mission of Neural Magic is to enable GPU-class inference performance on commodity CPUs. Want to find out how fast our sparse YOLOv5 ONNX models run inference? You can quickly benchmark them yourself with a single CLI command!

You only need to provide the model path of a SparseZoo ONNX model or your own local ONNX model to get started:

deepsparse.benchmark \
  zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 \
  --scenario sync

>Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94
>Batch Size: 1
>Scenario: sync
>Throughput (items/sec): 74.0355
>Latency Mean (ms/batch): 13.4924
>Latency Median (ms/batch): 13.4177
>Latency Std (ms/batch): 0.2166
>Iterations: 741
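The reported numbers can be sanity-checked with simple arithmetic. Assuming the sync scenario processes one batch at a time, throughput should be close to batch size × 1000 / mean latency in milliseconds. The sketch below parses the output shown above (the `parse_benchmark` helper is hypothetical) and performs that check.

```python
import re
from typing import Dict

# Benchmark output copied from above.
SAMPLE_OUTPUT = """\
>Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94
>Batch Size: 1
>Scenario: sync
>Throughput (items/sec): 74.0355
>Latency Mean (ms/batch): 13.4924
>Latency Median (ms/batch): 13.4177
>Latency Std (ms/batch): 0.2166
>Iterations: 741
"""


def parse_benchmark(text: str) -> Dict[str, str]:
    """Turn 'Key: value' lines (optionally prefixed with '>') into a dict."""
    result = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like "zoo:cv/..." stay intact.
        match = re.match(r"^>?\s*([^:]+):\s*(.+)$", line)
        if match:
            result[match.group(1).strip()] = match.group(2).strip()
    return result


stats = parse_benchmark(SAMPLE_OUTPUT)
batch = int(stats["Batch Size"])
latency_ms = float(stats["Latency Mean (ms/batch)"])
# One batch at a time: items/sec is roughly batch_size * 1000 / mean latency (ms).
implied = batch * 1000 / latency_ms
print(round(implied, 1))  # 74.1
```

The implied 74.1 items/sec closely matches the reported 74.0355; the small gap is consistent with overhead between iterations. Likewise, 741 iterations at about 74 items/sec corresponds to roughly 10 seconds of measurement.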

To learn more about benchmarking, refer to the appropriate documentation. Also, check out our Benchmarking Tutorial on GitHub.
