🚨 Note: The current Docs site is outdated. Neural Magic's 1.7 release slated for January 2024 will include a Docs refresh. Meanwhile, please consult our GitHub repositories for the content:   DeepSparse,   SparseML,   SparseZoo.
Neural Magic LogoNeural Magic Logo
Get Started
Use a Model
NLP Text Classification

Use a Text Classification Model

This page explains how to run a trained model with DeepSparse for NLP inside a Python API called Pipelines.

Pipelines wraps key utilities around DeepSparse for easy testing and deployment.

The text classification Pipeline, for example, wraps an NLP model with the proper pre-processing and post-processing pipelines, such as tokenization. This enables passing in raw text sequences and receiving the labeled predictions from DeepSparse without any extra effort. In this way, DeepSparse combines the simplicity of Pipelines with GPU-class performance on CPUs for sparse models.

Installation Requirements

This example requires DeepSparse General Installation.

Model Setup

The first step is collecting an ONNX representaiton of the model and required configuration files. The text classification Pipeline is integrated with Hugging Face and uses Hugging Face's standards and configurations for model setup. The following files are required:

  • model.onnx - Exported Transformers model in the ONNX format.
  • tokenizer.json - Hugging Face tokenizer used with the model.
  • tokenizer_config.json - Hugging Face tokenizer configuration used with the model.
  • config.json - Hugging Face configuration file used with the model.

For an example of the configuration files, check out BERT's model page on Hugging Face.

There are two options for passing these files to DeepSparse:

1) Using SparseZoo stubs (recommended starting point)

SparseZoo contains several pre-sparsified Transformer models, including the configuration files listed above. DeepSparse is integrated with SparseZoo, and supports SparseZoo stubs as inputs for automatic download and inclusion into easy testing and deployment.

The SparseZoo stubs can be found on SparseZoo model pages, and DistilBERT examples are provided below:


These SparseZoo stubs are passed arguments to the Pipeline constructor in the examples below.

2) Using a local model

Alternatively, you can use a custom or fine-tuned model from your local drive.

There are three steps to using a local model with Pipelines:

  1. Export the model to model.onnx (if you trained with SparseML, use ONNX export).
  2. Collect the configuration files listed above. These are generally stored with the resulting model files from Hugging Face training pipelines (as is the case with SparseML).
  3. Place the files into a directory.

Pass the path of the local directory in the --model_path in place of the SparseZoo stubs in the examples below.

Inference Pipelines

With the text classification model set up, the model can be passed into a DeepSparse Pipeline utilizing the model_path argument. The SparseZoo stub for the sparse-quantized DistilBERT model given at the beginning is used in the sample code below. The Pipeline automatically downloads the necessary files for the model from the SparseZoo and compiles them on your local machine in DeepSparse. Once compiled, the model Pipeline is ready for inference with text sequences.

1from deepsparse import Pipeline
3classification_pipeline = Pipeline.create(
4 task="text-classification",
5 model_path="zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni",
7inference = classification_pipeline(
8 [[
9 "Fun for adults and children.",
10 "Fun for only children.",
11 ]]
>labels=['contradiction'] scores=[0.9983579516410828]

Because DistilBERT is a language model trained on the MNLI dataset, it can additionally be used to perform zero-shot text classification for any text sequences. The code below gives an example of a zero-shot text classification pipeline.

1from deepsparse import Pipeline
3zero_shot_pipeline = Pipeline.create(
4 task="zero_shot_text_classification",
5 model_path="zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni",
6 model_scheme="mnli",
7 model_config={"hypothesis_template": "This text is related to {}"},
9inference = zero_shot_pipeline(
10 sequences='Who are you voting for in 2020?',
11 labels=['politics', 'public health', 'Europe'],
>sequences='Who are you voting for in 2020?' labels=['politics', 'Europe', 'public health'] scores=[0.9345628619194031, 0.039115309715270996, 0.026321841403841972]


The DeepSparse installation includes a benchmark CLI for convenient and easy inference benchmarking: deepsparse.benchmark. The CLI takes in either a SparseZoo stub or a path to a local model.onnx file.

Dense DistilBERT

The code below provides an example for benchmarking a dense DistilBERT model with DeepSparse. The output shows that the model achieved 32.6 items per second on a 4-core CPU.

$deepsparse.benchmark zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/base-none
>DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.0.0 (8eaddc24) (release) (optimized) (system=avx512, binary=avx512)
>Original Model Path: zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/base-none
>Batch Size: 1
>Scenario: async
>Throughput (items/sec): 32.2806
>Latency Mean (ms/batch): 61.9034
>Latency Median (ms/batch): 61.7760
>Latency Std (ms/batch): 0.4792
>Iterations: 324

Sparsified DistilBERT

Running on the same server, the code below shows how the benchmarks change when utilizing a sparsified version of DistilBERT. It achieved 221.0 items per second, a 6.8X increase in performance over the dense baseline.

$deepsparse.benchmark zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni
>DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.0.0 (8eaddc24) (release) (optimized) (system=avx512, binary=avx512)
>Original Model Path: zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni
>Batch Size: 1
>Scenario: async
>Throughput (items/sec): 220.9794
>Latency Mean (ms/batch): 9.0147
>Latency Median (ms/batch): 9.0085
>Latency Std (ms/batch): 0.1037
>Iterations: 2210
Use a Model
Use an Object Detection Model