This page explains how to run a trained NLP model with DeepSparse through a Python API called Pipelines. Pipelines wrap key utilities around DeepSparse for easy testing and deployment. The text classification Pipeline, for example, wraps an NLP model with the proper pre-processing and post-processing steps, such as tokenization. This enables passing in raw text sequences and receiving the labeled predictions from DeepSparse without any extra effort. In this way, DeepSparse combines the simplicity of Pipelines with GPU-class performance on CPUs for sparse models.
This example requires the DeepSparse General Installation.
The first step is collecting an ONNX representation of the model and the required configuration files.
The text classification Pipeline is integrated with Hugging Face and uses Hugging Face's standards and configurations for model setup. The following files are required:

model.onnx - Exported Transformers model in the ONNX format.
tokenizer.json - Hugging Face tokenizer used with the model.
tokenizer_config.json - Hugging Face tokenizer configuration used with the model.
config.json - Hugging Face configuration file used with the model.

For an example of the configuration files, check out BERT's model page on Hugging Face.
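If you are assembling these files yourself, a quick sanity check can confirm that a local directory contains everything the Pipeline expects. The sketch below assumes a hypothetical directory name (./text-classification-model); only the file names come from the list above:

from pathlib import Path

# Hypothetical local directory holding the exported model and Hugging Face files
model_dir = Path("./text-classification-model")

required_files = ["model.onnx", "tokenizer.json", "tokenizer_config.json", "config.json"]
missing = [name for name in required_files if not (model_dir / name).exists()]

if missing:
    raise FileNotFoundError(f"Missing required files in {model_dir}: {missing}")
print(f"All required files found in {model_dir}")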
There are two options for passing these files to DeepSparse:
SparseZoo contains several pre-sparsified Transformer models, including the configuration files listed above. DeepSparse is integrated with SparseZoo and accepts SparseZoo stubs as inputs, automatically downloading and including the required files for easy testing and deployment.
The SparseZoo stubs can be found on SparseZoo model pages, and DistilBERT examples are provided below:
zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni
zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/base-none
These SparseZoo stubs are passed as arguments to the Pipeline constructor in the examples below.
Alternatively, you can use a custom or fine-tuned model from your local drive.
There are three steps to using a local model with Pipelines:

Gather the configuration files listed above (tokenizer.json, tokenizer_config.json, and config.json) into a local directory.
Add the model to the same directory as model.onnx (if you trained with SparseML, use ONNX export).
Pass the path of the local directory as the model_path in place of the SparseZoo stubs in the examples below, as shown in the sketch after this list.
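As a minimal sketch of the local-model path (the ./text-classification-model directory name here is a hypothetical example), the directory is passed to model_path in the same way as a SparseZoo stub:

from deepsparse import Pipeline

# Hypothetical local directory containing model.onnx plus the tokenizer and config files
local_pipeline = Pipeline.create(
    task="text-classification",
    model_path="./text-classification-model",
)
print(local_pipeline(["Fun for adults and children."]))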
With the text classification model set up, it can be passed to a DeepSparse Pipeline using the model_path argument. The sample code below uses the SparseZoo stub for the sparse-quantized DistilBERT model given above. The Pipeline automatically downloads the necessary files for the model from the SparseZoo and compiles them on your local machine with DeepSparse. Once compiled, the model Pipeline is ready for inference on text sequences.
from deepsparse import Pipeline

classification_pipeline = Pipeline.create(
    task="text-classification",
    model_path="zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni",
)
inference = classification_pipeline(
    [[
        "Fun for adults and children.",
        "Fun for only children.",
    ]]
)
print(inference)

> labels=['contradiction'] scores=[0.9983579516410828]
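The printed output above shows labels and scores fields on the returned object; assuming those attribute names, the top prediction can be read programmatically:

# Attribute names are taken from the printed output above (labels, scores)
top_label = inference.labels[0]
top_score = inference.scores[0]
print(f"{top_label}: {top_score:.4f}")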
Because this DistilBERT model was trained on the MNLI dataset, it can also be used to perform zero-shot text classification for arbitrary text sequences and labels. The code below gives an example of a zero-shot text classification pipeline.
from deepsparse import Pipeline

zero_shot_pipeline = Pipeline.create(
    task="zero_shot_text_classification",
    model_path="zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni",
    model_scheme="mnli",
    model_config={"hypothesis_template": "This text is related to {}"},
)
inference = zero_shot_pipeline(
    sequences='Who are you voting for in 2020?',
    labels=['politics', 'public health', 'Europe'],
)
print(inference)

> sequences='Who are you voting for in 2020?' labels=['politics', 'Europe', 'public health'] scores=[0.9345628619194031, 0.039115309715270996, 0.026321841403841972]
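Under the mnli model scheme, each candidate label is substituted into the hypothesis_template (for example, "This text is related to politics"), and the model scores whether the input sequence entails that hypothesis; the entailment scores across the candidate labels produce the ranking shown above.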
The DeepSparse installation includes a benchmark CLI for convenient and easy inference benchmarking: deepsparse.benchmark. The CLI takes in either a SparseZoo stub or a path to a local model.onnx file.
The command below provides an example of benchmarking a dense DistilBERT model with DeepSparse. The output shows that the model achieved 32.3 items per second on a 4-core CPU.
$ deepsparse.benchmark zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/base-none

> DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.0.0 (8eaddc24) (release) (optimized) (system=avx512, binary=avx512)
> Original Model Path: zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/base-none
> Batch Size: 1
> Scenario: async
> Throughput (items/sec): 32.2806
> Latency Mean (ms/batch): 61.9034
> Latency Median (ms/batch): 61.7760
> Latency Std (ms/batch): 0.4792
> Iterations: 324
Running on the same server, the command below shows how the benchmark numbers change with a sparsified version of DistilBERT. It achieved 221.0 items per second, a 6.8X increase in performance over the dense baseline.
$ deepsparse.benchmark zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni

> DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.0.0 (8eaddc24) (release) (optimized) (system=avx512, binary=avx512)
> Original Model Path: zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/mnli/pruned80_quant-none-vnni
> Batch Size: 1
> Scenario: async
> Throughput (items/sec): 220.9794
> Latency Mean (ms/batch): 9.0147
> Latency Median (ms/batch): 9.0085
> Latency Std (ms/batch): 0.1037
> Iterations: 2210