Deploying NLP Models with Hugging Face Transformers and DeepSparse

This page explains how to deploy a sparse Transformer on DeepSparse.

DeepSparse allows accelerated inference, serving, and benchmarking of sparsified Hugging Face Transformer models. The Hugging Face integration enables you to easily deploy sparsified Transformers onto the DeepSparse Engine for GPU-class performance directly on the CPU.

This integration currently supports several fundamental NLP tasks out of the box:

  • Question Answering - posing questions about a document
  • Sentiment Analysis - assigning a sentiment to a piece of text
  • Text Classification - assigning a label or class to a piece of text (e.g., duplicate question pairing)
  • Token Classification - attributing a label to each token in a sentence (e.g., Named Entity Recognition)

We are actively working on adding more use cases; stay tuned!

Installation Requirements

This section requires the DeepSparse Server Install.

Getting Started

Before you start using the DeepSparse Engine, confirm that your machine is compatible with our hardware requirements.

Model Format

To deploy a Transformer using the DeepSparse Engine, pass the model in ONNX format along with its Hugging Face supporting files. This grants the engine the flexibility to serve any model in a framework-agnostic environment.

The DeepSparse Pipelines require the following files within a folder on the local server to properly load a Transformers model:

  • model.onnx - the exported ONNX model
  • tokenizer.json - the Hugging Face tokenizer configuration
  • config.json - the Hugging Face model configuration

There are two options for collecting these files:

1) Export the ONNX/Config Files From SparseML

This pathway is relevant if you intend to deploy a model created using SparseML.

After training your model with SparseML, locate the .pt file for the model you'd like to export and run the SparseML-integrated Transformers ONNX export script below. For example, to export a model trained for question answering, run:

sparseml.transformers.export_onnx --task question-answering --model_path model_path

This creates a model.onnx file and exports it to the local filesystem; tokenizer.json and config.json are also stored in this directory. All of the examples below use SparseZoo stubs, but you can pass the path to this local directory in their place.
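As a quick illustration, the exported directory can be passed directly to a Pipeline. This is a minimal sketch; the ./qa-export path is a hypothetical export location used only for the example.

from deepsparse import Pipeline

# Load model.onnx, tokenizer.json, and config.json from a local export directory.
# "./qa-export" is a placeholder path; point it at your own export folder.
qa_pipeline = Pipeline.create(
    task="question-answering",
    model_path="./qa-export",
)

print(qa_pipeline(question="What's my name?", context="My name is Snorlax"))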

2) Pass a SparseZoo Stub To DeepSparse

This pathway is relevant if you plan to use an off-the-shelf model from the SparseZoo.

All of DeepSparse's Pipelines and APIs can use a SparseZoo stub in place of a local folder. The Pipelines use the stubs to locate and download the ONNX and config files from the SparseZoo repo.

The examples below use option 2. However, you can pass the local path to the directory containing the config files in place of the SparseZoo stub.

Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model server that can be used for end-to-end inference, either within existing Python workflows or as an HTTP endpoint. Both options provide similar specifications for configuration and support a variety of NLP Transformers tasks, including question answering, text classification, sentiment analysis, and token classification.

Python API

Pipelines are the default interface for running inference with the DeepSparse Engine.

Once a model is obtained, either through SparseML training or directly from the SparseZoo, deepsparse.Pipeline can be used to facilitate end-to-end inference and deployment of the sparsified Transformers model.

If no model is specified to the Pipeline for a given task, the Pipeline will automatically select a pruned and quantized model for the task from the SparseZoo that can be used for accelerated inference. Note that other models in the SparseZoo will have different tradeoffs between speed, size, and accuracy.
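For reference, a Pipeline can also be created with an explicit model and batch size rather than relying on the task default. This is a minimal sketch; the stub shown is the question answering model used elsewhere on this page, and the batch_size value is illustrative.

from deepsparse import Pipeline

# Explicitly select a sparsified model from the SparseZoo instead of the task default.
# batch_size is shown for illustration; it defaults to 1.
qa_pipeline = Pipeline.create(
    task="question-answering",
    model_path="zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni",
    batch_size=1,
)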

HTTP Server

As an alternative to the Python API, the DeepSparse Server allows you to serve ONNX models and pipelines over HTTP. Configuring the server and making requests to it follow the same parameters and schemas as the Pipelines, enabling simple deployment. Once launched, a /docs endpoint is created with full endpoint descriptions and support for making sample requests.

Example deployments using NLP Transformer models are provided below. For full documentation on deploying sparse Transformer models with the DeepSparse Server, see the DeepSparse Server documentation.
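Once a server is running, the /docs endpoint can be used to confirm it is reachable and to inspect the request schemas. This is a minimal sketch, assuming a server running locally on its default port of 5543 (as in the examples below).

import requests

# The /docs endpoint describes the server's routes and request/response schemas.
response = requests.get("http://localhost:5543/docs")
print(response.status_code)  # 200 once the server is up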

Deployment Use Cases

The following section includes example usage of the Pipeline and server APIs for various NLP transformers tasks.

Question Answering

The question answering task accepts a question and a context. The pipeline will predict an answer for the question as a substring of the context. The following examples use a pruned and quantized question answering BERT model trained on the SQuAD dataset, downloaded by default from the SparseZoo.

Python Pipeline

from deepsparse import Pipeline

qa_pipeline = Pipeline.create(task="question-answering")
inference = qa_pipeline(question="What's my name?", context="My name is Snorlax")
>{'score': 0.9947717785835266, 'start': 11, 'end': 18, 'answer': 'Snorlax'}

HTTP Server

Spinning up:

deepsparse.server \
    task question-answering \
    --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

Making a request:

import requests

url = "http://localhost:5543/predict"  # Server's port defaults to 5543

obj = {
    "question": "Who is Mark?",
    "context": "Mark is batman."
}

response = requests.post(url, json=obj)
response.text
>'{"score":0.9534820914268494,"start":8,"end":14,"answer":"batman"}'

Sentiment Analysis

The sentiment analysis task takes in a sentence and classifies its sentiment. The following example uses a pruned and quantized text sentiment analysis BERT model trained on the sst2 dataset downloaded from the SparseZoo. This sst2 model classifies sentences as positive or negative.

Python Pipeline

from deepsparse import Pipeline

sa_pipeline = Pipeline.create(task="sentiment-analysis")

inference = sa_pipeline("Snorlax loves my Tesla!")
>[{'label': 'LABEL_1', 'score': 0.9884248375892639}] # positive sentiment

HTTP Server

Spinning up:

deepsparse.server \
    task sentiment-analysis \
    --model_path "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/12layer_pruned80_quant-none-vnni"

Making a request:

import requests

url = "http://localhost:5543/predict"  # Server's port defaults to 5543

obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
response.text
>'{"labels":["LABEL_1"],"scores":[0.9884248375892639]}'

Text Classification

The text classification task supports binary, multi-class, and regression predictions over sentence inputs. The following example uses a pruned and quantized text classification DistilBERT model trained on the qqp dataset, downloaded via a SparseZoo stub. The qqp dataset contains pairs of questions, and the task is to predict whether they are duplicates.

Python Pipeline

from deepsparse import Pipeline

tc_pipeline = Pipeline.create(
    task="text-classification",
    model_path="zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/qqp/pruned80_quant-none-vnni",
)

# inference of duplicate question pair
inference = tc_pipeline(
    sequences=[
        [
            "Which is the best gaming laptop under 40k?",
            "Which is the best gaming laptop under 40,000 rs?",
        ]
    ]
)
>TextClassificationOutput(labels=['duplicate'], scores=[0.9947025775909424])

HTTP Server

Spinning up:

deepsparse.server \
    task text-classification \
    --model_path "zoo:nlp/text_classification/distilbert-none/pytorch/huggingface/qqp/pruned80_quant-none-vnni"

Making a request:

import requests

url = "http://localhost:5543/predict"  # Server's port defaults to 5543

obj = {
    "sequences": [
        [
            "Which is the best gaming laptop under 40k?",
            "Which is the best gaming laptop under 40,000 rs?",
        ]
    ]
}

response = requests.post(url, json=obj)
response.text
>'{"labels": ["duplicate"], "scores": [0.9947025775909424]}'

Token Classification

The token classification task takes in sequences as inputs and assigns a class to each token. The following example uses a pruned and quantized token classification NER BERT model trained on the CoNLL dataset downloaded from the SparseZoo.

Python Pipeline

from deepsparse import Pipeline

# default model is a pruned + quantized NER model trained on the CoNLL dataset
tc_pipeline = Pipeline.create(task="token-classification")
inference = tc_pipeline("Drive from California to Texas!")
>[{'entity': 'LABEL_0','word': 'drive', ...},
>{'entity': 'LABEL_0','word': 'from', ...},
>{'entity': 'LABEL_5','word': 'california', ...},
>{'entity': 'LABEL_0','word': 'to', ...},
>{'entity': 'LABEL_5','word': 'texas', ...},
>{'entity': 'LABEL_0','word': '!', ...}]

HTTP Server

Spinning up:

deepsparse.server \
    task token-classification \
    --model_path "zoo:nlp/token_classification/bert-base/pytorch/huggingface/conll2003/12layer_pruned80_quant-none-vnni"

Making a request:

import requests

url = "http://localhost:5543/predict"  # Server's port defaults to 5543

obj = {"inputs": "Drive from California to Texas!"}

response = requests.post(url, json=obj)
response.text
>'{"predictions":[[{"entity":"LABEL_0","score":0.9998655915260315,"index":1,"word":"drive","start":0,"end":5,"is_grouped":false},{"entity":"LABEL_0","score":0.9998604655265808,"index":2,"word":"from","start":6,"end":10,"is_grouped":false},{"entity":"LABEL_5","score":0.9994636178016663,"index":3,"word":"california","start":11,"end":21,"is_grouped":false},{"entity":"LABEL_0","score":0.999838650226593,"index":4,"word":"to","start":22,"end":24,"is_grouped":false},{"entity":"LABEL_5","score":0.9994573593139648,"index":5,"word":"texas","start":25,"end":30,"is_grouped":false},{"entity":"LABEL_0","score":0.9998716711997986,"index":6,"word":"!","start":30,"end":31,"is_grouped":false}]]}'

Benchmarking

The mission of Neural Magic is to enable GPU-class inference performance on commodity CPUs. Want to find out how fast our sparse Hugging Face ONNX models perform inference? You can quickly do benchmarking tests on your own with a single CLI command!

You only need to provide the model path of a SparseZoo ONNX model or your own local ONNX model to get started:

deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
>Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
>Batch Size: 1
>Scenario: multistream
>Throughput (items/sec): 76.3484
>Latency Mean (ms/batch): 157.1049
>Latency Median (ms/batch): 157.0088
>Latency Std (ms/batch): 1.4860
>Iterations: 768
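If you prefer to measure latency from within Python rather than the CLI, a rough manual timing loop over a Pipeline works as well. This is a minimal sketch under illustrative assumptions (single input, 100 iterations, the default question answering model); it is not a replacement for deepsparse.benchmark.

import time
from deepsparse import Pipeline

qa_pipeline = Pipeline.create(task="question-answering")

# Warm up once, then time repeated single-input inferences.
qa_pipeline(question="What's my name?", context="My name is Snorlax")

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    qa_pipeline(question="What's my name?", context="My name is Snorlax")
elapsed = time.perf_counter() - start

print(f"Mean latency: {1000 * elapsed / iterations:.2f} ms/item")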

To learn more about benchmarking, refer to the DeepSparse benchmarking documentation.

Support

For Neural Magic Support, sign up or log in to our Deep Sparse Community Slack. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.
