
Deploying with the DeepSparse Server

This section explains how to deploy models with the DeepSparse Server.

Installation Requirements

This section requires the DeepSparse Server Install.
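If the server components are not yet installed, they are typically provided by the `server` extra of the deepsparse package (a minimal sketch; see the DeepSparse Server Install page for the options appropriate to your environment):

pip install deepsparse[server]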

Usage

The DeepSparse Server allows you to serve models and Pipelines over HTTP. The server runs on top of the popular FastAPI web framework and Uvicorn web server. It supports any task available in DeepSparse Pipelines, including NLP, image classification, and object detection. An updated list of available tasks can be found here.
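In general, you start the server by naming a task and pointing it at a model, as the examples throughout this page do. The placeholders in this sketch are illustrative, not specific values from this guide:

deepsparse.server \
  task <task_name> \
  --model_path <SparseZoo stub or /path/to/local/model>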

Run the help CLI to look up the available arguments.

$ deepsparse.server --help
> Usage: deepsparse.server [OPTIONS] COMMAND [ARGS]...
>
> Start a DeepSparse inference server for serving the models and pipelines.
>
> 1. `deepsparse.server config [OPTIONS] <config path>`
>
> 2. `deepsparse.server task [OPTIONS] <task>`
>
> Examples for using the server:
>
> `deepsparse.server config server-config.yaml`
>
> `deepsparse.server task question_answering --batch-size 2`
>
> `deepsparse.server task question_answering --host "0.0.0.0"`
>
> Example config.yaml for serving:
>
> ```yaml
> num_cores: 2
> num_workers: 2
> endpoints:
>   - task: question_answering
>     route: /unpruned/predict
>     model: zoo:some/zoo/stub
>   - task: question_answering
>     route: /pruned/predict
>     model: /path/to/local/model
> ```
>
> Options:
>   --help  Show this message and exit.
>
> Commands:
>   config  Run the server using configuration from a .yaml file.
>   task    Run the server using configuration with CLI options, which can...

Single Model Inference

Example CLI command for serving a single model for the question answering task:

deepsparse.server \
  task question_answering \
  --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

To make a request to your server, use the requests library and pass the request URL:

import requests

url = "http://localhost:5543/predict"

obj = {
    "question": "Who is Mark?",
    "context": "Mark is batman."
}

response = requests.post(url, json=obj)
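The server replies with JSON. Assuming the standard question-answering output, the parsed body can be inspected directly with the requests API:

print(response.status_code)
print(response.json())  # parsed JSON body, e.g. the predicted answer for question answering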

In addition, you can make a request with a curl command from the terminal:

curl -X POST \
  'http://localhost:5543/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "question": "Who is Mark?",
    "context": "Mark is batman."
  }'

Multiple Model Inference

To serve multiple models, you can build a config.yaml file. In the sample YAML file below, we define two BERT models to be served by deepsparse.server for the question answering task:

num_cores: 2
num_workers: 2
endpoints:
  - task: question_answering
    route: /unpruned/predict
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none
    batch_size: 1
  - task: question_answering
    route: /pruned/predict
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
    batch_size: 1

You can now run the server by passing the config file path to the config subcommand:

deepsparse.server config config.yaml

You can send requests to a specific model by appending its route from the config.yaml to the request URL. For example, to call the second (pruned) model, send a request to its configured /pruned/predict route:

import requests

url = "http://localhost:5543/pruned/predict"

obj = {
    "question": "Who is Mark?",
    "context": "Mark is batman."
}

response = requests.post(url, json=obj)
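Because both endpoints in the config expose the same question-answering interface, a short sketch (assuming the server above is running locally) can query each route with the same payload and compare the responses:

import requests

obj = {"question": "Who is Mark?", "context": "Mark is batman."}

# Query the dense (unpruned) and sparse (pruned) endpoints defined in config.yaml
for route in ("/unpruned/predict", "/pruned/predict"):
    response = requests.post(f"http://localhost:5543{route}", json=obj)
    print(route, response.json())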

💡 PRO TIP 💡: While your server is running, you can use the Swagger UI built into FastAPI to view your model's pipeline POST routes. The UI also enables you to easily make sample requests to your server. All you need to do is add /docs to the end of your host URL:

localhost:5543/docs
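Because the server is built on FastAPI, the raw OpenAPI schema should also be reachable at FastAPI's default path (stated here as an assumption rather than a documented DeepSparse endpoint):

curl http://localhost:5543/openapi.json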

