
Deploying with DeepSparse on AWS SageMaker

Amazon SageMaker offers an easy-to-use infrastructure for deploying deep learning models at scale. This directory provides a guided example for deploying a DeepSparse inference server on SageMaker for the question answering NLP task. Deployments benefit from both sparse-CPU acceleration with DeepSparse and automatic scaling from SageMaker.

Installation Requirements

The listed steps can be completed using Python and Bash. The following credentials, tools, and libraries are also required:

  • The AWS CLI version 2.X, already configured. Double-check that the region configured in your AWS CLI matches the region in the SparseMaker class found in the file. Currently, the default region is us-east-1.
  • The ARN of an AWS role with full SageMaker permissions:
    • AmazonSageMakerFullAccess
  • In the following steps, we will refer to this ARN as ROLE_ARN. It should take the form "arn:aws:iam::XXX:role/service-role/XXX". In addition to role permissions, make sure the AWS user who configured the AWS CLI has ECR/SageMaker permissions.
  • Docker and the Docker CLI.
  • The boto3 Python AWS SDK (pip install boto3).
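A quick sanity check on ROLE_ARN can catch a leftover placeholder before any AWS call is made. The sketch below is an illustration only: the helper name looks_like_role_arn and the pattern are assumptions, and the pattern covers only the standard "aws" partition with a 12-digit account ID.

```python
import re

# Assumed shape: standard "aws" partition, 12-digit account ID, any role path.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")


def looks_like_role_arn(value):
    """Return True if value has the general shape of an IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))
```

A value that still contains XXX or PLACEHOLDER fails the check, which is the case it is meant to catch.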

Quick Start

git clone
cd deepsparse/examples/aws-sagemaker
pip install -r requirements.txt

Before starting, replace the role_arn PLACEHOLDER string at the bottom of the SparseMaker class in the file with your AWS role ARN. Your ARN should look something like this: "arn:aws:iam::XXX:role/service-role/XXX"

Run the following command to build your SageMaker endpoint.

python create

After the endpoint has been staged (~1 minute), you can start making requests. Create a client by passing your endpoint's region name and the endpoint name, then run inference by passing in your question and context:

from qa_client import Endpoint

qa = Endpoint("us-east-1", "question-answering-example-endpoint")
answer = qa.predict(question="who is batman?", context="Mark is batman.")

answer: b'{"score":0.6484262943267822,"answer":"Mark","start":0,"end":4}'

If you want to delete your endpoint, please use:

python destroy

Continue reading to learn more about the files in this directory, the build requirements, and a descriptive step-by-step guide for launching a SageMaker endpoint.


In addition to the step-by-step instructions below, the directory contains additional files to aid in the deployment.


The included Dockerfile builds an image on top of the standard python:3.8 image with deepsparse installed and creates an executable command serve that runs deepsparse.server on port 8080. SageMaker will execute this image by running docker run serve and expects the image to serve inference requests at the invocations/ endpoint.

For general customization of the server, changes should not need to be made to the Dockerfile, but to the config.yaml file that the Dockerfile reads from instead.
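Because the image serves requests on port 8080 at the invocations/ route, it could be smoke-tested locally before being pushed. The sketch below assumes the container is already running with port 8080 published on localhost; the helper names, the host default, and the timeout are hypothetical, and the payload fields mirror the question answering inputs used elsewhere in this guide.

```python
import json
import urllib.request


def build_invocation_request(question, context, host="http://localhost:8080"):
    """Build a POST request for the container's invocations route."""
    payload = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/invocations",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the raw response bytes.

    Requires the container to be running and reachable at the request URL.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running container.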


config.yaml is used to configure the DeepSparse server running in the Dockerfile. The config must contain the line integration: sagemaker so endpoints may be provisioned correctly to match SageMaker specifications.

Notice that the model_path and task are set to run a sparse-quantized question-answering model from SparseZoo. To use a model directory stored in s3, set model_path to /opt/ml/model in the config and add ModelDataUrl=<MODEL-S3-PATH> to the CreateModel arguments. SageMaker will automatically copy the files from the s3 path into /opt/ml/model which the server can then read from.
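As an illustration of the S3 option, the sketch below assembles CreateModel arguments with an optional ModelDataUrl. The helper name build_create_model_args is hypothetical. In boto3, ModelDataUrl is a field of the container definition, and SageMaker's documentation describes it as pointing to a gzip-compressed tar archive, so a model directory may need to be packaged that way first.

```python
def build_create_model_args(model_name, image_uri, role_arn, model_data_url=None):
    """Assemble keyword arguments for the SageMaker create_model call."""
    container = {"Image": image_uri}
    if model_data_url is not None:
        # SageMaker copies the artifacts at this S3 path into /opt/ml/model.
        container["ModelDataUrl"] = model_data_url
    return {
        "ModelName": model_name,
        "Containers": [container],
        "ExecutionRoleArn": role_arn,
        "EnableNetworkIsolation": False,
    }
```

The result can be passed to a boto3 SageMaker client as create_model(**args); with model_data_url left unset, the container definition holds only the image.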

Bash script for pushing your local Docker image to the AWS ECR repository.

Contains the SparseMaker object for automating the build of a SageMaker endpoint from a Docker image. You can customize the parameters of the class to match the preferred state of your deployment.

Contains a client object for making requests to the SageMaker inference endpoint for the question answering task.

More information on the DeepSparse server and its configuration can be found here.

Deploying to SageMaker

The following steps are required to provision and deploy DeepSparse to SageMaker for inference:

  • Build the DeepSparse-SageMaker Dockerfile into a local docker image
  • Create an Amazon ECR repository to host the image
  • Push the image to the ECR repository
  • Create a SageMaker Model that reads from the hosted ECR image
  • Build a SageMaker EndpointConfig that defines how to provision the model deployment
  • Launch the SageMaker Endpoint defined by the Model and EndpointConfig

Building the DeepSparse-SageMaker Image Locally

The Dockerfile can be built from this directory in a bash shell using the following command. The image will be tagged locally as deepsparse-sagemaker-example.

docker build -t deepsparse-sagemaker-example .

Creating an ECR Repository

The following code snippet can be used in Python to create an ECR repository. The region_name can be swapped to a preferred region. The repository will be named deepsparse-sagemaker. If the repository is already created, this step may be skipped.

import boto3

ecr = boto3.client("ecr", region_name='us-east-1')
create_repository_res = ecr.create_repository(repositoryName="deepsparse-sagemaker")
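Since the repository may already exist, the creation step can be made safe to re-run. The following is a minimal sketch, assuming a boto3 ECR client is passed in; ensure_ecr_repository is a hypothetical helper name, not part of the example directory.

```python
def ensure_ecr_repository(ecr_client, name="deepsparse-sagemaker"):
    """Create the ECR repository if needed and return its URI."""
    try:
        response = ecr_client.create_repository(repositoryName=name)
        return response["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository already exists, so look up its URI instead.
        response = ecr_client.describe_repositories(repositoryNames=[name])
        return response["repositories"][0]["repositoryUri"]
```

It could be called as ensure_ecr_repository(boto3.client("ecr", region_name="us-east-1")), and the returned URI is the repository the image is later pushed to.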

Pushing the Local Image to the ECR Repository

Once the image is built and the ECR repository is created, the image can be pushed using the following bash commands.

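account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
# ECR registry host for this account and region
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account
# Full image name: registry host, repository, and tag
fullname=$ecr_account/deepsparse-sagemaker:latest
docker tag deepsparse-sagemaker-example:latest $fullname
docker push $fullname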

An abbreviated successful output will look like:

Login Succeeded
The push refers to repository []
3c2284f66840: Preparing
08fa02ce37eb: Preparing
a037458de4e0: Preparing
bafdbe68e4ae: Preparing
a13c519c6361: Preparing
6817758dd480: Waiting
6d95196cbe50: Waiting
e9872b0f234f: Waiting
c18b71656bcf: Waiting
2174eedecc00: Waiting
03ea99cd5cd8: Pushed
585a375d16ff: Pushed
5bdcc8e2060c: Pushed
latest: digest: sha256:XXX size: 3884

Creating a SageMaker Model

A SageMaker Model can now be created that references the pushed image. The example model will be named question-answering-example. As mentioned in the requirements, ROLE_ARN should be the ARN string of an AWS role with full access to SageMaker.

import boto3

# ROLE_ARN is the role ARN from the requirements, e.g. "arn:aws:iam::XXX:role/service-role/XXX"
sm_boto3 = boto3.client("sagemaker", region_name="us-east-1")

region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]

# URI of the image pushed to the deepsparse-sagemaker ECR repository
image_uri = "{}.dkr.ecr.{}.amazonaws.com/deepsparse-sagemaker:latest".format(account_id, region)

create_model_res = sm_boto3.create_model(
    ModelName="question-answering-example",
    Containers=[
        {
            "Image": image_uri,
        },
    ],
    ExecutionRoleArn=ROLE_ARN,
    EnableNetworkIsolation=False,
)

More information about options for configuring SageMaker Model instances can be found here.

Building a SageMaker EndpointConfig

The EndpointConfig sets the instance type to provision, the number of instances, scaling rules, and other deployment settings. The following code snippet defines an endpoint with a single ml.c5.2xlarge CPU instance.

model_name = "question-answering-example"  # model defined above
initial_instance_count = 1
instance_type = "ml.c5.2xlarge"  # 8 vcpus

variant_name = "QuestionAnsweringDeepSparseDemo"  # ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}

production_variants = [
    {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InitialInstanceCount": initial_instance_count,
        "InstanceType": instance_type,
    }
]

endpoint_config_name = "QuestionAnsweringExampleConfig"  # ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants,
}

endpoint_config_res = sm_boto3.create_endpoint_config(**endpoint_config)

Launching a SageMaker Endpoint

Once the EndpointConfig is defined, the endpoint can be easily launched using the create_endpoint command:

endpoint_name = "question-answering-example-endpoint"
endpoint_res = sm_boto3.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

After creating the endpoint, its status can be checked by running the following. Initially, the EndpointStatus will be Creating. Once the image has launched successfully, it will be InService. If there are any errors, it will become Failed.

from pprint import pprint
pprint(sm_boto3.describe_endpoint(EndpointName=endpoint_name))
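Rather than checking by hand, the status can be polled until the endpoint leaves the Creating state. The sketch below assumes a boto3 SageMaker client is passed in; wait_for_endpoint, the polling interval, and the poll limit are hypothetical choices. boto3 also provides a built-in endpoint_in_service waiter that serves a similar purpose.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=15, max_polls=80):
    """Poll describe_endpoint until the endpoint is no longer Creating."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status != "Creating":
            # Typically InService on success or Failed on error.
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} still Creating after {max_polls} polls")
```

The caller should compare the returned status against InService before sending requests, since a Failed status is returned in the same way.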

Making a Request to the Endpoint

After the endpoint is in service, requests can be made to it through the invoke_endpoint API. Inputs are passed as a JSON payload.

import json

sm_runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

body = json.dumps(
    dict(
        question="Where do I live?",
        context="I am a student and I live in Cambridge",
    )
)

content_type = "application/json"
accept = "text/plain"

res = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body,
    ContentType=content_type,
    Accept=accept,
)

# The response body is a stream; read it to get the raw bytes
print(res["Body"].read())
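The raw response can be decoded into a dictionary for easier access to the answer. This sketch assumes the body is UTF-8 JSON with the same fields as the Quick Start output (score, answer, start, end); parse_answer is a hypothetical helper name.

```python
import json


def parse_answer(raw_body):
    """Decode the raw bytes returned by the endpoint into a dict."""
    return json.loads(raw_body.decode("utf-8"))
```

For example, it could be applied to the bytes read from the response body, after which the answer text is available under the "answer" key.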


The model and endpoint can be deleted with the following commands:
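A minimal sketch of the cleanup, assuming the SageMaker client and the resource names from the steps above; delete_deployment is a hypothetical helper that wraps the three boto3 delete calls.

```python
def delete_deployment(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its EndpointConfig, and the Model, in that order."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With the names used in this guide, the call would be delete_deployment(sm_boto3, "question-answering-example-endpoint", "QuestionAnsweringExampleConfig", "question-answering-example").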


Next Steps

These steps create an invokable SageMaker inference endpoint powered by the DeepSparse Engine. The EndpointConfig settings may be adjusted to set instance scaling rules based on deployment needs.
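One common way to add scaling rules to a running endpoint is through the Application Auto Scaling API rather than the EndpointConfig itself. The sketch below only builds the request arguments; the helper name, policy name, capacity limits, and target value are all hypothetical and would need tuning for a real workload.

```python
def build_autoscaling_requests(endpoint_name, variant_name, min_instances=1,
                               max_instances=4, invocations_per_instance=100.0):
    """Build arguments for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{variant_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Assumed target: average invocations per instance per minute.
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy
```

The two dictionaries could then be passed to a boto3 "application-autoscaling" client as register_scalable_target(**target) followed by put_scaling_policy(**policy).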

More information on deploying custom models with SageMaker can be found here.
