Deploying with DeepSparse on Amazon SageMaker

Amazon SageMaker offers an easy-to-use infrastructure for deploying deep learning models at scale. This directory provides a guided example for deploying a DeepSparse inference server on SageMaker for the question answering NLP task. Deployments benefit from both sparse-CPU acceleration with DeepSparse and automatic scaling from SageMaker.

Installation Requirements

The steps below can be completed using Python and Bash. The following credentials, tools, and libraries are also required:

  • AWS CLI version 2.X, configured. Verify that the region configured in your AWS CLI matches the region in the SparseMaker class found in the file. The default region is currently us-east-1.
  • The ARN of an AWS role with full SageMaker permissions.
    • AmazonSageMakerFullAccess
    • The following steps refer to this as ROLE_ARN. It should take the form "arn:aws:iam::XXX:role/service-role/XXX". In addition to the role permissions, make sure the AWS user who configured the AWS CLI has ECR and SageMaker permissions.
  • Docker and the docker CLI.
  • The boto3 Python AWS SDK (pip install boto3).
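
Because a malformed or placeholder ROLE_ARN only surfaces as an error once AWS is called, a quick local check can help. The following is a minimal sketch, not part of the example directory: the helper name looks_like_role_arn is hypothetical, and the pattern assumes the standard "aws" partition and a 12-digit account ID.

```python
import re

# Assumed shape of an IAM role ARN in the standard "aws" partition:
# arn:aws:iam::<12-digit account id>:role/<optional path/><role name>
_ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def looks_like_role_arn(value: str) -> bool:
    """Return True if value has the general shape of an IAM role ARN.

    This only checks the format; it does not verify that the role exists
    or that it carries SageMaker permissions.
    """
    return bool(_ROLE_ARN_PATTERN.match(value))
```

A value that still contains the XXX placeholders or the PLACEHOLDER string is rejected, since the account ID must be numeric.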

Quick Start

git clone
cd deepsparse/examples/aws-sagemaker
pip install -r requirements.txt

Before starting, replace the role_arn PLACEHOLDER string at the bottom of the SparseMaker class in the file with your AWS role ARN. Your ARN should look something like this: "arn:aws:iam::XXX:role/service-role/XXX"

Run the following command to build your SageMaker endpoint.

python create

After the endpoint has been staged (~1 minute), you can start making requests. Create a client by passing your endpoint's region name and endpoint name, then run inference by passing in your question and context:

from qa_client import Endpoint

qa = Endpoint("us-east-1", "question-answering-example-endpoint")
answer = qa.predict(question="who is batman?", context="Mark is batman.")

The answer is: b'{"score":0.6484262943267822,"answer":"Mark","start":0,"end":4}'

If you want to delete your endpoint, use:

python destroy

Continue reading to learn more about the files in this directory, the build requirements, and a descriptive step-by-step guide for launching a SageMaker endpoint.


In addition to the step-by-step instructions below, the directory contains files to aid in the deployment.


The included Dockerfile builds an image on top of the standard python:3.8 image with deepsparse installed, and creates an executable command serve that runs deepsparse.server on port 8080. SageMaker will execute this image by running docker run serve and expects the image to serve inference requests at the invocations/ endpoint.

For general customization of the server, changes should not need to be made to the Dockerfile but, instead, to the config.yaml file from which the Dockerfile reads.


config.yaml is used to configure DeepSparse Server running in the Dockerfile. The configuration must contain the line integration: sagemaker so endpoints may be provisioned correctly to match SageMaker specifications.

Notice that the model_path and task are set to run a sparse-quantized question answering model from SparseZoo. To use a model directory stored in S3, set model_path to /opt/ml/model in the configuration and add ModelDataUrl=<MODEL-S3-PATH> to the CreateModel arguments. SageMaker will automatically copy the files from the S3 path into /opt/ml/model, where the server can then read them.
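
As an illustration of the S3 option, the sketch below builds the CreateModel arguments with and without ModelDataUrl. The helper name build_create_model_kwargs is hypothetical. It assumes the boto3 request shape, in which ModelDataUrl is a field of the container definition rather than a top-level argument.

```python
from typing import Optional


def build_create_model_kwargs(
    model_name: str,
    image_uri: str,
    role_arn: str,
    model_data_url: Optional[str] = None,
) -> dict:
    """Build keyword arguments for a boto3 SageMaker create_model call."""
    container = {"Image": image_uri}
    if model_data_url is not None:
        # SageMaker copies the artifacts at this S3 path into /opt/ml/model
        container["ModelDataUrl"] = model_data_url
    return {
        "ModelName": model_name,
        "Containers": [container],
        "ExecutionRoleArn": role_arn,
        "EnableNetworkIsolation": False,
    }
```

The returned dictionary can be passed to a SageMaker client as create_model(**kwargs). When model_data_url is omitted, the container reads the model from the path set in config.yaml.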

This is a Bash script for pushing your local Docker image to the AWS ECR repository.

This file contains the SparseMaker object for automating the build of a SageMaker endpoint from a Docker image. You can customize the parameters of the class to match the preferred state of your deployment.

The qa_client module contains a client object, Endpoint, for making requests to the SageMaker inference endpoint for the question answering task.

Review DeepSparse Server for more information about the server and its configuration.

Deploying to SageMaker

The following steps are required to provision and deploy DeepSparse to SageMaker for inference:

  • Build the DeepSparse-SageMaker Dockerfile into a local docker image.
  • Create an Amazon ECR repository to host the image.
  • Push the image to the ECR repository.
  • Create a SageMaker Model that reads from the hosted ECR image.
  • Build a SageMaker EndpointConfig that defines how to provision the model deployment.
  • Launch the SageMaker Endpoint defined by the Model and EndpointConfig.
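
The last three steps can be condensed into one function, sketched here under a few assumptions: the image is already pushed, sm_client is a boto3 SageMaker client, and the function name deploy and its default values are illustrative choices based on the names used in this guide. The sections below walk through each call separately.

```python
def deploy(
    sm_client,
    image_uri,
    role_arn,
    model_name="question-answering-example",
    config_name="QuestionAnsweringExampleConfig",
    endpoint_name="question-answering-example-endpoint",
    variant_name="QuestionAnsweringDeepSparseDemo",
    instance_type="ml.c5.2xlarge",
    instance_count=1,
):
    """Create the Model, EndpointConfig, and Endpoint, in that order."""
    # Model: points SageMaker at the hosted ECR image
    sm_client.create_model(
        ModelName=model_name,
        Containers=[{"Image": image_uri}],
        ExecutionRoleArn=role_arn,
    )
    # EndpointConfig: defines what hardware serves the model
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": variant_name,
                "ModelName": model_name,
                "InitialInstanceCount": instance_count,
                "InstanceType": instance_type,
            }
        ],
    )
    # Endpoint: launches the deployment described by the config
    return sm_client.create_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=config_name
    )
```

The order matters: the EndpointConfig refers to the Model by name, and the Endpoint refers to the EndpointConfig by name.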

Building the DeepSparse-SageMaker Image Locally

From a Bash shell in this directory, build the Dockerfile using the following command. The image will be tagged locally as deepsparse-sagemaker-example.

docker build -t deepsparse-sagemaker-example .

Creating an ECR Repository

Use the following code snippet in Python to create an ECR repository. The region_name can be swapped to a preferred region. The repository will be named deepsparse-sagemaker. If the repository is already created, you may skip this step.

import boto3

ecr = boto3.client("ecr", region_name='us-east-1')
create_repository_res = ecr.create_repository(repositoryName="deepsparse-sagemaker")
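
Calling create_repository a second time raises an error when the repository already exists. The following sketch makes the step safe to re-run. The helper name ensure_repository is hypothetical, and ecr_client is assumed to be a boto3 ECR client such as the one created above.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository if needed and return its URI."""
    try:
        res = ecr_client.create_repository(repositoryName=name)
        return res["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository was created earlier; look up its URI instead
        res = ecr_client.describe_repositories(repositoryNames=[name])
        return res["repositories"][0]["repositoryUri"]
```

For example, ensure_repository(ecr, "deepsparse-sagemaker") returns the repository URI whether or not the repository existed beforehand.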

Pushing the Local Image to the ECR Repository

Once the image is built and the ECR repository is created, you can push the image using the following bash commands.

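account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
# ECR registry host for this account and region
ecr_account="${account}.dkr.ecr.${region}.amazonaws.com"
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account
# Full image name: registry host, the repository created above, and the tag
fullname="${ecr_account}/deepsparse-sagemaker:latest"
docker tag deepsparse-sagemaker-example:latest $fullname
docker push $fullname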

An abbreviated successful output will look like:

Login Succeeded
The push refers to repository []
3c2284f66840: Preparing
08fa02ce37eb: Preparing
a037458de4e0: Preparing
bafdbe68e4ae: Preparing
a13c519c6361: Preparing
6817758dd480: Waiting
6d95196cbe50: Waiting
e9872b0f234f: Waiting
c18b71656bcf: Waiting
2174eedecc00: Waiting
03ea99cd5cd8: Pushed
585a375d16ff: Pushed
5bdcc8e2060c: Pushed
latest: digest: sha256:XXX size: 3884
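
The pushed image is addressed by a name that combines the ECR registry host, the repository, and the tag. As a small sketch, the same name can be composed in Python for later steps. The function name ecr_image_uri is hypothetical, and the host format assumes a standard commercial AWS region.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Compose the full ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

With the repository from this guide, ecr_image_uri(account_id, "us-east-1", "deepsparse-sagemaker") gives the value to pass as the container image when creating the SageMaker Model.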

Creating a SageMaker Model

Create a SageMaker Model referencing the pushed image. The example model will be named question-answering-example. As mentioned in the requirements, ROLE_ARN should be the ARN string of an AWS role with full access to SageMaker.

import boto3

sm_boto3 = boto3.client("sagemaker", region_name="us-east-1")

region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]

# URI of the image pushed to the deepsparse-sagemaker ECR repository
image_uri = "{}.dkr.ecr.{}.amazonaws.com/deepsparse-sagemaker:latest".format(account_id, region)

create_model_res = sm_boto3.create_model(
    ModelName="question-answering-example",
    Containers=[
        {
            "Image": image_uri,
        },
    ],
    ExecutionRoleArn=ROLE_ARN,  # role ARN string from the requirements
    EnableNetworkIsolation=False,
)

Refer to AWS documentation for more information about options for configuring SageMaker Model instances.

Building a SageMaker EndpointConfig

The EndpointConfig sets the instance type to provision, the number of instances, scaling rules, and other deployment settings. The following code snippet defines an endpoint with a single ml.c5.2xlarge CPU instance (8 vCPUs).

model_name = "question-answering-example"  # model defined above
initial_instance_count = 1
instance_type = "ml.c5.2xlarge"  # 8 vcpus

variant_name = "QuestionAnsweringDeepSparseDemo"  # ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}

production_variants = [
    {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InitialInstanceCount": initial_instance_count,
        "InstanceType": instance_type,
    }
]

endpoint_config_name = "QuestionAnsweringExampleConfig"  # ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants,
}

endpoint_config_res = sm_boto3.create_endpoint_config(**endpoint_config)
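
The comments on variant_name and endpoint_config_name give the pattern that these names must match. A minimal sketch of a local check follows. The helper name is_valid_sagemaker_name is hypothetical, and the pattern is taken from those comments, applied to the whole string.

```python
import re

# Pattern from the comments above: starts with an alphanumeric character,
# hyphens allowed only between alphanumerics, at most 63 alphanumerics total
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_valid_sagemaker_name(name: str) -> bool:
    """Return True if the whole name matches the SageMaker naming pattern."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Names with underscores, or with a leading or trailing hyphen, fail the check, so a name such as question_answering would need to be rewritten with hyphens.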

Launching a SageMaker Endpoint

Once the EndpointConfig is defined, launch the endpoint using the create_endpoint command:

endpoint_name = "question-answering-example-endpoint"
endpoint_res = sm_boto3.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

After creating the endpoint, you can check its status by running the following. Initially, the EndpointStatus will be Creating. Once the image has launched successfully, it will be InService. If there are any errors, it will be Failed.

from pprint import pprint
pprint(sm_boto3.describe_endpoint(EndpointName=endpoint_name))
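
Rather than checking the status by hand, a script can poll until the endpoint leaves the Creating state. The sketch below is one way to do this, assuming a boto3 SageMaker client. The function name wait_for_endpoint and the default intervals are illustrative choices.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30.0, timeout_seconds=1800.0):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if the endpoint reports Failed, and TimeoutError
    if it is still not in service when the timeout elapses.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to launch")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        time.sleep(poll_seconds)
```

boto3 also ships a built-in waiter for this case, available through the client's get_waiter("endpoint_in_service"), which may be preferable when custom handling is not needed.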

Making a Request to the Endpoint

After the endpoint is in service, you can make requests to it through the invoke_endpoint API. Inputs will be passed as a JSON payload.

import json

import boto3

sm_runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

body = json.dumps(
    dict(
        question="Where do I live?",
        context="I am a student and I live in Cambridge",
    )
)

content_type = "application/json"
accept = "text/plain"

res = sm_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body,
    ContentType=content_type,
    Accept=accept,
)

# The response body is a stream; read it to get the raw answer bytes
print(res["Body"].read())
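
The request and the decoding of the response can be wrapped in one helper. This is a minimal sketch, assuming a boto3 sagemaker-runtime client and a JSON answer of the form shown in the Quick Start. The function name ask is hypothetical.

```python
import json


def ask(runtime_client, endpoint_name, question, context):
    """Send a question answering request and return the decoded answer dict."""
    body = json.dumps({"question": question, "context": context})
    res = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=body,
        ContentType="application/json",
        Accept="text/plain",
    )
    # The Body field is a stream of bytes holding the JSON answer
    return json.loads(res["Body"].read())
```

The returned dictionary exposes the fields of the server's answer, such as "answer" and "score", without further parsing by the caller.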


You can delete the model and endpoint with the following commands:
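
A minimal sketch of the deletion with boto3 follows, assuming a SageMaker client and the resource names used in the earlier steps. The function name delete_deployment is hypothetical.

```python
def delete_deployment(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    # Deleting the endpoint releases the provisioned instances
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

For the resources created in this guide, the call would be delete_deployment(sm_boto3, "question-answering-example-endpoint", "QuestionAnsweringExampleConfig", "question-answering-example"). The ECR repository and image are not removed by these calls.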


Next Steps

These steps create an invokable SageMaker inference endpoint powered by DeepSparse.
The EndpointConfig settings may be adjusted to set instance scaling rules based on deployment needs.
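
One way to attach scaling rules to a running endpoint variant is through the Application Auto Scaling service. The sketch below only builds the two request payloads, so it can be inspected without AWS access. The function name autoscaling_requests, the policy name, and the default capacity and target values are illustrative assumptions, not settings from this example.

```python
def autoscaling_requests(
    endpoint_name,
    variant_name,
    min_instances=1,
    max_instances=4,
    invocations_per_instance=100.0,
):
    """Build payloads for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{variant_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale so each instance handles roughly this many invocations per minute
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return target, policy
```

With a client created by boto3.client("application-autoscaling"), the payloads would be applied as register_scalable_target(**target) followed by put_scaling_policy(**policy). Suitable capacity limits and target values depend on the model and traffic, so the defaults here should be treated as placeholders.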

Refer to AWS documentation for more information on deploying custom models with SageMaker.
