Deploying With DeepSparse on GCP Cloud Run
GCP's Cloud Run is a serverless, event-driven environment for quickly deploying applications, including machine learning workloads, in a variety of programming languages. Since DeepSparse runs on commodity CPUs, you can deploy it on Cloud Run!
The DeepSparse GitHub repo contains a guided example for deploying a DeepSparse Pipeline on GCP Cloud Run for the token classification task.
Requirements
The listed steps can be completed using Python and Bash. The following tools and libraries are also required:
Before starting, replace the billing_id PLACEHOLDER with your own GCP billing ID at the bottom of the SparseRun class in the endpoint.py file. It should be alphanumeric and look something like this: XXXXX-XXXXX-XXXXX.
Your billing ID can be found in the BILLING menu of your GCP console, or you can run the following gcloud command to list all of your billing IDs:
gcloud beta billing accounts list
Installation
git clone https://github.com/neuralmagic/deepsparse.git
cd deepsparse/examples/google-cloud-run
Model Configuration
The current server configuration runs token classification. To alter the model, task, or other parameters (e.g., number of cores, workers, routes, or batch size), edit the config.yaml file.
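As a reference point, a server configuration of this kind typically looks like the sketch below. This is an illustrative example only: the field names follow the DeepSparse Server config schema as commonly documented, the exact contents of the repo's config.yaml may differ, and the model value is a placeholder you would replace with a real SparseZoo stub or local model path.

```yaml
# Illustrative sketch of a DeepSparse Server config for token classification.
# Values here are assumptions; consult the config.yaml shipped in the example repo.
num_cores: 2
num_workers: 2
endpoints:
  - task: token_classification
    route: /inference
    batch_size: 1
    model: <sparsezoo-stub-or-local-model-path>  # placeholder
```

Adjusting num_cores and num_workers here is how you trade off per-request latency against concurrent throughput on the Cloud Run instance.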
Create Endpoint
Run the following command to build the Cloud Run endpoint.
python endpoint.py create
Call Endpoint
After the endpoint has been staged (~3 minutes), the gcloud CLI will output the API Service URL. You can start making requests by passing this URL and its route (found in config.yaml) into the CloudRunClient object.
For example, if the Service URL is https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app and the route is /inference, the URL passed into the client would be: https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference
Afterward, call your endpoint by passing in the text input:
from client import CloudRunClient
CR = CloudRunClient("https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference")
answer = CR.client("Drive from California to Texas!")
print(answer)
You can also call the endpoint via a cURL command:
curl -X 'POST' \
'https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"inputs": [
"Drive from California to Texas!"
],
"is_split_into_words": false
}'
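The same request can also be sketched in plain Python using only the standard library, which is handy if cURL is unavailable. The payload below mirrors the cURL body above; the example Service URL is the one used throughout this guide (substitute your own), and the actual network call is shown in comments since it requires a live endpoint.

```python
import json

# Example Service URL from this guide; replace with your own deployment's URL.
SERVICE_URL = "https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference"

# Same JSON body as the cURL example above.
payload = {
    "inputs": ["Drive from California to Texas!"],
    "is_split_into_words": False,
}
body = json.dumps(payload).encode("utf-8")

# To actually send the request against a live endpoint:
# import urllib.request
# req = urllib.request.Request(
#     SERVICE_URL,
#     data=body,
#     headers={"accept": "application/json", "Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))

print(body.decode("utf-8"))
```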
Note that the first request after a cold start takes ~60 seconds to return an inference; subsequent requests should complete in milliseconds.
Delete Endpoint
If you want to delete the Cloud Run endpoint, run:
python endpoint.py destroy