GCP's Cloud Run is a serverless, event-driven environment for quickly deploying applications, including machine learning workloads, in a variety of programming languages. Since DeepSparse runs on commodity CPUs, you can deploy DeepSparse on Cloud Run!
The DeepSparse GitHub repo contains a guided example for deploying a DeepSparse Pipeline on GCP Cloud Run for the token classification task.
The listed steps can be easily completed using Python and Bash. The following tools and libraries are also required:
Before starting, replace the billing_id placeholder at the bottom of the SparseRun class in the endpoint.py file with your own GCP billing ID. It should be alphanumeric and look something like this: XXXXX-XXXXX-XXXXX.
Your billing ID can be found in the BILLING menu of your GCP console, or you can run the following gcloud command to list all of your billing IDs:
gcloud beta billing accounts list
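As a quick sanity check before deploying, you can verify that the value you paste into endpoint.py matches the XXXXX-XXXXX-XXXXX shape described above. This is a hypothetical helper, not part of the repo; adjust the pattern if your billing account ID uses different group lengths:

```python
import re

# Matches the XXXXX-XXXXX-XXXXX shape described above: three groups of
# five alphanumeric characters separated by hyphens. This is an assumed
# pattern -- tweak it if your billing account ID is shaped differently.
BILLING_ID_PATTERN = re.compile(r"^[A-Z0-9]{5}-[A-Z0-9]{5}-[A-Z0-9]{5}$")

def looks_like_billing_id(candidate: str) -> bool:
    """Return True if the string has the expected billing-ID shape."""
    return BILLING_ID_PATTERN.fullmatch(candidate) is not None
```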
```shell
git clone https://github.com/neuralmagic/deepsparse.git
cd deepsparse/examples/google-cloud-run
```
The current server configuration runs token classification. To alter the model, task, or other parameters (e.g., number of cores, workers, routes, or batch size), edit the config.yaml file.
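To give a feel for the kind of settings involved, here is a rough, hypothetical sketch of such a configuration. The field names and placeholder model stub below are assumptions, not copied from the repo; check the repo's config.yaml for the real schema:

```yaml
# Hypothetical sketch -- see the example's config.yaml for the real fields
num_cores: 2        # CPU cores DeepSparse may use
num_workers: 2      # concurrent server workers
endpoints:
  - task: token_classification   # the task this guide deploys
    route: /inference            # the route referenced later in this guide
    model: <sparsezoo-model-stub-or-local-path>
    batch_size: 1
```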
Run the following command to build the Cloud Run endpoint:
python endpoint.py create
After the endpoint has been staged (~3 minutes), the gcloud CLI will output the API Service URL. You can start making requests by passing this URL and its route (found in config.yaml) into the CloudRunClient object.
For example, if the Service URL is https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app and the route is /inference, the URL passed into the client would be: https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference
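Joining the two pieces is plain string concatenation; a small sketch using a hypothetical helper (build_endpoint_url is not part of the repo) that avoids a doubled or missing slash:

```python
def build_endpoint_url(service_url: str, route: str) -> str:
    """Join the Service URL and the route without doubling the slash."""
    return service_url.rstrip("/") + "/" + route.lstrip("/")

# The example Service URL and route from above
url = build_endpoint_url(
    "https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app", "/inference"
)
print(url)  # https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference
```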
Afterwards, call your endpoint by passing in the text input:
```python
from client import CloudRunClient

CR = CloudRunClient("https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference")
answer = CR.client("Drive from California to Texas!")
print(answer)
```
```python
[{'entity': 'LABEL_0', 'word': 'drive', ...},
 {'entity': 'LABEL_0', 'word': 'from', ...},
 {'entity': 'LABEL_5', 'word': 'california', ...},
 {'entity': 'LABEL_0', 'word': 'to', ...},
 {'entity': 'LABEL_5', 'word': 'texas', ...},
 {'entity': 'LABEL_0', 'word': '!', ...}]
```
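In this sample output, LABEL_5 appears to mark the location tokens ("california", "texas"). A small sketch of filtering those out of the response, using the example predictions above (other keys in each dict are omitted):

```python
# Sample predictions in the shape shown above (extra keys omitted)
predictions = [
    {"entity": "LABEL_0", "word": "drive"},
    {"entity": "LABEL_0", "word": "from"},
    {"entity": "LABEL_5", "word": "california"},
    {"entity": "LABEL_0", "word": "to"},
    {"entity": "LABEL_5", "word": "texas"},
    {"entity": "LABEL_0", "word": "!"},
]

# Keep only the tokens tagged LABEL_5 (the location label in this sample)
locations = [p["word"] for p in predictions if p["entity"] == "LABEL_5"]
print(locations)  # ['california', 'texas']
```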
You can also call the endpoint via a cURL command:
```shell
curl -X 'POST' \
  'https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "inputs": [
    "Drive from California to Texas!"
  ],
  "is_split_into_words": false
}'
```
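The same request can also be sent from Python with only the standard library. This is a sketch, not code from the repo: it builds the same JSON body as the cURL call, targeting the example Service URL above, and leaves the actual send commented out so nothing fires until your endpoint is live:

```python
import json
import urllib.request

# Same JSON body as the cURL call above
payload = {
    "inputs": ["Drive from California to Texas!"],
    "is_split_into_words": False,
}

# Build the POST request against the example endpoint URL from this guide
req = urllib.request.Request(
    "https://deepsparse-cloudrun-qsi36y4uoa-ue.a.run.app/inference",
    data=json.dumps(payload).encode("utf-8"),
    headers={"accept": "application/json", "Content-Type": "application/json"},
    method="POST",
)

# Uncomment to actually send the request once your endpoint is deployed:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```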
FYI, on the first cold start, it will take ~60 seconds to get your first inference; afterwards, responses should return in milliseconds.
If you want to delete the Cloud Run endpoint, run:
python endpoint.py destroy