This page explains how to deploy an Embedding Extraction Pipeline with DeepSparse.
This section requires the DeepSparse Server Install.
Confirm your machine is compatible with our hardware requirements.
The Embedding Extraction Pipeline enables you to generate embeddings in any domain, meaning you can use it with any ONNX model. It (optionally) removes the projection head from the model, such that you can re-use SparseZoo models and custom models you have trained in the embedding extraction scenario.
There are two options for passing a model to the Embedding Extraction Pipeline: a SparseZoo stub, or a local path to an ONNX file.
DeepSparse provides both a Pipeline API and an out-of-the-box model server that can be used for end-to-end inference in either Python applications or as an HTTP endpoint. Both options provide similar specifications for configurations.
Pipelines are the default interface for running inference with DeepSparse.
Once a model is obtained, either through SparseML training or directly from SparseZoo, Pipelines can be used to handle pre-processing and post-processing of input, making it easy to add DeepSparse to your application.
As an alternative to the Python API, DeepSparse Server allows you to serve an Embedding Extraction Pipeline over HTTP. Configuring the server uses the same parameters and schemas as the Pipelines.
Once launched, a /docs endpoint is created with full endpoint descriptions and support for making sample requests. For more details, check out the documentation for DeepSparse Server.
The following section includes example usage of the Pipeline and Server APIs for various use cases.
Each example uses a SparseZoo stub to pull down the model, but a local path to an ONNX file can also be passed as the model_path argument (model in the Server configuration).
The Embedding Extraction Pipeline handles some useful actions around inference:
First, on initialization, the Pipeline (optionally) removes a projection head from a model. You can use the emb_extraction_layer argument to specify which layer to return.
If your ONNX model has no projection head, you can set emb_extraction_layer=None (the default) to skip this step.
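To make the indexing concrete, here is a toy sketch in plain Python. It is not how DeepSparse edits the ONNX graph; the layer names, the `run_until` helper, and the stand-in functions are all hypothetical. It only illustrates the idea that a negative index such as -3 counts back from the end of the model, and that None leaves the model untouched.

```python
# Toy illustration only: a "model" as an ordered list of named layers.
# Every name here is hypothetical; this is not DeepSparse's implementation.
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


LAYERS = [
    ("backbone", lambda x: [v * 2.0 for v in x]),                      # stand-in feature extractor
    ("pool", lambda x: [sum(x) / len(x), max(x)]),                     # produces the "embedding"
    ("projection_head", lambda x: [x[0] + x[1], x[0] - x[1], x[0]]),   # class logits
    ("softmax", softmax),                                              # class probabilities
]


def run_until(layers, x, emb_extraction_layer=None):
    """Run layers in order and stop after the layer at the given index.

    None runs the whole model unchanged.
    """
    if emb_extraction_layer is None:
        kept = layers
    else:
        last = emb_extraction_layer % len(layers)  # resolve negative indices
        kept = layers[: last + 1]
    for _name, fn in kept:
        x = fn(x)
    return x


full_output = run_until(LAYERS, [1.0, 2.0, 3.0])                          # 3 class probabilities
embedding = run_until(LAYERS, [1.0, 2.0, 3.0], emb_extraction_layer=-3)  # output of "pool"
print(embedding)  # [4.0, 6.0]
```

In this toy model the last two layers are the projection head and the softmax, so -3 selects the last layer before them. Which index is right for a real model depends on how its graph is laid out.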
Second, as with all DeepSparse Pipelines, it handles pre-processing such that you can pass raw input.
You will notice that in addition to the typical task argument used in Pipeline.create(), the Embedding Extraction Pipeline includes a base_task argument. This argument tells the Pipeline the domain of the model, so that the Pipeline can determine which pre-processing to apply.
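As a rough picture of what image pre-processing can involve, the sketch below normalizes an image with the standard ImageNet channel means and standard deviations using NumPy. The `normalize_image` helper, the 224x224 input size, and the channels-first layout are assumptions made for illustration; the actual steps DeepSparse applies for a given base_task may differ.

```python
import numpy as np

# Standard ImageNet channel statistics (RGB, for pixel values scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(image_uint8):
    """Convert an HxWx3 uint8 RGB image to a normalized 3xHxW float32 array.

    Hypothetical helper for illustration; not part of the DeepSparse API.
    """
    scaled = image_uint8.astype(np.float32) / 255.0        # [0, 255] -> [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD   # per-channel standardization
    return np.transpose(normalized, (2, 0, 1))             # channels-first layout


# A mid-gray dummy image standing in for a decoded JPEG (assumed 224x224).
dummy = np.full((224, 224, 3), 128, dtype=np.uint8)
model_input = normalize_image(dummy)
print(model_input.shape)  # (3, 224, 224)
```

With the Pipeline, none of this needs to be written by hand: passing base_task lets you hand over a raw image path instead.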
This is an example of extracting the last layer from ResNet-50:
Download an image to use with the Pipeline and save it as goldfish.jpg, the filename the example expects.
```python
from deepsparse import Pipeline

# this step removes the projection head before compiling the model
rn50_embedding_pipeline = Pipeline.create(
    task="embedding-extraction",
    base_task="image-classification",  # tells the pipeline to expect images and normalize input with ImageNet means/stds
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/channel20_pruned75_quant-none-vnni",
    emb_extraction_layer=-3,  # extracts last layer before projection head and softmax
)

# this step runs pre-processing, inference and returns an embedding
embedding = rn50_embedding_pipeline(images="goldfish.jpg")
print(len(embedding.embeddings))

# Output: 2048
```
With Transformers, you can use task="transformers_embedding_extraction" for some extra utilities associated with embedding extraction.
The first utility is automatic embedding layer detection. If you set emb_extraction_layer=-1 (the default), the Pipeline automatically detects the final transformer layer before the projection head and removes the projection head for you.
The second utility is automatic dimensionality reduction. You can use the extraction_strategy argument to perform a reduction on the sequence dimension rather than returning an embedding for each token. The options are:
per_token: returns the embedding for each token in the sequence (default)
reduce_mean: returns the mean of the token embeddings along the sequence dimension
reduce_max: returns the maximum of the token embeddings along the sequence dimension
cls_token: returns the embedding of the cls token from the sequence
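The four strategies can be sketched with NumPy on a random array standing in for the model's token embeddings. This is an illustration of the reductions, not DeepSparse's code: the `reduce_embeddings` helper is hypothetical, the shape (128 tokens, hidden size 768) is assumed to match the BERT-base examples in this section, the cls token is assumed to sit at position 0 as in BERT, and any special handling of padding tokens is ignored.

```python
import numpy as np

# Random stand-in for per-token embeddings: (sequence_length, hidden_size).
rng = np.random.default_rng(0)
seq_len, hidden_size = 128, 768
token_embeddings = rng.standard_normal((seq_len, hidden_size)).astype(np.float32)


def reduce_embeddings(tokens, extraction_strategy="per_token"):
    """Hypothetical helper mirroring the strategy names described above."""
    if extraction_strategy == "per_token":
        return tokens.reshape(-1)   # concatenate every token's embedding
    if extraction_strategy == "reduce_mean":
        return tokens.mean(axis=0)  # average over the sequence dimension
    if extraction_strategy == "reduce_max":
        return tokens.max(axis=0)   # element-wise max over the sequence dimension
    if extraction_strategy == "cls_token":
        return tokens[0]            # assumes the cls token is first, as in BERT
    raise ValueError(f"unknown extraction_strategy: {extraction_strategy!r}")


strategies = ("per_token", "reduce_mean", "reduce_max", "cls_token")
sizes = {s: reduce_embeddings(token_embeddings, s).size for s in strategies}
print(sizes)
# {'per_token': 98304, 'reduce_mean': 768, 'reduce_max': 768, 'cls_token': 768}
```

Only per_token grows with the sequence length (128 * 768 = 98304 values); the other three always return a vector of hidden_size values, which is usually easier to store and compare.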
An example using automatic embedding layer detection looks like this:
```python
from deepsparse import Pipeline

bert_emb_pipeline = Pipeline.create(
    task="transformers_embedding_extraction",
    model_path="zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni",
    # emb_extraction_layer=-1,         # (default: detect last layer)
    # extraction_strategy="per_token"  # (default: concat embedding for each token)
)

input_sequence = "The generalized embedding extraction Pipeline is the best!"
embedding = bert_emb_pipeline(input_sequence)
print(len(embedding.embeddings))

# Output: 98304 << 98304 = 768 * 128 = hidden_size * sequence_length
```
An example returning the average of the token embeddings looks like this:
```python
from deepsparse import Pipeline

bert_emb_pipeline = Pipeline.create(
    task="transformers_embedding_extraction",
    model_path="zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni",
    # emb_extraction_layer=-1,  # (default: detect last layer)
    extraction_strategy="reduce_mean",
)

input_sequence = "The generalized embedding extraction Pipeline is the best!"
embedding = bert_emb_pipeline(input_sequence)
print(len(embedding.embeddings))

# Output: 768 << reduced along sequence dimension
```
The HTTP Server is a wrapper around the Pipeline. Therefore, it inherits all of the functionality we described above. It can be launched from the command line with a configuration file.
```yaml
# config.yaml
endpoints:
  - task: embedding_extraction
    model: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none
    kwargs:
      base_task: image_classification
      emb_extraction_layer: -3
```
Download the config.yaml file and spin up the server:
```bash
wget https://raw.githubusercontent.com/neuralmagic/docs/rs/embedding-extraction-feature/files-for-examples/use-cases/embedding-extraction/config.yaml
deepsparse.server --config_file config.yaml
```
Making a request:
```python
import json

import requests

url = "http://0.0.0.0:5543/predict/from_files"
paths = ["goldfish.jpg"]

# send each image as a multipart file under the "request" field
files = [("request", open(img, "rb")) for img in paths]
resp = requests.post(url=url, files=files)
result = json.loads(resp.text)

print(len(result["embeddings"]))

# Output: 2048
```