Deploying Embedding Extraction Models With DeepSparse

This page explains how to deploy an Embedding Extraction Pipeline with DeepSparse.

Installation Requirements

This section requires the DeepSparse Server Install.
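For reference, the server components can be installed with pip (see the DeepSparse Server Install page for full instructions):

pip install deepsparse[server]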

Getting Started

Confirm your machine is compatible with our hardware requirements.

Model Format

The Embedding Extraction Pipeline enables you to generate embeddings in any domain, meaning you can use it with any ONNX model. It (optionally) removes the projection head from the model, such that you can re-use SparseZoo models and custom models you have trained in the embedding extraction scenario.

There are two options for passing a model to the Embedding Extraction Pipeline, as sketched below:

  • Pass a Local ONNX File
  • Pass a SparseZoo Stub (which identifies an ONNX model in the SparseZoo)
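
As a minimal sketch of both options (the SparseZoo stub matches the ResNet-50 example later on this page; local_model.onnx is a hypothetical local file):

from deepsparse import Pipeline

# Option 1: a SparseZoo stub identifying an ONNX model in the SparseZoo
zoo_pipeline = Pipeline.create(
    task="embedding-extraction",
    base_task="image-classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/channel20_pruned75_quant-none-vnni",
)

# Option 2: a path to a local ONNX file ("local_model.onnx" is hypothetical)
local_pipeline = Pipeline.create(
    task="embedding-extraction",
    base_task="image-classification",
    model_path="local_model.onnx",
)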

Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model server, so you can run end-to-end inference either inside a Python application or behind an HTTP endpoint. Both options are configured with similar parameters and schemas.

Python API

Pipelines are the default interface for running inference with DeepSparse.

Once a model is obtained, either through SparseML training or directly from SparseZoo, Pipelines can be used to handle pre-processing and post-processing of input, making it easy to add DeepSparse to your application.

HTTP Server

As an alternative to the Python API, DeepSparse Server allows you to serve an Embedding Extraction Pipeline over HTTP. Configuring the server uses the same parameters and schemas as the Pipelines. Once launched, a /docs endpoint is created with full endpoint descriptions and support for making sample requests. For more details, check out the documentation for DeepSparse Server.

Deployment Examples

The following section includes example usage of the Pipeline and Server APIs for various use cases. Each example uses a SparseZoo stub to pull down the model, but a local path to an ONNX file can also be passed as the model_path.

Python API

The Embedding Extraction Pipeline handles some useful actions around inference:

  • First, on initialization, the Pipeline (optionally) removes a projection head from a model. You can use the emb_extraction_layer argument to specify which layer to return. If your ONNX model has no projection head, you can set emb_extraction_layer=None (the default) to skip this step.

  • Second, as with all DeepSparse Pipelines, it handles pre-processing such that you can pass raw input. You will notice that in addition to the typical task argument used in Pipeline.create(), the Embedding Extraction Pipeline includes a base_task argument. This argument tells the Pipeline the domain of the model, such that the Pipeline can figure out what pre-processing to do.

This is an example of extracting the last layer from ResNet-50:

Download an image to use with the Pipeline.

wget https://raw.githubusercontent.com/neuralmagic/docs/rs/embedding-extraction-feature/files-for-examples/use-cases/embedding-extraction/goldfish.jpg
from deepsparse import Pipeline

# this step removes the projection head before compiling the model
rn50_embedding_pipeline = Pipeline.create(
    task="embedding-extraction",
    base_task="image-classification",  # tells the pipeline to expect images and normalize input with ImageNet means/stds
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/channel20_pruned75_quant-none-vnni",
    emb_extraction_layer=-3,  # extracts last layer before projection head and softmax
)

# this step runs pre-processing, inference and returns an embedding
embedding = rn50_embedding_pipeline(images="goldfish.jpg")
print(len(embedding.embeddings[0][0]))

# Output: 2048
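
As a quick illustration of how the output might be used (a sketch, not part of the original example; "second_image.jpg" is a hypothetical local file), the extracted vectors can be compared with cosine similarity:

import numpy as np
from deepsparse import Pipeline

rn50_embedding_pipeline = Pipeline.create(
    task="embedding-extraction",
    base_task="image-classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/channel20_pruned75_quant-none-vnni",
    emb_extraction_layer=-3,
)

# "second_image.jpg" is a hypothetical stand-in for any local image file
emb_a = np.array(rn50_embedding_pipeline(images="goldfish.jpg").embeddings[0][0])
emb_b = np.array(rn50_embedding_pipeline(images="second_image.jpg").embeddings[0][0])

# cosine similarity: values near 1.0 indicate visually similar images
cos_sim = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
print(cos_sim)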

Pipeline API: Transformers

With Transformers models, you can use task=transformers_embedding_extraction to get some extra utilities associated with embedding extraction.

The first utility is automatic embedding layer detection. If you set emb_extraction_layer=-1 (the default), the Pipeline automatically detects the final transformer layer before the projection head and removes the projection head for you.

The second utility is automatic dimensionality reduction. You can use the extraction_strategy argument to reduce along the sequence dimension rather than returning an embedding for each token. The options are:

  • per_token: returns an embedding for each token in the sequence (default)
  • reduce_mean: returns the mean of the token embeddings
  • reduce_max: returns the element-wise maximum over the token embeddings
  • cls_token: returns the embedding of the [CLS] token

An example using automatic embedding layer detection looks like this:

from deepsparse import Pipeline

bert_emb_pipeline = Pipeline.create(
    task="transformers_embedding_extraction",
    model_path="zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni",
    # emb_extraction_layer=-1,  # (default: detect last layer)
    # extraction_strategy="per_token",  # (default: concat embedding for each token)
)

input_sequence = "The generalized embedding extraction Pipeline is the best!"
embedding = bert_emb_pipeline(input_sequence)
print(len(embedding.embeddings[0]))

# Output: 98304 = 768 * 128 = hidden_size * sequence_length

An example returning the average embedding of the tokens looks like this:

from deepsparse import Pipeline

bert_emb_pipeline = Pipeline.create(
    task="transformers_embedding_extraction",
    model_path="zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni",
    # emb_extraction_layer=-1,  # (default: detect last layer)
    extraction_strategy="reduce_mean",
)

input_sequence = "The generalized embedding extraction Pipeline is the best!"
embedding = bert_emb_pipeline(input_sequence)
print(len(embedding.embeddings[0]))

# Output: 768 (reduced along the sequence dimension)
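
The reduce_max and cls_token strategies follow the same pattern: swap in the desired extraction_strategy string when creating the Pipeline, and the result is again a single hidden_size-length vector per input.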

HTTP Server

The HTTP Server is a wrapper around the Pipeline. Therefore, it inherits all of the functionality we described above. It can be launched from the command line with a configuration file.

Config file:

# config.yaml
endpoints:
  - task: embedding_extraction
    model: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none
    kwargs:
      base_task: image_classification
      emb_extraction_layer: -3

Download the config.yaml file and spin up the server:

wget https://raw.githubusercontent.com/neuralmagic/docs/rs/embedding-extraction-feature/files-for-examples/use-cases/embedding-extraction/config.yaml
deepsparse.server --config_file config.yaml

Making a request:

import requests, json

url = "http://0.0.0.0:5543/predict/from_files"
paths = ["goldfish.jpg"]
files = [("request", open(img, "rb")) for img in paths]
resp = requests.post(url=url, files=files)
result = json.loads(resp.text)

print(len(result["embeddings"][0][0]))

# Output: 2048
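
The same pass-through of Pipeline kwargs works for the Transformers task. As a hypothetical sketch (not an official example; it reuses the BERT stub from the Python examples above):

# config.yaml
endpoints:
  - task: transformers_embedding_extraction
    model: zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni
    kwargs:
      extraction_strategy: reduce_mean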