Deploy LLMs With DeepSparse
DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with SparseML, our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.
Neural Magic supports performant LLM inference in DeepSparse with:
- Sparse kernels for faster inference and memory savings from unstructured sparse weights
- 8-bit weight and activation quantization support
- Efficient management of cached attention keys and values for minimal latency
- Continuous batching to optimize output token generation throughput
- Streaming outputs
- OpenAI-compatible API server
Here's a minimal example showing how to use DeepSparse for text generation with `TinyStories-1M`, a very small model for generating stories.
```python
from deepsparse import TextGeneration

# Create a text generation pipeline from a Hugging Face model stub
pipeline = TextGeneration(model="hf:mgoin/TinyStories-1M-ds")

print(pipeline("Once upon a time, ").generations[0].text)
"""
One day, a little girl named Lily went to the park with her mommy. They saw a big slide and wanted to slide down the slide. Lily said, "Mommy, can I go on the slide?" Her mommy said, "Yes, you can go on the slide."
"""
```
Check out the [`TextGeneration` documentation for usage details](ADD LINK).
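Streaming outputs, listed among the features above, let you consume tokens as they are produced instead of waiting for the full generation. The exact keyword may vary by DeepSparse version; the sketch below assumes the pipeline call accepts a `streaming` flag and yields partial results:

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:mgoin/TinyStories-1M-ds")

# Assumes the pipeline call supports streaming=True and yields partial outputs;
# check the TextGeneration docs for your installed DeepSparse version to confirm.
for partial in pipeline("Once upon a time, ", streaming=True):
    print(partial.generations[0].text, end="", flush=True)
print()
```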
Supported LLM Architectures
DeepSparse supports many Hugging Face models via ONNX export through SparseML, including the following architectures:
- LLaMA & LLaMA-2 - neuralmagic/Llama2-7b-chat-pruned50-quant-ds, neuralmagic/Nous-Hermes-llama-2-7b-pruned50-quant-ds, neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds - SparseZoo Models
- Mistral - neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50-quant-ds - SparseZoo Models
- MPT - neuralmagic/mpt-7b-chat-pruned50-quant-ds - SparseZoo Models
- OPT - facebook/opt-6.7b, etc. - SparseZoo Models
- SOLAR - neuralmagic/Nous-Hermes-2-SOLAR-10.7B-pruned50-quant-ds
Making New DeepSparse-Optimized Models
See the guide for compressing LLMs with SparseGPT.
Offline Batched Inference
A notable feature of the DeepSparse `TextGeneration` class is the ability to specify `continuous_batch_sizes`, which enables efficient batch processing of multiple prompts at once, improving resource usage and token generation throughput.
This example passes a set of diverse prompts and uses `continuous_batch_sizes=[4]` so that output tokens are generated for four prompt requests at a time.
```python
from deepsparse import TextGeneration

# Enable continuous batching with a batch size of 4
model = TextGeneration(
    model_path="hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds",
    continuous_batch_sizes=[4],
)

prompts = [
    "Beneath the ancient oak tree ",
    "In a world where time flows backwards ",
    "When the last star in the universe flickered ",
    "Inside the labyrinth of endless mirrors ",
    "Under the neon lights of a forgotten city ",
    "As the clock struck midnight in the enchanted forest ",
    "Amidst the whispers of a haunted library ",
    "On the edge of a dream, where reality blurs ",
]

# Generate up to 50 new tokens for each prompt
outputs = model(prompt=prompts, max_new_tokens=50)

for i, gen in enumerate(outputs.generations):
    print(f"#{i}: {prompts[i]}{gen.text}")
```
OpenAI-Compatible Server
A DeepSparse LLM can be deployed as a server that implements the OpenAI API protocol. This allows DeepSparse to serve as a drop-in replacement for applications that use the OpenAI API.
By default, the server starts at `http://localhost:5543`. You can change the address with the `--host` and `--port` arguments. The server currently hosts one model at a time (TinyLlama-Chat in the command below) and implements the list models, create chat completion, and create completion endpoints.
Start the server:
```bash
deepsparse.server --task text-generation --integration openai --model_path hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
```
This server can be queried in the same format as the OpenAI API. For example, list the models:
```bash
curl http://localhost:5543/v1/models
```
Using OpenAI Completions API With DeepSparse
Query the model with input prompts:
```bash
curl http://localhost:5543/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```
Since this server is compatible with the OpenAI API, any client that speaks it will work. For example, another way to query the server is via the `openai` Python package:
```python
from openai import OpenAI

# Point the OpenAI client at the local DeepSparse server
client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")

# Use the first (and only) model hosted by the server
model = client.models.list().data[0].id
print(f"Accessing model API '{model}'")

completion = client.completions.create(model=model, prompt="San Francisco is a")
print("Completion result:", completion)
```
Using OpenAI Chat API With DeepSparse
The DeepSparse Server also supports the OpenAI Chat API, allowing you to hold dynamic conversations with the model. The chat interface enables back-and-forth exchanges whose history is sent along with each request, which is useful for tasks that require context or more detailed explanations.
Use the create chat completion endpoint to query the model in a chat-like interface:
```bash
curl http://localhost:5543/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```
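The same endpoint can also be reached from the `openai` Python package. Here is a short sketch of a multi-turn exchange, assuming the server started above is running locally; the follow-up question text is just an illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")
model = client.models.list().data[0].id

# Keep the conversation history in the messages list; each turn appends to it
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
]
response = client.chat.completions.create(model=model, messages=messages)
answer = response.choices[0].message.content
print(answer)

# Follow-up question, with the prior exchange included for context
messages += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Where was it played?"},
]
response = client.chat.completions.create(model=model, messages=messages)
print(response.choices[0].message.content)
```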
For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.