Deploying LLMs
This guide focuses on the deployment of LLMs for text-generation tasks. You'll learn how to:
- Create Pipelines: Integrate sparsified LLMs from SparseZoo into your deployments at the Python API level.
- Set Up Servers: Run LLMs as performant HTTP services using DeepSparse Server.
- Benchmark Performance: Measure and compare the speed and accuracy of sparsified vs. baseline models.
Prerequisites
- Deployment Environment: A system that meets the minimum hardware and software requirements as outlined in the Install Guide.
- DeepSparse LLM Installation: An environment with DeepSparse for LLMs installed as outlined in the Install Guide.
- Background: Familiarity with Generative AI and deploying ML models through Python and HTTP APIs is recommended.
Deploying a Sparsified Llama Model
We'll use a sparsified Llama 2 7B model (chat-focused) from the SparseZoo to demonstrate the deployment process. This model is pruned to 50% sparsity and quantized to 8 bits for weights and activations, resulting in a smaller, faster, and more efficient model. The model is referenced by the following SparseZoo stub:
zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized
For other models that work with these examples, browse through the Generative AI models in the SparseZoo to find one that fits your needs.
Pipeline
In this section, you'll learn how to directly integrate a sparsified LLM from SparseZoo into your Python code, enabling text generation capabilities within your application. DeepSparse Pipelines are designed to mirror the Hugging Face Transformers API closely, ensuring a familiar experience if you've worked with Transformers before. The following code demonstrates how to create a pipeline for text generation using the sparsified Llama 2 7B model:
from deepsparse import TextGeneration

# Download and compile the sparsified model from SparseZoo
pipeline = TextGeneration(
    "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized"
)

# Generate text from a prompt and print the full pipeline output
result = pipeline("Large language models are")
print(result)
The resulting output printed to the console will be the generated text from the model based on the input prompt.
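If you want to work with just the generated text or control the output length, the sketch below shows one way to do it. It assumes the pipeline accepts max_new_tokens as a keyword argument and exposes the generated text through a generations list, as in other DeepSparse text-generation examples; check the pipeline documentation for the exact fields available in your version.
from deepsparse import TextGeneration

pipeline = TextGeneration(
    "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized"
)

# max_new_tokens (assumed keyword argument) caps the length of the generated continuation
result = pipeline("Large language models are", max_new_tokens=64)

# Print only the generated text rather than the full output object (assumed field layout)
print(result.generations[0].text)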
Server
To make your LLM accessible as a web service, you'll wrap it in a DeepSparse Server. The Server lets you interact with the model over HTTP, making it easy to integrate with web applications, microservices, or other systems. DeepSparse Server provides an OpenAI-compatible integration, so its request and response formats follow the OpenAI API. The following command starts a DeepSparse Server with the sparsified Llama 2 7B model:
deepsparse.server \
  --integration openai \
  "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized"
With the server running, you can send an HTTP request that conforms to the OpenAI spec to generate text. Below are examples of using Python and curl to send a request to the server:
- Python
- Bash
import requests
import json

url = "http://localhost:5543/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized",
    "messages": [{"role": "user", "content": "Large language models are"}],
    "stream": True,
}

# stream=True on the request so the streamed response is read as it arrives
response = requests.post(url, headers=headers, data=json.dumps(data), stream=True)
if response.status_code == 200:
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))  # Print each streamed chunk as it is received
else:
    print("Request failed with status code:", response.status_code)
curl http://localhost:5543/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized",
    "messages": [{"role": "user", "content": "Large language models are"}],
    "stream": true
  }'
The resulting output will be the generated text from the model based on the input prompt.
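Because the Server follows the OpenAI request and response formats, you can also call it with the official openai Python client. The sketch below assumes the server's compatibility extends to the standard chat completions response shape and that the api_key value is ignored by the local server; adjust the port if you started the server with different settings.
from openai import OpenAI

# Point the OpenAI client at the local DeepSparse Server (assumed to ignore the API key)
client = OpenAI(base_url="http://localhost:5543/v1", api_key="unused")

response = client.chat.completions.create(
    model="zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized",
    messages=[{"role": "user", "content": "Large language models are"}],
)

# Standard OpenAI response shape: the generated text lives in choices[0].message.content
print(response.choices[0].message.content)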
Performance
It's crucial to assess the performance of your deployed LLM. Neural Magic provides tools for benchmarking speed (e.g., tokens per second) and evaluating accuracy using established metrics like perplexity.
Benchmarking
Baseline
To put the sparsified model's performance in context, we use a baseline, unoptimized version of the model from the SparseZoo. The corresponding stub is:
zoo:llama2-7b-llama2_pretrain-base
The following commands use the baseline stub and DeepSparse to measure the unoptimized model's performance:
- Python
- Bash
from deepsparse import benchmark_model
result = benchmark_model("zoo:llama2-7b-llama2_pretrain-base")
print(result)
deepsparse.benchmark "zoo:llama2-7b-llama2_pretrain-base"
On an 8-core AMD CPU, the baseline model achieves a throughput of around 2.7 tokens per second.
Sparsified
The following commands use the sparsified stub and DeepSparse to measure the optimized model's performance:
- Python
- Bash
from deepsparse import benchmark_model
result = benchmark_model("zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized")
print(result)
deepsparse.benchmark "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized"
On an 8-core AMD CPU, the sparsified model achieves a throughput of around 13.1 tokens per second, roughly 4.9 times faster than the baseline model (13.1 ÷ 2.7 ≈ 4.9).
Accuracy
The following commands use the sparsified model and DeepSparse to measure the optimized model's perplexity on the OpenAI HumanEval dataset:
- Python
- Bash
from deepsparse import evaluate

# Evaluate the sparsified model's perplexity on the OpenAI HumanEval dataset
results = evaluate(
    "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized",
    datasets="openai_humaneval",
    integration="perplexity",
)
print(results)
deepsparse.evaluate \
  "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned50_quantized" \
  --datasets "openai_humaneval" \
  --integration "perplexity"
The above command results in a perplexity of around 3.6 on the OpenAI HumanEval dataset (lower perplexity indicates the model predicts the dataset's text more accurately).
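For a direct comparison, you can run the same evaluation against the unoptimized baseline. The sketch below reuses the evaluate API shown above with the baseline stub; it assumes the dense baseline model is supported by the same evaluation pathway.
from deepsparse import evaluate

# Measure the baseline model's perplexity on the same dataset for comparison
# (assumes the dense baseline stub works with the same evaluation integration)
baseline_results = evaluate(
    "zoo:llama2-7b-llama2_pretrain-base",
    datasets="openai_humaneval",
    integration="perplexity",
)
print(baseline_results)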
Next Steps
Want to learn more about deployments with Neural Magic? Here are a few paths to consider:
- Specialize in LLMs: Dive deeper into text generation techniques within our LLMs section.
- Expand to Other Domains: Explore how to deploy optimized models for Computer Vision or Natural Language Processing tasks.
- Tailor to Your Needs: Learn about flexible deployment options in our Custom Integrations section.