nm-vllm

nm-vllm is a high-throughput and memory-efficient inference and serving engine for LLMs.

Documentation

📄️ Releases

Versions

📄️ Deploying with Docker

nm-vllm offers an official Docker image for deployment.

📄️ OpenAI Compatible Server

nm-vllm provides an HTTP server that implements OpenAI's Completions and Chat API.

Overview

vLLM is a fast and easy-to-use library for LLM inference to which Neural Magic regularly contributes upstream improvements. This fork, nm-vllm, is our opinionated focus on incorporating the latest LLM optimizations, such as quantization and sparsity, for enhanced performance.

Installation

The nm-vllm PyPI package includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.

Install it using pip:

pip install nm-vllm

To use the weight-sparsity kernels, such as through sparsity="sparse_w16a16", extend the installation with the sparsity extras:

pip install nm-vllm[sparse]

You can also build and install nm-vllm from source (this will take ~10 minutes):

git clone https://github.com/neuralmagic/nm-vllm.git
cd nm-vllm
pip install -e .
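
After installing, a quick way to confirm the package is importable is to print its reported version. This is a minimal sketch that assumes the fork exposes the usual vllm.__version__ attribute, as the upstream vllm package does:

# Minimal post-install sanity check (assumes the standard vllm.__version__ attribute)
import vllm

print(vllm.__version__)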

Quickstart

Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, neuralmagic and nm-testing.

A collection of ready-to-use SparseGPT and GPTQ models in the inference-optimized Marlin format is available on Hugging Face.

Model Inference with Marlin (4-bit Quantization)

Marlin is an extremely optimized FP16xINT4 matmul kernel aimed at LLM inference that can deliver close to ideal (4x) speedups up to batch sizes of 16-32 tokens. To use Marlin within nm-vllm, simply pass a Marlin-quantized model directly to the engine. It will detect the quantization from the model's config.

Here is a demonstration with a 4-bit quantized OpenHermes 2.5 Mistral 7B model:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin"
model = LLM(model_id, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)

messages = [
    {"role": "user", "content": "What is synthetic data in machine learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Model Inference with Weight Sparsity

For a quick demonstration, here's how to run a small 50% sparse llama2-110M model trained on storytelling:

from vllm import LLM, SamplingParams

model = LLM(
    "neuralmagic/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",  # If left off, the model will be loaded as dense
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Here is a more realistic example of running a 50% sparse OpenHermes 2.5 Mistral 7B model finetuned for instruction-following:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50"
model = LLM(model_id, sparsity="sparse_w16a16", max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)

messages = [
    {"role": "user", "content": "What is sparsity in deep learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

There is also support for semi-structured 2:4 sparsity using the sparsity="semi_structured_sparse_w16a16" argument:

from vllm import LLM, SamplingParams

model = LLM("neuralmagic/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Integration with OpenAI-Compatible Server

You can also quickly use the same flow with an OpenAI-compatible model server:

python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50 \
    --sparsity sparse_w16a16 \
    --max-model-len 4096
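
Once the server is up, you can query it with the official openai Python client. Below is a minimal sketch, assuming the server is reachable at the default local address (http://localhost:8000/v1) and that no API key has been configured (any placeholder string works in that case):

from openai import OpenAI

# Point the client at the local nm-vllm server (default port assumed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50",
    messages=[{"role": "user", "content": "What is sparsity in deep learning?"}],
    max_tokens=100,
    temperature=0.8,
    top_p=0.95,
)
print(response.choices[0].message.content)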

Quantized Inference Performance

Developed in collaboration with IST-Austria, GPTQ is the leading quantization algorithm for LLMs, which enables compressing the model weights from 16 bits to 4 bits with limited impact on accuracy. nm-vllm includes support for the recently developed Marlin kernels for accelerating GPTQ models. Prior to Marlin, the existing kernels for INT4 inference failed to scale in scenarios with multiple concurrent users.

[Figure: Marlin performance]

Sparse Inference Performance

Developed in collaboration with IST-Austria, SparseGPT and Sparse Fine-tuning are the leading algorithms for pruning LLMs, which enable removing at least half of the model weights with limited impact on accuracy.

nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and inference acceleration by leveraging weight sparsity.

[Figure: Sparse memory compression]
[Figure: Sparse inference performance]