Deploying with Docker
nm-vllm offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on GitHub Packages as neuralmagic/nm-vllm-openai.
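The image can be pulled ahead of time with a standard docker pull; the latest tag used in the run command below is assumed here:

docker pull ghcr.io/neuralmagic/nm-vllm-openai:latest

The server can then be launched as follows: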
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
ghcr.io/neuralmagic/nm-vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
You can either use the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared memory. nm-vllm uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
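For example, the same invocation with --shm-size in place of --ipc=host might look like the sketch below; the 10g size and the --tensor-parallel-size 2 server argument are illustrative values (tensor parallelism of 2 assumes two visible GPUs), not requirements:

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--shm-size=10g \
ghcr.io/neuralmagic/nm-vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1 \
--tensor-parallel-size 2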
You can build and run nm-vllm from source via the provided Dockerfile. To build nm-vllm:
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag neuralmagic/nm-vllm-openai # optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
By default, nm-vllm builds for all GPU types for the widest distribution. If you are only building for the GPU type of the machine you are building on, you can add the argument --build-arg torch_cuda_arch_list="" so that nm-vllm detects the current GPU type and builds for it.
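For example, a build restricted to the local GPU architecture could be invoked as sketched here (the max_jobs value is illustrative):

DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag neuralmagic/nm-vllm-openai --build-arg torch_cuda_arch_list="" --build-arg max_jobs=8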
To run nm-vllm:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
neuralmagic/nm-vllm-openai <args...>
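Once the container is running, the server exposes the OpenAI-compatible API on port 8000. A minimal smoke test with curl, assuming the Mistral model from the earlier example is being served:

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "mistralai/Mistral-7B-v0.1", "prompt": "San Francisco is a", "max_tokens": 32}'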