
Deploy a Model

DeepSparse comes pre-installed with a server that enables easy, performant model deployments. Instead of the Python APIs or CLIs, the server exposes an HTTP interface for running inferences against a deployed model. It is a production-ready model-serving solution built on Neural Magic's sparsification technology, resulting in faster and cheaper deployments.

The inference server is built with performance and flexibility in mind, with support for multiple models and multiple simultaneous streams. It is also designed to be a plug-and-play solution for many ML Ops deployment solutions, including Kubernetes and Amazon SageMaker.
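As a rough illustration of the HTTP interface described above, the sketch below builds a JSON payload and POSTs it to a locally running DeepSparse Server. The port (5543), the `/predict` endpoint, and the `sequences` input field are assumptions based on common defaults and may differ by task and release; check the server's startup output for the actual routes.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:5543/predict"  # assumed default host/port and route

def build_request(text: str) -> bytes:
    """Encode the JSON payload for a single text-classification inference."""
    return json.dumps({"sequences": text}).encode("utf-8")  # assumed input field name

def classify(text: str) -> dict:
    """POST one sequence to the running server and return the parsed response."""
    req = urllib.request.Request(
        SERVER_URL,
        data=build_request(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    # Requires a server started separately, for example:
    #   deepsparse.server --task text_classification --model_path <stub-or-path>
    print(classify("DeepSparse makes CPU inference fast."))
```

Because the server speaks plain HTTP, the same request can be issued from any language or tool (curl, a Kubernetes liveness probe, a SageMaker client) without DeepSparse installed on the caller's side.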

Use Case Examples

The examples below walk through use cases leveraging DeepSparse Server for deployment.

Other Use Cases

More documentation, models, use cases, and examples are continually being added. If you don't see one you're interested in, search the DeepSparse GitHub repo, the SparseML GitHub repo, the SparseZoo website, or ask in the Neural Magic Slack.
