NVIDIA Dynamo: Open-Source Library Optimizes AI Reasoning
Inference Optimizations From NVIDIA Blackwell and Dynamo Reduce Costs, Improve Performance for Scaling Test-Time Compute, and Increase DeepSeek-R1 Throughput 30x
NVIDIA Dynamo disaggregates the processing and generation phases of large language models (LLMs) onto separate GPUs and coordinates them, accelerating inference communication among hundreds of GPUs
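To make that split concrete, here is a minimal Python sketch of the disaggregation idea, assuming a simple in-process handoff queue; the worker functions and placeholder KV cache are hypothetical stand-ins for illustration, not the NVIDIA Dynamo API.

```python
# Minimal sketch of disaggregated serving: prefill and decode run in
# separate worker pools, connected by a queue carrying the KV cache.
# All names here are illustrative, not the NVIDIA Dynamo API.
from dataclasses import dataclass
from queue import Queue

@dataclass
class PrefillResult:
    request_id: str
    kv_cache: list[float]   # stand-in for the real KV tensors
    first_token: str

def prefill_worker(prompt: str, request_id: str, handoff: Queue) -> None:
    # Compute-bound phase: process the whole prompt once, build the KV cache.
    kv_cache = [float(len(tok)) for tok in prompt.split()]  # placeholder
    handoff.put(PrefillResult(request_id, kv_cache, first_token="The"))

def decode_worker(handoff: Queue, max_tokens: int = 4) -> list[str]:
    # Memory-bandwidth-bound phase: generate tokens one at a time,
    # reusing the KV cache produced on the prefill GPU.
    result = handoff.get()
    tokens = [result.first_token]
    for _ in range(max_tokens - 1):
        tokens.append("token")  # placeholder for one decode step
    return tokens

handoff: Queue = Queue()
prefill_worker("Explain disaggregated serving", "req-1", handoff)
print(decode_worker(handoff))
```

Separating the compute-bound prefill phase from the memory-bandwidth-bound decode phase lets each worker pool be sized and scheduled independently.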
NVIDIA Dynamo's intelligent inference optimizations boost the number of DeepSeek-R1 tokens generated by over 30x per GPU on a large cluster of GB200 NVL72 racks
NVIDIA Dynamo is open source and supports PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, allowing researchers, startups, and enterprises to develop and optimize ways to serve AI models
NVIDIA Dynamo maps the KV cache, the knowledge an inference system holds in memory from serving prior requests, across thousands of GPUs, then routes new requests to the GPUs with the best match, avoiding costly recomputation
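As a rough illustration of cache-aware routing, the sketch below hashes block-aligned prompt prefixes and sends each request to the worker with the most overlapping cached blocks; prefix_hashes, pick_worker, and the worker names are invented for this example and do not reflect Dynamo's internals.

```python
# Illustrative sketch of KV-cache-aware routing (not the Dynamo API):
# route each request to the worker whose cached prefixes overlap it most.
import hashlib

def prefix_hashes(tokens: list[str], block: int = 4) -> set[str]:
    # Hash each block-aligned prefix so overlap can be compared cheaply.
    return {
        hashlib.sha256(" ".join(tokens[:i]).encode()).hexdigest()
        for i in range(block, len(tokens) + 1, block)
    }

def pick_worker(request_tokens: list[str],
                worker_caches: dict[str, set[str]]) -> str:
    # Choose the worker holding the most matching prefix blocks,
    # so the prompt's KV cache can be reused instead of recomputed.
    request = prefix_hashes(request_tokens)
    return max(worker_caches, key=lambda w: len(worker_caches[w] & request))

workers = {
    "gpu-0": prefix_hashes("explain kv cache reuse in llm serving".split()),
    "gpu-1": prefix_hashes("write a haiku about spring rain".split()),
}
print(pick_worker("explain kv cache reuse in detail".split(), workers))  # gpu-0
```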
AI provider Cohere plans to integrate NVIDIA Dynamo to power agentic AI in its Command models
Together AI, the AI Acceleration Cloud, plans to integrate its Together Inference Engine with NVIDIA Dynamo so inference workloads can scale seamlessly across GPU nodes
In the future, NVIDIA AI Enterprise will provide production-grade security, support, and stability for NVIDIA Dynamo
NVIDIA Dynamo is an open-source, modular inference framework for serving generative AI models in distributed environments
NVIDIA Dynamo supports all major frameworks, including TensorRT-LLM, vLLM, SGLang, and PyTorch, so you can quickly deploy and serve new generative AI models
NVIDIA Dynamo optimizes inference by reusing prior context, offloading the KV cache to more cost-effective memory, and minimizing recomputation, which cuts inference costs
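A minimal sketch of the offloading idea follows, assuming a two-tier cache where cold entries move from fast GPU memory to cheaper host memory instead of being discarded; the TieredKVCache class is hypothetical and does not mirror Dynamo's implementation.

```python
# Illustrative sketch (not Dynamo's implementation): a two-tier KV cache
# that keeps hot entries in fast "GPU" memory and offloads cold ones to
# cheaper "host" memory instead of discarding and recomputing them.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int) -> None:
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # fast tier
        self.host: dict[str, bytes] = {}                   # cheap tier
        self.gpu_capacity = gpu_capacity

    def put(self, key: str, kv_blocks: bytes) -> None:
        self.gpu[key] = kv_blocks
        self.gpu.move_to_end(key)
        if len(self.gpu) > self.gpu_capacity:
            # Evict the least recently used entry to host memory
            # rather than recomputing it on the next request.
            cold_key, cold_val = self.gpu.popitem(last=False)
            self.host[cold_key] = cold_val

    def get(self, key: str) -> bytes | None:
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:
            # Promote back to the fast tier on reuse.
            self.put(key, self.host.pop(key))
            return self.gpu[key]
        return None  # miss: the prefill must be recomputed

cache = TieredKVCache(gpu_capacity=2)
for k in ("req-a", "req-b", "req-c"):
    cache.put(k, b"kv")
assert "req-a" in cache.host          # offloaded, not lost
assert cache.get("req-a") == b"kv"    # reused without recomputation
```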
Because NVIDIA Dynamo is open and modular, developers can integrate its components into their own inference engines to handle more requests while making the most of their compute investment and resource use