NVIDIA Dynamo: Open-Source Library Optimizes AI Reasoning

Inference optimizations from NVIDIA Blackwell and Dynamo reduce costs and improve performance when scaling test-time compute, boosting DeepSeek-R1 throughput 30x

NVIDIA Dynamo disaggregates the processing and generation phases of large language model (LLM) inference onto different GPUs and coordinates them, accelerating inference communication across hundreds of GPUs
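
To make the split concrete, here is a minimal, hypothetical Python sketch of disaggregated serving: one worker runs the compute-bound prefill (prompt-processing) phase and builds the KV cache, and a second worker runs the memory-bound decode (token-generation) phase against that cache. All class and function names are illustrative stand-ins, not Dynamo's actual API.

```python
# Toy sketch of disaggregated serving: prefill and decode on separate workers.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for the per-layer attention key/value tensors a real engine keeps.
    tokens: list
    entries: dict = field(default_factory=dict)

class PrefillWorker:
    """Runs the prompt through the model once to build the KV cache (GPU pool A)."""
    def run(self, prompt: str) -> KVCache:
        tokens = prompt.split()
        return KVCache(tokens=tokens,
                       entries={i: f"kv({t})" for i, t in enumerate(tokens)})

class DecodeWorker:
    """Generates tokens one at a time, reusing the transferred cache (GPU pool B)."""
    def run(self, cache: KVCache, max_new_tokens: int) -> list:
        out = []
        for step in range(max_new_tokens):
            # A real system attends over the cached keys/values each step;
            # here we just extend the toy cache with a placeholder token.
            out.append(f"tok{step}")
            cache.entries[len(cache.entries)] = f"kv(tok{step})"
        return out

prefill, decode = PrefillWorker(), DecodeWorker()
cache = prefill.run("explain disaggregated serving")  # runs on the prefill pool
print(decode.run(cache, max_new_tokens=3))            # runs on the decode pool
```

In a real deployment the two worker pools are sized independently, so prefill-heavy and decode-heavy traffic no longer contend for the same hardware.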

NVIDIA Dynamo's intelligent inference optimizations increase DeepSeek-R1 token generation by over 30x per GPU on a large cluster of GB200 NVL72 racks

NVIDIA Dynamo is open source and supports PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, allowing researchers, startups, and enterprises to optimize how they serve AI models

NVIDIA Dynamo maps the KV cache, the knowledge an inference system retains in memory from serving prior requests, across potentially thousands of GPUs, and routes new requests to the GPUs with the best knowledge match to avoid costly recomputation
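
The routing idea can be illustrated with a small, hypothetical Python sketch: send each request to the worker whose cached tokens share the longest prefix with the incoming prompt, so the least prefill work has to be redone. The function names and worker labels are made up for illustration and are not Dynamo's router API.

```python
# Toy sketch of KV-cache-aware routing across workers.
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the common token prefix between two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str, worker_caches: dict) -> str:
    """Pick the worker whose cache overlaps most with the prompt."""
    tokens = prompt.split()
    return max(worker_caches,
               key=lambda w: shared_prefix_len(tokens, worker_caches[w]))

caches = {
    "gpu-0": "summarize the quarterly report".split(),
    "gpu-1": "summarize the annual report".split(),
}
# gpu-1 already caches three of the four prompt tokens, so it wins.
print(route("summarize the annual filing", caches))  # -> gpu-1
```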

AI provider Cohere plans to integrate NVIDIA Dynamo to enable agentic AI in its Command series of models

Together AI, the AI Acceleration Cloud, plans to integrate its Together Inference Engine with NVIDIA Dynamo so inference workloads can scale seamlessly across GPU nodes

In the future, NVIDIA AI Enterprise will provide NVIDIA Dynamo with production-grade security, support, and stability

NVIDIA Dynamo is an open-source, modular inference framework designed for serving generative AI models in distributed environments

NVIDIA Dynamo supports all major frameworks, including TensorRT-LLM, vLLM, SGLang, PyTorch, and others, so teams can rapidly serve new generative AI models

NVIDIA Dynamo optimizes inference serving by reusing previously computed context, offloading the KV cache to more affordable memory tiers, and minimizing recomputation, cutting overall inference costs
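
Here is a toy Python sketch of that offloading pattern, assuming a two-tier setup: hot KV-cache blocks stay in scarce GPU memory, and cold blocks spill to cheaper host memory instead of being discarded, so a returning prefix is reloaded rather than recomputed. The capacities, tier names, and the class itself are invented for illustration; this is not Dynamo's memory manager.

```python
# Toy sketch of tiered KV-cache offloading with LRU eviction.
from collections import OrderedDict
from typing import Optional

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # hot tier (LRU order), e.g. GPU HBM
        self.host = {}            # cheaper, larger tier, e.g. CPU RAM

    def put(self, prefix: str, kv_block: str) -> None:
        self.gpu[prefix] = kv_block
        self.gpu.move_to_end(prefix)
        # Spill least-recently-used blocks to the cheap tier, not to oblivion.
        while len(self.gpu) > self.gpu_capacity:
            cold_prefix, cold_block = self.gpu.popitem(last=False)
            self.host[cold_prefix] = cold_block

    def get(self, prefix: str) -> Optional[str]:
        if prefix in self.gpu:
            self.gpu.move_to_end(prefix)
            return self.gpu[prefix]
        if prefix in self.host:
            # Hit in the cold tier: promote it instead of recomputing prefill.
            self.put(prefix, self.host.pop(prefix))
            return self.gpu[prefix]
        return None  # true miss: prefill must be recomputed

cache = TieredKVCache(gpu_capacity=2)
for p in ("sys-prompt-A", "sys-prompt-B", "sys-prompt-C"):
    cache.put(p, f"kv({p})")
print(cache.get("sys-prompt-A"))  # served from host memory, no recompute
```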

Because NVIDIA Dynamo is open and fully modular, teams can integrate just the components they need into their inference engine, serving more requests while getting the most from their compute investment and resources