llm-d: The Next Generation of AI Inference

llm-d is a new open-source, Kubernetes-native distributed inference serving stack for large language models, founded by Google Cloud, IBM Research, NVIDIA, Red Hat, and CoreWeave.

The project aims to make scalable, affordable, and efficient LLM inference accessible to everyone, following a community-led development model under the Apache 2.0 license.

llm-d builds on vLLM’s high-performance inference engine, optimizing it for distributed, disaggregated serving in Kubernetes environments.
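
For orientation, the vLLM engine that llm-d builds on can already be driven directly through its offline Python API; the sketch below shows that baseline (the model name is an arbitrary example, not anything llm-d prescribes).

```python
# Minimal sketch of using the vLLM engine directly; the model name is an
# arbitrary example, not an llm-d default.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain disaggregated inference in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```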

A vLLM-optimized inference scheduler replaces traditional round-robin load balancing, using telemetry and scoring algorithms to route requests for lower latency and better hardware utilization.
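
llm-d’s actual scoring internals are its own, so the following is only an illustrative sketch of what telemetry-driven routing looks like in general: each replica reports load signals, a score is computed, and the request goes to the best-scoring replica instead of the next one in a round-robin rotation. All field names and weights are hypothetical.

```python
# Illustrative only: a telemetry-driven scorer in the spirit of a
# load-aware inference scheduler. Fields and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class ReplicaTelemetry:
    name: str
    queue_depth: int              # requests waiting on this replica
    kv_cache_utilization: float   # fraction of KV cache memory in use (0..1)
    prefix_cache_hit: bool        # does this replica already hold the prompt prefix?

def score(r: ReplicaTelemetry) -> float:
    # Lower is better: penalize queueing and memory pressure,
    # reward replicas that can reuse a cached prefix.
    s = 1.0 * r.queue_depth + 2.0 * r.kv_cache_utilization
    if r.prefix_cache_hit:
        s -= 3.0
    return s

def pick_replica(replicas: list[ReplicaTelemetry]) -> ReplicaTelemetry:
    return min(replicas, key=score)

replicas = [
    ReplicaTelemetry("pod-a", queue_depth=4, kv_cache_utilization=0.9, prefix_cache_hit=False),
    ReplicaTelemetry("pod-b", queue_depth=1, kv_cache_utilization=0.4, prefix_cache_hit=True),
]
print(pick_replica(replicas).name)  # -> pod-b
```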

Disaggregated serving separates the prefill and decode phases of LLM inference across different instances: prefill is compute-bound while decode is memory-bandwidth-bound, so splitting them lets each phase be scaled and scheduled independently, reducing latency and increasing throughput.
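
A conceptual sketch of the split, not llm-d’s actual interfaces: a prefill worker runs one pass over the prompt and hands its KV cache to a decode worker, which then generates tokens incrementally.

```python
# Conceptual sketch of disaggregated serving; class and method names are
# hypothetical, and the model math is replaced by placeholders.
from typing import Any

class PrefillWorker:
    def prefill(self, prompt_tokens: list[int]) -> dict[str, Any]:
        # Compute-bound: one forward pass over the whole prompt,
        # producing the KV cache and the first generated token.
        kv_cache = {"layers": f"<kv for {len(prompt_tokens)} tokens>"}
        first_token = 42  # placeholder for a real sampled token
        return {"kv_cache": kv_cache, "first_token": first_token}

class DecodeWorker:
    def decode(self, kv_cache: dict[str, Any], first_token: int, max_new_tokens: int) -> list[int]:
        # Memory-bandwidth-bound: generate tokens one at a time,
        # extending the transferred KV cache.
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append(tokens[-1] + 1)  # placeholder for a real forward pass
        return tokens

state = PrefillWorker().prefill(prompt_tokens=[1, 2, 3, 4])
print(DecodeWorker().decode(state["kv_cache"], state["first_token"], max_new_tokens=4))
```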

Two caching schemes are planned: independent (north/south, N/S) caching, where each instance manages its own cache, for low operational cost, and shared (east/west, E/W) caching, where a global index lets instances reuse each other’s cached state, for higher performance.
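
As an illustration of the shared (E/W) idea, a global index could map prompt-prefix hashes to the instances that already hold the corresponding KV blocks, letting the scheduler route for cache hits; everything below is a hypothetical sketch, not llm-d’s data structures.

```python
# Hypothetical sketch of a cluster-wide prefix index for shared (E/W) caching.
import hashlib

class GlobalPrefixIndex:
    def __init__(self) -> None:
        self._index: dict[str, set[str]] = {}  # prefix hash -> replica names

    @staticmethod
    def _key(prefix_tokens: list[int]) -> str:
        return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

    def publish(self, prefix_tokens: list[int], replica: str) -> None:
        # A replica announces that it holds KV blocks for this prefix.
        self._index.setdefault(self._key(prefix_tokens), set()).add(replica)

    def lookup(self, prefix_tokens: list[int]) -> set[str]:
        # The scheduler asks which replicas already have this prefix cached.
        return self._index.get(self._key(prefix_tokens), set())

index = GlobalPrefixIndex()
index.publish([1, 2, 3], "pod-a")
print(index.lookup([1, 2, 3]))  # -> {'pod-a'}: route here for a cache hit
print(index.lookup([9, 9]))     # -> set(): any replica will do
```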

Planned variant autoscaling will dynamically adjust resources based on hardware, workload, and traffic characteristics, using the Kubernetes Horizontal Pod Autoscaler (HPA) to meet service-level objectives (SLOs) efficiently.
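
The HPA itself scales replicas proportionally to how far an observed metric sits from its target; the sketch below applies that standard formula to a latency-style SLO metric, which is a hypothetical example rather than an llm-d configuration.

```python
# The standard Kubernetes HPA scaling rule, applied here to a hypothetical
# latency SLO metric for decode replicas.
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # desired = ceil(current * observed / target): scale up when the observed
    # metric exceeds its target, down when it falls below.
    return max(1, math.ceil(current_replicas * current_metric / target_metric))

# If 4 replicas observe 900 ms average latency against a 500 ms SLO target,
# the HPA would scale to ceil(4 * 900 / 500) = 8 replicas.
print(desired_replicas(current_replicas=4, current_metric=900.0, target_metric=500.0))  # -> 8
```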