JetStream is an open-source inference engine designed specifically for TPUs, with future GPU support planned
JetStream optimizes throughput and memory efficiency for Large Language Model (LLM) inference, supporting models such as Llama, GPT, and Gemma
Initially designed for XLA devices, JetStream supports both TPU v5e and the sixth-generation Trillium TPUs, with the newer generation delivering significant performance improvements over v5e
JetStream integrates Google’s Pathways runtime, enabling multi-host inference and disaggregated serving for large models, improving scalability and efficiency
Pathways allows a model to be split across multiple accelerator hosts, enabling inference for large models such as Llama 3.1 405B, which reaches 1703 tokens/s on Trillium TPUs
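The source does not show Pathways code, but the underlying mechanism of splitting weights across accelerators can be illustrated with plain JAX sharding primitives. The sketch below is an illustrative stand-in, not Pathways or JetStream code; the mesh axis name and tensor shapes are arbitrary assumptions.

```python
# Illustrative JAX sketch of sharding a weight matrix across accelerators,
# the basic mechanism behind splitting a large model over multiple hosts.
# This is NOT Pathways or JetStream code; axis names and shapes are made up.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()  # TPU chips in production; falls back to CPU locally
mesh = Mesh(mesh_utils.create_device_mesh((len(devices),)), axis_names=("model",))

# Place a large weight matrix so its output dimension is split across devices.
w = jax.device_put(jnp.zeros((4096, 8192)), NamedSharding(mesh, P(None, "model")))
x = jnp.ones((8, 4096))  # a small, replicated activation batch

@jax.jit
def forward(x, w):
    return x @ w  # XLA inserts any collectives the sharded matmul needs

y = forward(x, w)
print(y.sharding)  # the output stays sharded along the "model" axis
```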
Pathways dynamically and independently scales the prefill and decode phases of LLM inference, improving efficiency and performance for large models such as Llama 2 70B
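Disaggregated serving is easiest to see as two worker pools connected by a queue: prefill is one compute-bound pass over the prompt, decode is many memory-bandwidth-bound single-token steps, and separating them lets each pool be sized on its own. The sketch below is purely conceptual; every name in it (Request, prefill_worker, DummyModel, and so on) is hypothetical and not part of the Pathways or JetStream API.

```python
# Conceptual sketch of disaggregated serving: prefill and decode run in
# separate worker pools, so each phase can scale independently. All names
# here are hypothetical; this is not the Pathways or JetStream API.
from dataclasses import dataclass, field
import queue

@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: dict = None            # produced by prefill, consumed by decode
    generated: list = field(default_factory=list)

prefill_queue = queue.Queue()  # fed by the frontend, drained by prefill workers
decode_queue = queue.Queue()   # handoff point between the two pools

class DummyModel:
    """Stand-in for a real sharded model, so the sketch runs anywhere."""
    def prefill(self, tokens):        # one compute-bound pass over the prompt
        return {"pos": len(tokens)}
    def decode_step(self, kv_cache):  # one memory-bandwidth-bound token step
        kv_cache["pos"] += 1
        return kv_cache["pos"]

def prefill_worker(model):
    req = prefill_queue.get()
    req.kv_cache = model.prefill(req.prompt_tokens)
    decode_queue.put(req)             # hand the request (and KV cache) over

def decode_worker(model):
    req = decode_queue.get()
    while len(req.generated) < req.max_new_tokens:
        req.generated.append(model.decode_step(req.kv_cache))

# A pool of prefill hosts and a pool of decode hosts would each run these
# loops; sizing the two pools separately is what "dynamically scales" means.
prefill_queue.put(Request(prompt_tokens=[1, 2, 3], max_new_tokens=4))
model = DummyModel()
prefill_worker(model)
decode_worker(model)
```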
Compared to TPU v5e, Trillium TPUs deliver 2.9x throughput on Llama 2 70B and 2.8x on Mixtral 8x7B, along with three times the inference performance per dollar
Trillium TPUs also cut image-generation costs, producing 1,000 images for as little as 22 cents, 35% cheaper than on TPU v5e
JetStream is available on GitHub under the Apache-2.0 license, with tooling for benchmarking, local setup, and online inference on TPUs
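JetStream's repository ships its own benchmark tooling; since the exact commands aren't given here, the snippet below is only a generic throughput-measurement sketch against any generate() callable, with a dummy generator standing in for a real model server.

```python
# Generic tokens/s micro-benchmark sketch. JetStream provides its own
# benchmark scripts; this stand-alone version only illustrates the metric.
import time

def measure_throughput(generate, prompts, max_new_tokens=128):
    """Returns generated tokens per second across all prompts."""
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        total_tokens += len(generate(prompt, max_new_tokens))
    return total_tokens / (time.perf_counter() - start)

if __name__ == "__main__":
    dummy = lambda prompt, n: list(range(n))  # stand-in for a real model call
    print(f"{measure_throughput(dummy, ['hello'] * 8):,.0f} tokens/s")
```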
JetStream is part of Google’s AI Hypercomputer ecosystem, which includes hardware such as TPUs and NVIDIA GPUs