Google JetStream

JetStream is an open-source inference engine optimized for throughput and memory efficiency, designed specifically for TPUs, with GPU support planned for the future.

JetStream improves performance and memory efficiency for Large Language Model (LLM) inference, supporting models such as Llama, GPT, and Gemma.

Initially designed for XLA devices, JetStream supports TPU v5e and the sixth-generation Trillium TPU, offering significant performance improvements.

JetStream integrates Google’s Pathways runtime, enabling multi-host inference and disaggregated serving for large models, improving scalability and efficiency.

Pathways allows a model to be split across multiple accelerator hosts, enabling inference for very large models: Llama 3.1 405B reaches 1,703 tokens/s on Trillium TPUs.

Pathways dynamically scales the prefill and decode phases of LLM inference independently, improving efficiency and performance for large models such as Llama 2 70B.
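The prefill/decode split described above can be sketched conceptually. The toy Python below is not JetStream's or Pathways' actual API; it is a simplified illustration of disaggregated serving, where the two phases run in separate worker pools that a runtime could scale independently:

```python
# Toy sketch of disaggregated LLM serving: prefill (process the whole prompt,
# build the KV cache) and decode (generate tokens one at a time) run in
# separate pools connected by queues. In a real deployment such as
# JetStream + Pathways, these pools would live on different accelerator
# hosts and be scaled to match each phase's load. All names here are
# illustrative, not real JetStream identifiers.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)  # stand-in for the KV cache
    output: list = field(default_factory=list)

def prefill_worker(req: Request) -> Request:
    # Prefill: ingest the entire prompt in one pass and populate the cache.
    req.kv_cache = req.prompt.split()
    return req

def decode_worker(req: Request) -> Request:
    # Decode: emit tokens one step at a time, reusing the KV cache.
    for i in range(req.max_new_tokens):
        req.output.append(f"tok{i}")
    return req

# Queues decouple the two phases, so each pool can be sized separately.
prefill_queue: Queue = Queue()
decode_queue: Queue = Queue()

prefill_queue.put(Request(prompt="the quick brown fox", max_new_tokens=3))
decode_queue.put(prefill_worker(prefill_queue.get()))
done = decode_worker(decode_queue.get())
```

Separating the phases matters because prefill is compute-bound while decode is memory-bandwidth-bound, so their ideal hardware allocations differ.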

Trillium TPUs deliver 2.9x the throughput for Llama 2 70B and 2.8x for Mixtral 8x7B compared with TPU v5e, with roughly three times the inference per dollar.
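The two figures above imply a relationship worth making explicit. Assuming the 2.9x throughput and 3x inference-per-dollar numbers refer to the same Llama 2 70B workload (an assumption, not stated in the text), the implied per-unit-time cost ratio falls out directly:

```python
# Figures from the text: Trillium vs TPU v5e on Llama 2 70B.
throughput_gain = 2.9        # 2.9x tokens/s
perf_per_dollar_gain = 3.0   # ~3x inference per dollar

# perf/$ = throughput / cost, so cost_ratio = throughput_gain / perf_per_dollar_gain.
# ASSUMPTION: both gains are measured on the same workload and time window.
implied_cost_ratio = throughput_gain / perf_per_dollar_gain
print(f"Implied Trillium cost relative to v5e: {implied_cost_ratio:.2f}x")
```

In other words, most of the price-performance gain comes from higher throughput rather than a lower running cost.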

Trillium TPUs also reduce costs for image generation, producing 1,000 images for as little as 22 cents, 35% cheaper than TPU v5e.
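As a quick sanity check of the cost claim, reading "35% cheaper" as a 35% reduction from the TPU v5e price (an assumption about the wording) implies the v5e baseline:

```python
# Claim from the text: 1,000 images for ~$0.22 on Trillium, 35% cheaper than v5e.
trillium_cost_per_1k = 0.22
savings = 0.35  # ASSUMPTION: "35% cheaper" means Trillium = (1 - 0.35) * v5e

# Back out the implied TPU v5e cost for the same 1,000 images.
v5e_cost_per_1k = trillium_cost_per_1k / (1 - savings)
print(f"Implied TPU v5e cost per 1,000 images: ${v5e_cost_per_1k:.2f}")
```

That puts the implied v5e cost at roughly 34 cents per 1,000 images, consistent with the stated savings.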

JetStream is available on GitHub under the Apache-2.0 license, with tools for benchmarking, local setup, and online inference on TPUs.

JetStream is part of Google’s AI Hypercomputer ecosystem, which spans hardware including TPUs and NVIDIA GPUs.