SSD and Network Limits for GPU Cluster Storage

The Future of Memory and Storage 2024 (FMS 2024) conference in Santa Clara featured many large-capacity SSD presentations

Large language models (LLMs), which are growing exponentially in size, require ever-increasing amounts of data for training

HDDs cannot keep up with this enormous growth in demand, even when users stripe data across thousands of drives

Public LLMs need user data for optimization and fine-tuning, plus application-specific data to expedite retrieval-augmented generation (RAG) during inference

Transferring data across various storage systems is complicated, costly, and power-inefficient

IOPS, or input/output operations per second, is a common metric used to assess SSD performance in compute systems
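The relationship between IOPS, throughput, and I/O size can be sketched with simple arithmetic; the drive numbers below are illustrative assumptions, not figures from the article or any vendor spec:

```python
# Rough IOPS estimate from sustained throughput and per-operation I/O size.
# IOPS = throughput / size of each I/O operation.
def iops(throughput_mb_s: float, io_size_kb: float) -> float:
    return throughput_mb_s * 1024 / io_size_kb

# Hypothetical NVMe SSD sustaining 3000 MB/s of 4 KiB random reads
print(round(iops(3000, 4)))  # 768000
```

The same formula explains why small random I/O (common in training-data shuffling) stresses IOPS, while large sequential reads stress raw bandwidth.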

With bandwidths up to 50 times higher than HDDs, SSDs allow the same system throughput to be achieved with far fewer SSDs than HDDs