Google Colossus Storage: Getting SSD Speeds At HDD Costs

Colossus, Google's general-purpose storage platform, combines the manageability and scalability of object storage, an intuitive programming model used by all Google teams, and throughput that rivals the best parallel file systems.

Hyperdisk ML uses Colossus SSD to serve 2,500 nodes reading at 1.2 TB/s. Spanner uses Colossus's tiered storage feature to combine cheap HDD storage with fast SSD storage in the same filesystem.
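To put the Hyperdisk ML figure in per-node terms, the arithmetic can be sketched as below. This is only a back-of-the-envelope average: it assumes decimal units (1 TB = 1,000,000 MB) and an even split across nodes, neither of which the source states.

```python
# Back-of-the-envelope: average per-node read bandwidth for the quoted figures.
# Assumptions (not from the source): decimal units and an even split across nodes.
NODES = 2_500
AGGREGATE_TB_PER_S = 1.2


def per_node_mb_per_s(aggregate_tb_per_s: float, nodes: int) -> float:
    """Return the average read bandwidth per node in MB/s."""
    return aggregate_tb_per_s * 1_000_000 / nodes


if __name__ == "__main__":
    print(f"{per_node_mb_per_s(AGGREGATE_TB_PER_S, NODES):.0f} MB/s per node")
```

Under those assumptions each node reads at roughly 480 MB/s on average.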

Colossus SSD caching lets Cloud Storage offer the cheapest storage while still supporting the demanding I/O of AI/ML applications. Finally, BigQuery's Colossus-based storage speeds up very large queries.

Colossus simplified the GFS programming model to an append-only storage system that combines the scalability of object storage with the familiar programming interface of file systems.
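As a rough illustration of what "append-only" means for a programmer, here is a toy in-memory model. The class and method names are hypothetical and are not Colossus APIs; the point is only that data is added at the end and read back by offset, with no operation that overwrites existing bytes.

```python
class AppendOnlyFile:
    """Toy in-memory append-only file (hypothetical, not a Colossus API)."""

    def __init__(self) -> None:
        self._data = bytearray()

    def append(self, record: bytes) -> int:
        """Add bytes at the end and return the offset where they start."""
        offset = len(self._data)
        self._data += record
        return offset

    def read(self, offset: int, length: int) -> bytes:
        """Read previously appended bytes; existing data is never modified."""
        return bytes(self._data[offset:offset + length])

    def size(self) -> int:
        return len(self._data)


if __name__ == "__main__":
    f = AppendOnlyFile()
    first = f.append(b"hello ")
    second = f.append(b"world")
    print(first, second, f.read(second, 5))
```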

The Colossus metadata service is made up of “curators,” which handle interactive control operations such as file creation and deletion, and “custodians,” which maintain data durability and availability as well as disk-space balance.

Colossus clients contact curators for metadata and then store data directly on “D servers,” which host the HDDs or SSDs.
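The split between the metadata path (curators) and the data path (D servers) can be sketched as follows. All class names, method names, and the chunk-naming scheme are illustrative assumptions; the real protocol is not described in the text.

```python
class DServer:
    """Holds chunk bytes on its local media (hypothetical model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.chunks: dict[str, bytes] = {}


class Curator:
    """Tracks which chunks make up each file and where they live."""

    def __init__(self) -> None:
        self._files: dict[str, list[tuple[str, str]]] = {}

    def create(self, path: str) -> None:
        self._files[path] = []

    def allocate_chunk(self, path: str, server_name: str) -> str:
        chunk_id = f"{path}#{len(self._files[path])}"
        self._files[path].append((server_name, chunk_id))
        return chunk_id

    def lookup(self, path: str) -> list[tuple[str, str]]:
        return list(self._files[path])


class Client:
    """Talks to the curator for metadata, then directly to D servers for data."""

    def __init__(self, curator: Curator, servers: dict[str, DServer]) -> None:
        self.curator = curator
        self.servers = servers

    def append(self, path: str, data: bytes, server_name: str) -> None:
        chunk_id = self.curator.allocate_chunk(path, server_name)  # metadata path
        self.servers[server_name].chunks[chunk_id] = data          # data path

    def read(self, path: str) -> bytes:
        locations = self.curator.lookup(path)                      # metadata path
        return b"".join(self.servers[s].chunks[c] for s, c in locations)


if __name__ == "__main__":
    servers = {"d1": DServer("d1"), "d2": DServer("d2")}
    client = Client(Curator(), servers)
    client.curator.create("/demo/file")
    client.append("/demo/file", b"part one, ", "d1")
    client.append("/demo/file", b"part two", "d2")
    print(client.read("/demo/file"))
```

In this sketch the curator never touches file bytes, which mirrors the idea that bulk data flows straight between clients and D servers.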

Google builds a single Colossus filesystem for each cluster, and that filesystem is a core building block of a Google Cloud zone.

Many of Google's largest filesystems exceed read throughputs of 50 TB/s and write throughputs of 25 TB/s. That is enough bandwidth to send about 100 8K videos every second.

Reading 50 TB/s is not feasible if all of your data sits on slow disks. This motivates two major new features in Colossus, SSD data placement and SSD caching, both powered by a system called “L4.”

Advanced users can use “hybrid placement” to tell Colossus to keep only one replica on SSD: ssd.1/myfile /cns/ex/home/leg/partition. This option is cheaper, but reads fall back to HDD latency if the D server hosting the SSD copy is down.
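The trade-off in hybrid placement can be illustrated with a small replica-selection sketch. The three-replica layout, the `Replica` type, and the `pick_replica` function are assumptions made for illustration and do not reflect how Colossus actually chooses replicas.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    server: str
    media: str      # "ssd" or "hdd"
    up: bool = True


def pick_replica(replicas: list[Replica]) -> Replica:
    """Prefer a live SSD replica; otherwise fall back to a live HDD replica."""
    live = [r for r in replicas if r.up]
    if not live:
        raise RuntimeError("no live replica available")
    return min(live, key=lambda r: 0 if r.media == "ssd" else 1)


if __name__ == "__main__":
    # Hybrid placement: exactly one of the replicas lives on SSD (assumed layout).
    replicas = [Replica("d1", "hdd"), Replica("d2", "ssd"), Replica("d3", "hdd")]
    print(pick_replica(replicas).media)   # served from SSD
    replicas[1].up = False                # the D server with the SSD copy goes down
    print(pick_replica(replicas).media)   # reads now pay HDD latency
```

With only one SSD copy, a single server outage is enough to push every read for that file back to HDD, which is the cost of the cheaper layout.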

In response to cache misses, L4 may add the accessed data to the SSD cache. It does this by instructing an SSD storage server to copy the data from HDD. When the cache fills, L4 evicts items to make room for new entries.
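The fill-on-miss and evict-when-full behavior can be sketched with a small cache model. Two simplifications are assumed here and are not from the source: every miss is admitted (the text says L4 "may" add data), and eviction is least-recently-used, since the actual eviction policy is not described.

```python
from collections import OrderedDict


class SsdCache:
    """Toy SSD cache in front of HDD storage (admit-on-miss, LRU eviction assumed)."""

    def __init__(self, capacity_blocks: int, hdd: dict[str, bytes]) -> None:
        if capacity_blocks < 1:
            raise ValueError("capacity_blocks must be at least 1")
        self.capacity = capacity_blocks
        self.hdd = hdd
        self.ssd: OrderedDict[str, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str) -> bytes:
        if block_id in self.ssd:
            self.ssd.move_to_end(block_id)   # mark as most recently used
            self.hits += 1
            return self.ssd[block_id]
        self.misses += 1
        data = self.hdd[block_id]            # slow path: read from HDD
        self._admit(block_id, data)          # copy the block into the SSD cache
        return data

    def _admit(self, block_id: str, data: bytes) -> None:
        while len(self.ssd) >= self.capacity:
            self.ssd.popitem(last=False)     # evict the least recently used block
        self.ssd[block_id] = data


if __name__ == "__main__":
    cache = SsdCache(2, {"a": b"1", "b": b"2", "c": b"3"})
    for block in ["a", "b", "a", "c"]:
        cache.read(block)
    print(list(cache.ssd), cache.hits, cache.misses)
```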

Google Cloud uses the I/O patterns observed for each group of workloads to run online simulations of placement policies such as “place on SSD for one hour,” “place on SSD for two hours,” and “don’t place on SSD.” These simulations help L4 choose the best policy for each group.
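A minimal sketch of such a policy simulation is shown below. The trace format, the scoring formula, and the `value_per_hit` and `cost_per_gb_hour` weights are all hypothetical; the source does not say how L4 scores policies, so this only illustrates the general idea of replaying observed reads against candidate time-on-SSD policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTrace:
    size_gb: float
    created_h: float                 # creation time, in hours
    read_times_h: tuple[float, ...]  # times of later reads, in hours


# Candidate policies from the text, expressed as hours spent on SSD after creation.
POLICIES = {"no_ssd": 0.0, "ssd_1h": 1.0, "ssd_2h": 2.0}


def simulate(traces: list[FileTrace], ttl_h: float) -> tuple[int, float]:
    """Return (reads served from SSD, SSD GB-hours consumed) for one policy."""
    hits = 0
    gb_hours = 0.0
    for t in traces:
        hits += sum(1 for r in t.read_times_h if r - t.created_h < ttl_h)
        gb_hours += t.size_gb * ttl_h
    return hits, gb_hours


def best_policy(traces: list[FileTrace], value_per_hit: float,
                cost_per_gb_hour: float) -> str:
    """Pick the policy with the highest (assumed) benefit-minus-cost score."""
    def score(name: str) -> float:
        hits, gb_hours = simulate(traces, POLICIES[name])
        return hits * value_per_hit - gb_hours * cost_per_gb_hour
    return max(POLICIES, key=score)


if __name__ == "__main__":
    traces = [FileTrace(1.0, 0.0, (0.1, 0.5, 1.5))]
    for cost in (0.5, 1.5, 3.0):
        print(cost, best_policy(traces, value_per_hit=1.0, cost_per_gb_hour=cost))
```

As the assumed SSD cost rises, the chosen policy in this toy trace moves from two hours on SSD, to one hour, to no SSD placement at all.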

Google Cloud relies on Colossus storage to serve billions of users. Its advanced SSD placement features respond dynamically to workload changes to lower costs and boost performance.