Google Colossus Storage: Getting SSD Speeds At HDD Costs
Colossus, Google's all-purpose storage platform, combines the management and scalability of object storage, an intuitive programming model used by all Google teams, and throughput that rivals the best parallel file systems.
Hyperdisk ML uses Colossus SSD to serve 2,500 nodes at 1.2 TB/s. Spanner's tiered storage feature uses Colossus to combine cheap HDD storage with fast SSD storage in the same filesystem.
Cloud Storage uses Colossus SSD caching to offer the cheapest storage while still supporting the demanding I/O of AI/ML applications. Finally, BigQuery's Colossus-based storage speeds up very large queries.
Colossus simplified the GFS programming model into an append-only storage system that combines the scalability of object storage with the familiar programming interface of a file system.
The Colossus metadata service is composed of “curators,” which handle interactive control operations such as file creation and deletion, and “custodians,” which maintain data durability and availability and keep disk space balanced.
Colossus clients interact with curators for metadata and then store data directly on “D servers,” which host the HDDs and SSDs.
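The split between metadata and data paths described above can be sketched in a few lines. This is a toy model, not Colossus code: the `Curator`, `DServer`, and `read_file` names, the chunk layout, and the example path are all hypothetical and exist only to show that the curator answers "where is the data" while the bytes come straight from the D servers.

```python
from dataclasses import dataclass, field


@dataclass
class DServer:
    """Hypothetical storage server holding raw chunk bytes on HDD or SSD."""
    name: str
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytes

    def read(self, chunk_id: str) -> bytes:
        return self.chunks[chunk_id]


@dataclass
class Curator:
    """Hypothetical metadata service: maps a file path to chunk locations."""
    locations: dict = field(default_factory=dict)  # path -> [(server, chunk_id)]

    def lookup(self, path: str) -> list:
        # Returns (d_server_name, chunk_id) pairs in file order.
        return self.locations[path]


def read_file(path: str, curator: Curator, d_servers: dict) -> bytes:
    # Step 1: metadata only. The curator never handles file contents.
    chunk_locations = curator.lookup(path)
    # Step 2: bulk data moves directly between the client and the D servers.
    return b"".join(d_servers[server].read(chunk) for server, chunk in chunk_locations)


d1 = DServer("d1", {"c0": b"hello "})
d2 = DServer("d2", {"c1": b"world"})
curator = Curator({"/example/file": [("d1", "c0"), ("d2", "c1")]})
result = read_file("/example/file", curator, {"d1": d1, "d2": d2})
```

Keeping the curator out of the data path is what lets metadata servers stay small while aggregate throughput scales with the number of D servers.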
Google constructs a single Colossus filesystem for each cluster, and that filesystem is a core building block of a Google Cloud zone.
Many of Google's largest filesystems exceed 50 TB/s of read throughput and 25 TB/s of write throughput. That is enough bandwidth to send 100 8K videos every second.
Reading 50 TB/s is hard if all of your data sits on slow disks. Two major new features in Colossus address this: SSD data placement and SSD caching, both driven by a system called “L4”.
Advanced users can use “hybrid placement” to tell Colossus to keep only one replica on SSD: ssd.1/myfile /cns/ex/home/leg/partition. This option is cheaper, but reads incur HDD latency if the D server hosting the SSD copy is down.
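The trade-off of hybrid placement can be illustrated with a small model. Everything here is an assumption made for the example: the `Replica` type, the `place_replicas` and `pick_read_replica` helpers, the server names, and the replica count of three are hypothetical and are not taken from Colossus.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    server: str
    media: str       # "ssd" or "hdd"
    up: bool = True  # whether the hosting D server is reachable


def place_replicas(ssd_replicas: int, total_replicas: int = 3) -> list:
    """Put the first `ssd_replicas` copies on SSD and the rest on HDD.

    total_replicas=3 is an illustrative assumption, not a documented value.
    """
    return [
        Replica(server=f"d{i}", media="ssd" if i < ssd_replicas else "hdd")
        for i in range(total_replicas)
    ]


def pick_read_replica(replicas: list) -> Replica:
    """Prefer a live SSD copy; otherwise fall back to a live HDD copy."""
    live = [r for r in replicas if r.up]
    if not live:
        raise RuntimeError("no live replica")
    # SSD replicas sort first because False < True.
    return sorted(live, key=lambda r: r.media != "ssd")[0]


replicas = place_replicas(ssd_replicas=1)   # hybrid: a single SSD copy
fast = pick_read_replica(replicas)          # served from SSD
replicas[0].up = False                      # the D server with the SSD copy fails
slow = pick_read_replica(replicas)          # falls back to HDD
```

With only one SSD copy, a single D server outage is enough to push reads onto HDD, which is the latency risk the text describes.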
In response to cache misses, L4 may add the accessed data to the SSD cache. It does this by instructing an SSD storage server to copy the data from HDD. When the cache fills up, L4 evicts items to make room for new entries.
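A minimal sketch of this miss-then-insert-then-evict cycle follows. The text does not say which eviction policy L4 uses, so least-recently-used (LRU) is assumed here purely for illustration. The sketch also admits every missed block, whereas L4 only "may" do so. `SsdCacheSketch` and its fields are hypothetical names.

```python
from collections import OrderedDict


class SsdCacheSketch:
    """Toy read cache: admit on miss, evict the least recently used block."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block_id -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str, hdd: dict) -> bytes:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        # Cache miss: fetch from the HDD copy and admit it to the SSD cache.
        self.misses += 1
        data = hdd[block_id]
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return data


hdd = {"a": b"A", "b": b"B", "c": b"C"}
cache = SsdCacheSketch(capacity_blocks=2)
for block in ["a", "b", "a", "c", "b"]:
    cache.read(block, hdd)
# cache.hits == 1, cache.misses == 4, cached blocks are now "c" and "b"
```

A real admission policy would likely be more selective, since copying every missed block to SSD spends write bandwidth on data that may never be read again.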
Google uses the I/O patterns it observes to simulate, online, placement policies such as “place on SSD for one hour,” “place on SSD for two hours,” and “don’t place on SSD.” These simulations help L4 choose the best policy for each group.
Google Cloud relies on Colossus storage to serve billions of users. Its advanced SSD placement features respond dynamically to workload changes to lower costs and boost performance.
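The idea of replaying observed reads against candidate policies can be sketched as follows. The cost model is invented for this example and is not Google's: SSD residency is charged per GB-hour, and each read that arrives after the file has left SSD pays a fixed HDD penalty. `FileTrace`, the two cost constants, and the three sample groups are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileTrace:
    size_gb: float
    read_hours: list  # hours after creation at which each read occurred


# Candidate policies named in the text, expressed as hours of SSD residency.
POLICIES = {"no_ssd": 0.0, "ssd_1h": 1.0, "ssd_2h": 2.0}

# Hypothetical cost weights, chosen only to make the example concrete.
SSD_COST_PER_GB_HOUR = 1.0
HDD_READ_PENALTY = 3.0


def simulated_cost(traces: list, ssd_hours: float) -> float:
    """Replay the traces under one policy and return its total cost."""
    cost = 0.0
    for t in traces:
        # Cost of keeping the file on SSD for the policy's duration.
        cost += t.size_gb * ssd_hours * SSD_COST_PER_GB_HOUR
        # Reads after the SSD window has ended are served from HDD.
        cost += HDD_READ_PENALTY * sum(1 for h in t.read_hours if h >= ssd_hours)
    return cost


def best_policy(traces: list) -> str:
    return min(POLICIES, key=lambda name: simulated_cost(traces, POLICIES[name]))


groups = {
    "hot_early": [FileTrace(1.0, [0.1, 0.2, 0.5, 0.8])],
    "hot_longer": [FileTrace(1.0, [0.5, 1.2, 1.5, 1.8])],
    "cold": [FileTrace(10.0, [5.0])],
}
choices = {name: best_policy(traces) for name, traces in groups.items()}
# choices == {"hot_early": "ssd_1h", "hot_longer": "ssd_2h", "cold": "no_ssd"}
```

Under these made-up weights, files read heavily in their first hour earn one hour of SSD, files read into their second hour earn two, and large, rarely read files stay on HDD.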