GCP – Colossus under the hood: How we deliver SSD performance at HDD prices
From YouTube and Gmail to BigQuery and Cloud Storage, almost all of Google’s products depend on Colossus, our foundational distributed storage system. As Google’s universal storage platform, Colossus delivers throughput that rivals or exceeds the best parallel file systems, the manageability and scale of an object storage system, and an easy-to-use programming model shared by all Google teams. Moreover, it does all this while serving the needs of products with incredibly diverse requirements, be it scale, affordability, throughput, or latency.
| Example application | I/O sizes | Expected performance |
| --- | --- | --- |
| BigQuery scans | hundreds of KBs to tens of MBs | TB/s |
| Cloud Storage – standard | KBs to tens of MBs | 100s of milliseconds |
| Gmail messages | less than hundreds of KBs | 10s of milliseconds |
| Gmail attachments | KBs to MBs | seconds |
| Hyperdisk reads | KBs to hundreds of KBs | <1 ms |
| YouTube video storage | MBs | seconds |
Colossus’ flexibility shows up in a number of publicly available Google Cloud products. Hyperdisk ML uses Colossus solid-state disk (SSD) to support 2,500 nodes reading at 1.2 TB/s, impressive scalability. Spanner uses Colossus to combine cheap HDD storage with super-fast SSD storage in the same filesystem, the foundation of its tiered storage feature. Cloud Storage uses Colossus SSD caching to deliver low-cost storage that still supports the intensive I/O of demanding AI/ML applications. Finally, BigQuery’s Colossus-based storage provides super-fast I/O for extra-large queries.
We last wrote about Colossus some time ago and wanted to give you some insights on how its capabilities support Google Cloud’s changing business and what new capabilities we’ve added, specifically around support for SSD.
Colossus background
But first, here’s a little background on Colossus:
- Colossus is an evolution of the Google File System (GFS).
- The traditional Colossus filesystem is contained in a single datacenter.
- Colossus simplified the GFS programming model to an append-only storage system that combines file systems’ familiar programming interface with the scalability of object storage.
- The Colossus metadata service is made up of “curators” that deal with interactive control operations like file creation and deletion, and “custodians,” which maintain the durability and availability of data as well as disk-space balancing.
- Colossus clients interact with curators for metadata and then store data directly on “D servers,” which host the HDDs and SSDs.
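To make that division of labor concrete, here is a minimal sketch in Python of the flow described above: a client asks a curator where a file’s chunks live, then fetches the bytes directly from the D servers. All of the names (ChunkLocation, CuratorClient, lookup, DServerClient, read_chunk) are hypothetical stand-ins for illustration, not Colossus’s real interfaces.

```python
# Hypothetical sketch of the Colossus read path: metadata from a curator,
# then data directly from D servers. Every name here is illustrative.
from dataclasses import dataclass


@dataclass
class ChunkLocation:
    d_server: str   # address of the D server holding this chunk
    chunk_id: str   # identifier of the chunk on that server
    offset: int     # byte offset of the chunk within the file


class CuratorClient:
    """Handles control operations (file creation, deletion, lookups)."""

    def lookup(self, path: str) -> list:
        # In reality an RPC to a curator that returns the file's chunk layout.
        raise NotImplementedError


class DServerClient:
    """Reads chunk data directly from a D server (HDD or SSD)."""

    def read_chunk(self, location: ChunkLocation) -> bytes:
        raise NotImplementedError  # direct data-path RPC in practice


def read_file(path: str, curator: CuratorClient, d_client: DServerClient) -> bytes:
    # 1) Metadata path: ask a curator where the file's chunks live.
    locations = curator.lookup(path)
    # 2) Data path: fetch each chunk straight from the D server that hosts it;
    #    the curator is not involved in moving the bytes.
    ordered = sorted(locations, key=lambda loc: loc.offset)
    return b"".join(d_client.read_chunk(loc) for loc in ordered)
```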
It’s also important to understand that Colossus is a zonal product. We build a single Colossus filesystem per cluster, an internal building block of a Google Cloud zone. Most data centers have one cluster and thus one Colossus filesystem, regardless of how many workloads run inside the cluster. Many Colossus filesystems have multiple exabytes of storage, including two different filesystems that have in excess of 10 exabytes of storage each. Being able to scale this high helps ensure that even the most demanding applications don’t run out of disk space close to their cluster compute resources within a zone.
These same demanding applications also need a large amount of IOPS and throughput. In fact, some of Google’s largest filesystems regularly exceed read throughputs of 50 TB/s and write throughputs of 25 TB/s. This is enough throughput to send more than 100 full-length 8K movies every second!
Nor do we rely on Colossus only for large streaming I/Os. Many applications do small log appends or small random reads. Our busiest single cluster delivers over 600M IOPS, combined between reads and writes.
Of course, to achieve such great performance, you need to get the right data to the right place. It’s hard to read at 50 TB/s if all of your data is located on slow disk drives. This leads us to two important new innovations in Colossus: SSD caching and SSD data placement, both powered by a system we call “L4”.
What’s new in Colossus SSD placement?
In our previous Colossus blog post, we talked about how we place the hottest data on SSDs and balance the remaining data across all of the devices in the cluster. This is all the more pertinent today, as over the years SSDs have become more affordable, increasing their prominence in our data centers. No storage designer would spec a system built only from HDDs anymore.
However, SSD-only storage still poses a substantial cost premium over a blended storage fleet of SSD and HDD. The challenge is putting the right data — the data that gets the most I/Os or needs the lowest latency — on SSD while keeping the bulk of the data on HDD.
With that, let’s look at how Colossus identifies the hottest data.
Colossus has several ways to select data for placement on SSDs:
- Force the system to place data on SSD: Here, an internal Colossus user forces it to place data on SSD. Users can do that through the path /cns/ex/home/leg/partition=ssd/myfile. This is the easiest approach, and ensures that the file is completely stored on SSD. At the same time, it is the most expensive option.
- Utilize hybrid placement: More sophisticated users can take advantage of “hybrid placement” and tell the Colossus system to place only one replica on SSD: /cns/ex/home/leg/partition=ssd.1/myfile. This is a more affordable approach, but if the D server with the SSD copy is unavailable, accesses suffer from the latency of HDDs.
- Use L4: For the bulk of the data at Google, most developers use the newer L4 distributed SSD caching technology, which dynamically picks the data that is most suitable for SSD.
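As a rough illustration of the three options above, the following sketch builds the corresponding file paths. Only the partition=ssd and partition=ssd.1 path attributes come from the examples in the text; the placement_path helper itself is hypothetical and not part of any real Colossus client library.

```python
# Hypothetical helper that spells out the three placement choices above.

def placement_path(base_dir: str, filename: str, mode: str = "l4") -> str:
    """Return a file path whose partition attribute encodes SSD placement."""
    if mode == "ssd":
        # All replicas on SSD: simplest and fastest, but most expensive.
        return f"{base_dir}/partition=ssd/{filename}"
    if mode == "hybrid":
        # Hybrid placement: only one replica on SSD; cheaper, but reads fall
        # back to HDD latency if the D server with the SSD copy is unavailable.
        return f"{base_dir}/partition=ssd.1/{filename}"
    # Default: no explicit placement attribute; rely on L4 to cache hot data.
    return f"{base_dir}/{filename}"


# For example, the fully-SSD path from the text:
# placement_path("/cns/ex/home/leg", "myfile", "ssd")
#   -> "/cns/ex/home/leg/partition=ssd/myfile"
```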
L4 read caching
The L4 distributed SSD cache analyzes the access patterns for an application and automatically places the data that is most suited for SSD.
When functioning as a read cache, L4 index servers maintain a distributed map of which data currently resides on SSD:
When an application wants to read some data, it first consults an L4 index server. The index tells the client whether the data is in the cache, in which case the client reads it from one or more SSD servers, or reports a cache miss, in which case the client fetches the data from whichever disk Colossus has placed it on.
On cache misses, L4 can decide to insert the accessed data into the SSD cache. It does so by instructing an SSD storage server to copy the data from the HDD server. Eventually, as the cache fills up, L4 evicts some items from the cache, freeing space for new insertions.
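The following is a deliberately simplified sketch of that read-cache flow, assuming a single in-process LRU structure standing in for the distributed index servers and SSD storage servers; the class and method names are illustrative, not L4’s actual interfaces.

```python
# Simplified sketch of the L4 read-cache flow: consult the index, serve hits
# from SSD, fall back to HDD on misses, optionally insert on a miss, and evict
# when the cache fills. Names and structure are illustrative only.
from collections import OrderedDict


class L4ReadCacheSketch:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.index = OrderedDict()  # (file, block) -> cached bytes, LRU order

    def read(self, file: str, block: int, read_from_hdd) -> bytes:
        key = (file, block)
        if key in self.index:              # cache hit: serve from SSD
            self.index.move_to_end(key)
            return self.index[key]
        data = read_from_hdd(file, block)  # cache miss: read the HDD copy
        self._maybe_insert(key, data)      # L4 may insert the data on a miss
        return data

    def _maybe_insert(self, key, data: bytes) -> None:
        # Evict least-recently-used entries until the new data fits.
        while self.used + len(data) > self.capacity and self.index:
            _, evicted = self.index.popitem(last=False)
            self.used -= len(evicted)
        if self.used + len(data) <= self.capacity:
            self.index[key] = data
            self.used += len(data)
```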
L4 can be more or less aggressive about how much data to place on SSD. We use a machine learning (ML) powered algorithm to decide between different policies for each workload: insert into the L4 cache when the data is written, after the first time it is read, or only after the second time it is read within a short time period. To learn more about how we do this, please read the CacheSack paper.
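As a rough model of those three admission policies, here is one way they might be expressed; the AdmissionPolicy names, the should_insert helper, and the 10-minute window are assumptions for this sketch rather than L4’s actual implementation, which is described in the CacheSack paper.

```python
# Illustrative model of the three cache-admission policies mentioned above.
import enum
import time


class AdmissionPolicy(enum.Enum):
    ON_WRITE = "insert when the data is written"
    ON_FIRST_READ = "insert after the first read"
    ON_SECOND_READ = "insert only after a second read within a short window"


def should_insert(policy: AdmissionPolicy, event: str,
                  recent_read_times: list, window_seconds: float = 600.0) -> bool:
    """Decide whether this access should admit the block into the SSD cache."""
    now = time.time()
    if policy is AdmissionPolicy.ON_WRITE:
        return event == "write"
    if policy is AdmissionPolicy.ON_FIRST_READ:
        return event == "read"
    # ON_SECOND_READ: admit only if the block was also read recently.
    return event == "read" and any(now - t <= window_seconds for t in recent_read_times)
```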
This approach works well for applications that read the same data often and has dramatically improved our IOPS and throughput. At the same time, it does have a major weakness: we still write the new data to an HDD. And it turns out that there are other important classes of data where L4 read caching isn’t as effective at saving resources as we’d like, namely data that is written, read, and deleted quickly (such as intermediate results for a large batch-processing job), and database transaction logs and other files that see many tiny appends. Because both of these workloads are poorly suited to HDD, it’s preferable to write them directly to SSD, and skip HDD entirely.
L4 writeback for Colossus
Now, imagine that an internal Colossus user wants to place a portion of their data on SSD — they need to carefully think about which files should go on SSD and how much SSD quota they should purchase for their workload. And if they have older files that aren’t being accessed, they might want to migrate that data from SSD to HDD. But from watching our users’ experience, we know that deciding on these parameters is quite hard. To help, we enhanced the L4 service to automate this work.
When being used as a writeback cache, the L4 service advises Colossus curators on whether or not to put a new file on SSD, and for how long. This is tricky! At file creation time, Colossus can only see the application that’s creating the file and its name — it doesn’t know for sure how it will be used.
To solve this problem, we use the same approach as the L4 read cache described in the CacheSack paper mentioned above. The application passes features to L4 such as the file type, or metadata about the database column whose data is stored there. L4 uses these features to segregate the files into “categories” and observes the I/O patterns of each category over time. These I/O patterns drive an online simulation of different placement policies, such as “place on SSD for one hour,” “place on SSD for two hours,” or “don’t place on SSD.” Based on this simulation, L4 chooses the best policy for each category.
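Here is a toy version of that idea: replay each category’s recent I/O trace against a few candidate “keep on SSD for N hours” policies and keep the one with the best offload per unit of SSD consumed. The FileTrace fields, the candidate policies, and the scoring are assumptions made for illustration; the production system described in the CacheSack paper is far more sophisticated.

```python
# Hypothetical sketch of per-category policy simulation for writeback placement.
from dataclasses import dataclass


@dataclass
class FileTrace:
    create_time: float          # seconds since epoch
    read_times: list            # timestamps of reads of this file
    delete_time: float          # when the file was deleted
    size_bytes: int


# Candidate policies: keep a new file on SSD for this many hours (0 = never).
CANDIDATE_POLICIES_HOURS = [0.0, 1.0, 2.0]


def simulate(traces: list, ssd_hours: float) -> tuple:
    """Return (HDD reads avoided, SSD byte-hours consumed) for one policy."""
    reads_avoided, byte_hours = 0, 0.0
    for t in traces:
        ssd_until = t.create_time + ssd_hours * 3600
        resident_until = min(ssd_until, t.delete_time)
        byte_hours += t.size_bytes * max(0.0, resident_until - t.create_time) / 3600
        reads_avoided += sum(1 for r in t.read_times if r <= resident_until)
    return reads_avoided, byte_hours


def choose_policy(traces: list) -> float:
    """Pick the candidate policy with the best reads avoided per SSD byte-hour."""
    best_policy, best_score = 0.0, 0.0
    for hours in CANDIDATE_POLICIES_HOURS:
        avoided, cost = simulate(traces, hours)
        score = avoided / cost if cost > 0 else 0.0
        if score > best_score:
            best_policy, best_score = hours, score
    return best_policy
```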
These online simulations also serve another important purpose: They predict what placement L4 would choose if more or less SSD capacity were available. Thus, we can predict how much I/O can be offloaded from HDD with different amounts of SSD. These signals drive purchases of new SSD hardware and inform planners of ways to shift SSD capacity between applications to maximize efficiency.
When instructed, the curator can then direct new files to SSD rather than to the default HDD. After a set amount of time, if the file still exists, the curator migrates the data from SSD to HDD:
When the L4 system’s simulation correctly predicts the file access patterns, we place a small portion of our data on SSD, which absorbs most of the reads (reads tend to happen to newly created files), before migrating the data to cheaper storage, minimizing the overall cost. In the best case, the file is deleted before we migrate it to HDD, avoiding HDD I/O entirely.
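A minimal sketch of that lifecycle follows, assuming a hypothetical CuratorSketch interface (create_file, delete_file, migration_pass) that is not Colossus’s real curator API: new files start on SSD if their category’s policy says so, files deleted early never touch HDD, and a periodic pass migrates the survivors.

```python
# Hypothetical sketch of the writeback lifecycle described above.
import time
from dataclasses import dataclass, field


@dataclass
class FileState:
    path: str
    created: float
    on_ssd: bool
    ssd_ttl_hours: float


@dataclass
class CuratorSketch:
    files: dict = field(default_factory=dict)

    def create_file(self, path: str, ssd_ttl_hours: float) -> None:
        # ssd_ttl_hours comes from L4's per-category policy (0 = straight to HDD).
        self.files[path] = FileState(path, time.time(), ssd_ttl_hours > 0, ssd_ttl_hours)

    def delete_file(self, path: str) -> None:
        # Best case: the file is deleted while still on SSD, so it never
        # generates any HDD I/O at all.
        self.files.pop(path, None)

    def migration_pass(self) -> None:
        # Periodically move files whose SSD time has expired down to HDD.
        now = time.time()
        for f in self.files.values():
            if f.on_ssd and now - f.created > f.ssd_ttl_hours * 3600:
                f.on_ssd = False   # stand-in for the actual SSD -> HDD migration
```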
Colossus SSD and Google Cloud
As the basis for all of Google and Google Cloud, Colossus is instrumental in delivering reliable services for billions of users, and its sophisticated SSD placement capabilities help keep costs down and performance up while automatically adapting to changes in workload.
Ultimately, our goal is to maximize storage efficiency and performance without end-users having to be an expert on every detail of HDDs, SSDs, and Colossus. We’re proud of the system we’ve built so far and look forward to continuing to improve its scale, sophistication, and performance. Visit us at Next ‘25 and attend breakout sessions “What’s new with Google Cloud’s Storage” (BRK2-025) and “AI Hypercomputer: Mastering your Storage Infrastructure” (BRK2-020) to learn more.