UPDATED 14:38 EDT / APRIL 19 2021


Google shares technical overview of its exabyte-scale Colossus file system

Google LLC today published a technical blog post detailing Colossus, the internal file system that powers Google Cloud and many of the company’s consumer services, including its namesake search engine.

Colossus is a software platform responsible for managing the storage hardware running in the Alphabet Inc. subsidiary’s data centers. It also helps move information in and out of the storage hardware for the applications that depend on it.

At the scale at which Google operates, making sure that users can access their data reliably involves a lot of computational heavy lifting. One of Colossus’ main functions is preventing and, when needed, fixing technical issues that could keep customers from retrieving their information when they need it.

At Google’s data centers, “hardware is failing virtually all the time — not because it’s unreliable, but because there’s a lot of it,” explained Dean Hildebrand, a technical director at Google Cloud’s Office of the Chief Technology Officer, and Denis Serenyi, the tech lead for Google Cloud Storage. “Failures are a natural part of operating at such an enormous scale, and it’s imperative that its file system provide fault tolerance and transparent recovery.”

Colossus uses a set of programs Google’s engineers refer to as Custodians to ensure the reliability of storage hardware. When one of the disk drives inside a storage system fails, the Custodians can reassemble the lost information from the data remaining in the drives that are still operational. The programs also perform a variety of other tasks to increase the durability of Google’s storage environments, which can scale to multiple exabytes and tens of thousands of machines, according to Hildebrand and Serenyi.
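
To make the idea of rebuilding lost data concrete, here is a minimal sketch in Python. Google's post, as summarized here, does not say which redundancy scheme the Custodians rely on, so simple XOR parity is used purely as a stand-in; the function names and sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block over all data blocks (assumed scheme, not Colossus')."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing block from the survivors plus the parity block."""
    return xor_blocks(surviving_blocks + [parity])


# Hypothetical example: three drives hold one block each, a fourth holds parity.
drives = [b"goog", b"file", b"data"]
parity = make_parity(drives)

lost_index = 1  # pretend the second drive failed
survivors = [block for i, block in enumerate(drives) if i != lost_index]
recovered = rebuild_missing(survivors, parity)
```

XOR parity can only tolerate one lost block per group; production systems generally use stronger encodings or replication, which this sketch does not attempt to model.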

The Custodians are one part of Colossus’ control plane, a mesh of software components that functions as the system’s control center. The control plane also includes Curators, another set of specialized programs that manage the metadata associated with the files that customers and Google’s own services store in Colossus. Metadata contains descriptive details about files, such as when they were created, and is used in many operational processes.

Yet another task the control plane manages is optimizing performance and cost. Colossus stores the records that Google services and customers use most often on high-speed flash storage so they can be accessed quickly. Data used less often is kept on slower but more cost-efficient disk storage.
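
The hot-versus-cold split described above can be sketched in a few lines of Python. This is an illustration only: the ranking rule, the `flash_slots` parameter and the record names are assumptions, not details from Google's post.

```python
def assign_tiers(access_counts, flash_slots):
    """Map each record to "flash" or "disk".

    The `flash_slots` most-accessed records go to flash; everything else goes
    to disk. Ties are broken by record name so the result is deterministic.
    """
    ranked = sorted(access_counts, key=lambda rec: (-access_counts[rec], rec))
    hot = set(ranked[:flash_slots])
    return {rec: ("flash" if rec in hot else "disk") for rec in access_counts}


# Hypothetical access counts for four records.
counts = {"index": 900, "logs-2019": 3, "photos": 450, "backup": 1}
placement = assign_tiers(counts, flash_slots=2)
```

Here the two busiest records, `index` and `photos`, land on flash while the rarely read ones stay on disk. A real system would presumably also weigh record size and cost, which this sketch ignores.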

Colossus separately optimizes the data that is sent to disk. The most frequently accessed information is often also the newest, Hildebrand and Serenyi explained, a correlation that allows Colossus to improve performance for customers. The system speeds up operations on the newest and most frequently accessed records by distributing them among many disk drives. Over time, as newer records take their place, the earlier information is moved to larger-capacity drives to optimize operational efficiency.
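
One common way to spread a record over many drives is round-robin striping, sketched below in Python. The article does not describe how Colossus actually distributes data, so the chunking rule, the `stripe` function and its parameters are all assumptions made for illustration.

```python
def stripe(record, drive_count, chunk_size):
    """Split `record` into fixed-size chunks and deal them out round-robin.

    Returns {drive_index: [chunks]} so that reads of one record can be
    served by several drives at once.
    """
    chunks = [record[i:i + chunk_size] for i in range(0, len(record), chunk_size)]
    layout = {drive: [] for drive in range(drive_count)}
    for n, chunk in enumerate(chunks):
        layout[n % drive_count].append(chunk)
    return layout


# Hypothetical example: an 8-byte record over 3 drives in 2-byte chunks.
layout = stripe(b"abcdefgh", drive_count=3, chunk_size=2)
```

Because each drive holds only part of the record, a read can pull chunks from all three drives in parallel rather than waiting on a single one.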

The different hardware resources, Custodians, Curators and other complex details of Colossus are hidden behind an abstraction layer to make working with the system easier. Developers don’t have to specify the exact type of storage hardware their applications require. Instead, they only need to provide high-level details such as capacity and speed requirements. Colossus then automatically assigns their data to the most suitable hardware in Google’s data centers.
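
The requirement-driven placement described above might look something like the following Python sketch. Everything in it is hypothetical: the `StorageClass` fields, the catalog figures and the cheapest-that-fits rule are assumptions for illustration, not Colossus' actual interface or policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    name: str
    read_mb_per_s: int   # assumed throughput figure
    free_gb: int         # assumed free capacity
    cost_per_gb: float   # assumed relative cost


def pick_storage(classes, capacity_gb, min_read_mb_per_s):
    """Return the cheapest class that meets both requirements, or None."""
    candidates = [
        c for c in classes
        if c.free_gb >= capacity_gb and c.read_mb_per_s >= min_read_mb_per_s
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.cost_per_gb)


# Hypothetical catalog; the numbers are made up.
catalog = [
    StorageClass("flash", 2000, 500, 0.20),
    StorageClass("fast-disk", 200, 5000, 0.05),
    StorageClass("dense-disk", 100, 50000, 0.02),
]

# The caller states only what it needs, not which hardware to use.
choice = pick_storage(catalog, capacity_gb=1000, min_read_mb_per_s=150)
```

In this example the request resolves to `fast-disk`: flash lacks the free capacity and the dense disks are too slow. The caller never names a hardware type, which is the point of the abstraction layer.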

