UPDATED 18:50 EDT / DECEMBER 23 2020


Google blames last week’s outage on Google User ID Service error

Google LLC said today that a simple “zero” error was responsible for taking its global authentication system offline and preventing users from accessing Gmail, YouTube and its cloud services for more than an hour last week.

The company said one day after the Dec. 14 outage that its preliminary analysis had found that the cause of the incident was an issue with its automated storage quota management system. That, Google said, caused a reduction in the capacity of its central identity management system, thereby blocking people from accessing services that require them to log in.

The outage lasted only about an hour, but millions of people around the world noticed it. It also affected thousands of companies that rely on Google Cloud Platform for computing resources. That’s bad for business, of course, since the reliability and availability of cloud services are among the most important considerations for any enterprise.

Google’s full incident report, provided Tuesday, shows the problem was caused by what it calls a “zero” error generated by a legacy storage quota system that it uses to provision storage automatically for its authentication system.

“As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0,” the report said. “As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups.”

The Google User ID Service maintains a unique identifier for each Google account. It handles authentication credentials for the OAuth tokens and cookies that are used to log people in to a service without entering their password each time. This data is stored in a distributed database that uses the Paxos consensus protocol to coordinate updates: Replicas elect a leader, and a write is committed only once a majority of replicas agree on it.
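To make the failure mode concrete, here is a minimal sketch of an authentication lookup that refuses to serve data once the leader has stopped writing for too long. It is an illustration under stated assumptions, not Google's implementation: the names `Replica`, `lookup` and `StaleDataError`, and the 60-second threshold, are hypothetical and do not come from the incident report.

```python
from dataclasses import dataclass

# Hypothetical threshold; the report does not say how stale is too stale.
MAX_STALENESS_SECONDS = 60.0


class StaleDataError(Exception):
    """Raised when a replica's data is too old to trust for authentication."""


@dataclass
class Replica:
    accounts: dict          # account id -> credential record
    last_write_time: float  # time of the last update accepted from the leader


def lookup(replica: Replica, account_id: str, now: float) -> str:
    """Return the record for account_id, or fail if the replica is stale."""
    age = now - replica.last_write_time
    if age > MAX_STALENESS_SECONDS:
        # Fail closed: serving old credentials could honor a revoked token.
        raise StaleDataError(f"replica data is {age:.0f}s old")
    return replica.accounts[account_id]
```

In this sketch, a leader that can no longer write leaves `last_write_time` frozen, so every lookup starts failing once the threshold passes, even though the stored records themselves are intact.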

“For security reasons, this service will reject requests when it detects outdated data,” Google said. “An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident. Existing safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service.”
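The missing safety check that Google describes can be pictured with a short sketch. The code below is a hypothetical guard, not Google's quota system: the function `propose_quota`, the `QuotaDecision` type and the 1.5x headroom factor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QuotaDecision:
    new_quota: int
    applied: bool
    reason: str


def propose_quota(current_quota: int, reported_usage: int,
                  headroom: float = 1.5) -> QuotaDecision:
    """Resize quota to reported usage plus headroom, with a zero-usage guard."""
    if reported_usage < 0:
        raise ValueError("reported_usage must be non-negative")
    if reported_usage == 0 and current_quota > 0:
        # A live service reporting zero load is more likely a metering
        # fault than a real drop, so hold the current quota.
        return QuotaDecision(current_quota, False,
                             "zero usage reported; holding quota")
    target = int(reported_usage * headroom)
    return QuotaDecision(target, True, "resized to usage plus headroom")
```

Without the zero-usage branch, a reported usage of 0 would yield a target quota of 0, which is the kind of reduction the report says prevented the Paxos leader from writing.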

Google’s report also covered the impact of the outage on Google Cloud Storage, Google Cloud Network, Google Kubernetes Engine, Google Workspace (formerly G Suite) and Google Cloud support services. It said that “all authenticated Google Workspace apps were down for the duration of the incident.” In addition, about 4% of requests to the GKE control plane API failed, and nearly all customer and Google-managed workloads were unable to report metrics to Cloud Monitoring.

Google’s report concluded that the majority of its authenticated services across Google Cloud and Google Workspace saw “elevated error rate,” and that all of its services that require users to log in with a Google Account were “affected with varying impact.”

