

Google’s cloud platform suffered another outage over the weekend in an incident that seems similar to a more serious brownout that took place last month.
Google said the latest outage affecting its Compute Engine, which was first reported by The Register, was caused by “packet loss on egress network traffic”, which produced symptoms ranging from “unusually slow responses, to timeouts attempting to contact the VM.” The incident apparently lasted just 43 minutes, during which VMs stayed online, and after which service returned to normal.
The outage was somewhat similar to a much more widely reported incident in February, when “The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information,” leading to downtime across multiple zones for about one hour.
Explaining the latest incident, Google put out the following statement: “The root cause of the packet loss was a configuration change introduced to the network stack designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM. The configuration change had been tested prior to deployment to production without incident. However as it was introduced into the production environment it affected some VMs in an unexpected manner.”
Google pointed out that this weekend’s outage was not nearly as severe as the one seen in February, and said it was now busy investigating why prior testing of the configuration change failed to pick up any potential problems with the service.
The good news is that Google plans to tighten its processes to prevent a repeat, saying that future changes won’t be introduced until its test suite has been improved to the point where it demonstrates parity with behavior seen in actual production.
“Google engineers are immediately amending the rollout protocol for network configuration changes so that future production changes will be applied to a small fraction of VMs at a time, reducing the exposure in the event of undetected behavior,” the company said.
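The staged approach Google describes, applying a change to a small fraction of VMs at a time, can be illustrated with a minimal sketch. This is not Google’s rollout tooling: the function name `staged_rollout`, the callbacks `apply_change` and `is_healthy`, the batch fractions and the 5 percent failure threshold are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    vms: Sequence[str],
    apply_change: Callable[[str], None],   # hypothetical: pushes the new config to one VM
    is_healthy: Callable[[str], bool],     # hypothetical: health probe run after the change
    fractions: Iterable[float] = (0.01, 0.1, 0.5, 1.0),  # assumed cumulative stage sizes
    max_unhealthy_ratio: float = 0.05,     # assumed threshold for halting the rollout
    seed: int = 0,
) -> List[str]:
    """Apply a change in growing batches and stop when a batch looks unhealthy.

    Returns the list of VMs that received the change.
    """
    order = list(vms)
    random.Random(seed).shuffle(order)  # spread each batch across the fleet
    updated: List[str] = []
    for fraction in fractions:
        # Cumulative number of VMs that should be updated after this stage.
        target = max(1, int(len(order) * fraction))
        batch = order[len(updated):target]
        if not batch:
            continue
        for vm in batch:
            apply_change(vm)
        updated.extend(batch)
        unhealthy = sum(1 for vm in batch if not is_healthy(vm))
        if unhealthy / len(batch) > max_unhealthy_ratio:
            break  # halt so that only a fraction of the fleet is exposed
    return updated
```

With a fleet of 100 VMs and a change that breaks every VM it touches, this sketch stops after the first one-VM batch, which is the kind of reduced exposure the statement refers to.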
Google’s embarrassment follows a similarly serious outage at Microsoft’s Azure cloud in November 2014, which affected the availability of its Office 365 and OneDrive cloud storage services.
Both Google and Microsoft are generally viewed as leading players in the world of hyperscale cloud computing, but as these incidents continue to demonstrate, the cloud is still an evolving technology and no major provider can yet claim 100 percent uptime.
Photo Credit: ThoseBlastedApes via Compfight cc