Facebook outage was caused by a routine maintenance error
Facebook Inc. today provided details of the global outage that took its services offline on Monday, and the explanation is surprisingly simple: It was a routine maintenance error.
In a blog post, Santosh Janardhan, vice president of engineering and infrastructure at Facebook, said the outage was triggered by a system that manages the social networking giant’s global backbone capacity.
“The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers,” Janardhan explained. “In the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.”
That, he added, was what caused yesterday’s outage. “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” Janardhan said.
The outage was specifically related to Facebook’s domain name system entries. Some reports around the time of the outage suggested that Facebook had been attacked and its DNS hijacked, but it turned out to be simply human error.
That the issue took so long to fix is another consideration, but it should be seen in the context of how DNS settings propagate. Anyone who has ever owned or run a website knows that DNS changes take considerable time to propagate. NS1 noted that a change to a DNS record can take up to 72 hours to propagate worldwide. Facebook and its related services such as Instagram and WhatsApp were offline for about six hours, an unusually long time for such a prominent website, but it could have been far longer.
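For readers wondering why DNS changes are not instant, the delay comes largely from caching: resolvers keep an answer until its time-to-live (TTL) expires and only then ask the authoritative source again. The following is a minimal sketch of that behavior, not a description of Facebook's systems. The `CachingResolver` class, the `example.test` name, the documentation-range addresses and the one-hour TTL are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # simulated clock time, in seconds


class CachingResolver:
    """Toy resolver (hypothetical): serves a cached answer until its TTL runs out."""

    def __init__(self, authoritative: dict, ttl: float):
        self.authoritative = authoritative  # stands in for the authoritative DNS server
        self.ttl = ttl                      # illustrative TTL, in seconds
        self.cache = {}

    def resolve(self, name: str, now: float) -> str:
        cached = self.cache.get(name)
        if cached is not None and now < cached.expires_at:
            return cached.address  # may be stale if the record changed upstream
        # Cache miss or expired entry: ask the authoritative source again.
        address = self.authoritative[name]
        self.cache[name] = CachedRecord(address, now + self.ttl)
        return address


authoritative = {"example.test": "192.0.2.10"}
resolver = CachingResolver(authoritative, ttl=3600)

first = resolver.resolve("example.test", now=0)
authoritative["example.test"] = "192.0.2.20"  # the record changes at the source
during_ttl = resolver.resolve("example.test", now=1800)  # still inside the TTL
after_ttl = resolver.resolve("example.test", now=3600)   # TTL has expired
print(first, during_ttl, after_ttl)
```

In this sketch the resolver keeps returning the old address for the rest of the hour even though the source has already changed. Real-world propagation involves many independent caches with different TTLs, which is why worst-case estimates run much longer.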
“Incidents like what we see happening with the outages for Facebook, WhatsApp and Instagram demonstrate that operating networks at scale, globally, can be daunting even to the cloud-scale companies,” Shashi Kiran, chief marketing officer at networking company Aryaka Networks Inc., told SiliconANGLE. “We see this as yet another reminder of the strategic role that wide-area networks play in delivering business continuity and application performance.”
The outage not only affected Facebook, Instagram and WhatsApp users but also founder and Chief Executive Officer Mark Zuckerberg. Facebook’s share price dropped significantly due to the outage, causing Zuckerberg’s net worth to drop by about $6 billion. Facebook’s share price has recovered somewhat today, up 2%, to $332.96, but it’s still down from where it was trading before the outage.