UPDATED 09:19 EDT / SEPTEMBER 24 2010

Facebook’s Worst Outage Makes Users Yearn for the Fail Whale

Facebook was perhaps the biggest topic in the cloud yesterday, as it went down for about 2.5 hours in its worst outage in four years. The problem was not limited to the website; its widgets, applications and the Like button, which seems to be everywhere on the internet, were down as well. This drew innumerable complaints from users.

The problem was caused by Facebook’s automated system that checks for invalid configuration values in its cache. Instead of helping, the system made matters worse, forcing Facebook to make the tough decision to shut the entire site down so that its databases could recover.

“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second,” said Robert Johnson, Director of Software Engineering at Facebook.

“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover,” he added.
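The loop Johnson describes can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Facebook's code: the `Database` class, the `simulate` function, the per-tick capacity and the client counts are all hypothetical. It compares clients that delete the cache key whenever a database query fails with clients that leave the cache alone on errors.

```python
GOOD = "good-value"  # hypothetical valid configuration value


class Database:
    """Toy database that can answer only `capacity` queries per tick."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.load = 0  # queries attempted during the current tick

    def new_tick(self):
        self.load = 0

    def query(self):
        self.load += 1
        if self.load > self.capacity:
            raise TimeoutError("database overloaded")
        return GOOD  # the persistent copy has already been fixed


def simulate(num_clients, capacity, ticks, delete_on_error):
    """Return the number of database queries attempted in each tick."""
    db = Database(capacity)
    cache = {"config": "bad-value"}  # stale invalid value still cached
    queries_per_tick = []
    for _ in range(ticks):
        db.new_tick()
        # Every client reads the cache at about the same moment.
        if cache.get("config") != GOOD:
            for _ in range(num_clients):
                try:
                    cache["config"] = db.query()
                except TimeoutError:
                    if delete_on_error:
                        # An error is treated like an invalid value,
                        # so the key is deleted and must be refetched.
                        cache.pop("config", None)
        queries_per_tick.append(db.load)
    return queries_per_tick


if __name__ == "__main__":
    print(simulate(1000, 100, 3, delete_on_error=True))
    print(simulate(1000, 100, 3, delete_on_error=False))
```

With 1,000 clients and room for 100 queries per tick, the delete-on-error variant keeps issuing 1,000 queries every tick, because the failed requests erase the value that the successful ones just cached. The variant that leaves the cache untouched on errors sends 1,000 queries once and then none.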

So did Facebook handle its outage well? The company blog was updated, indicating Facebook’s dedication to conveying ongoing issues to its massive user base. But some users complain that human testing should have prevented such a debilitating error in the first place, and that a blog post written after the fact does very little to address user concerns.

Several users also suggested that Facebook create an error message page, similar to Twitter's fail whale, to keep them informed while a problem is under way. Perhaps that’s why Twitter users don’t seem too put off when the site is inaccessible, as long as the fail whale is there to keep them in the know.

