

Facebook was perhaps the biggest topic in the cloud yesterday, as it shut down for 2.5 hours in its worst outage in 4 years. The problem was not limited to the website: its widgets, applications and the Like button, which seems to be everywhere on the internet, were down as well. This caused innumerable complaints from users.
The problem was caused by Facebook’s automated system that checks for invalid configuration values in its cache. Instead of helping, the system made matters worse, forcing Facebook to make the tough decision to shut the entire site down in order to prevent data loss.
“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second,” said Robert Johnson, Director of Software Engineering at Facebook.
“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover,” he added.
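The loop Johnson describes can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not Facebook's actual code: the names `simulate`, `GOOD`, `db_capacity` and `delete_on_error` are hypothetical, all clients are assumed to check the cache at the same moment in each round, and the database is assumed to answer only a fixed number of queries per round and fail the rest.

```python
GOOD = "good"  # hypothetical valid configuration value


def simulate(clients, db_capacity, rounds, delete_on_error):
    """Return the number of database queries issued in each round.

    Toy model: if the cached value is not valid, every client queries
    the database. The first `db_capacity` queries succeed and repair
    the cache; the rest fail. With `delete_on_error=True` a failed
    query deletes the cache key, as in the behaviour quoted above.
    """
    cache = {"config": "bad"}  # the original invalid value
    queries_per_round = []
    for _ in range(rounds):
        # All clients look at the cache at the same moment.
        queries = clients if cache.get("config") != GOOD else 0
        queries_per_round.append(queries)
        for i in range(queries):
            if i < db_capacity:
                cache["config"] = GOOD       # successful query repairs the cache
            elif delete_on_error:
                cache.pop("config", None)    # error treated as an invalid value
            # otherwise: an error leaves the cached value untouched
    return queries_per_round


if __name__ == "__main__":
    print(simulate(1000, 100, 4, delete_on_error=True))   # load never drops
    print(simulate(1000, 100, 4, delete_on_error=False))  # load drops after round one
```

In this model, treating a database error as an invalid value keeps every client hitting the database in every round, even though the value it returns is already correct. Leaving the cache alone on errors lets the first successful repair stick, so the query load falls away after the first round.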
So did Facebook handle its outage well? The company blog was updated, indicating Facebook’s dedication to conveying ongoing issues to its massive user base. But some users complained that human testing should have prevented such debilitating errors in the first place, and that writing a blog post about the issues after the fact does very little to address user concerns.
Several users also went on to suggest that Facebook create an error message page, similar to the Twitter fail whale, to keep them informed while a problem is ongoing. Perhaps that’s why Twitter users really don’t seem too put off by site inaccessibility, as long as that fail whale is there to keep them in the know.