This weekend some Google e-mail users woke up to discover most of their messages, contacts, and recorded chat sessions had gone up in smoke. Google’s engineers have determined the event was caused by a bug introduced in their most recent backup software. While the bug affected only 0.02% of their userbase, that’s still almost 40,000 users.
Many of these users had their information restored very quickly when Google redirected them to different data centers and reconstructed their data from the multiple first-stage backup sources it maintains. However, because some of those were also affected, Google has revealed that it also backs up information offline on tape.
“To protect your information from these unusual bugs, we also back it up to tape,” wrote Ben Treynor, Google’s VP Engineering and Site Reliability Czar. “Since the tapes are offline, they’re protected from such software bugs. But restoring data from them also takes longer than transferring your requests to another data center, which is why it’s taken us hours to get the email back instead of milliseconds.”
For those who don’t know about tape backups, picture the computers of older movies: refrigerator-sized boxes spooling reel-to-reel magnetic tape back and forth. That’s still the image that springs to mind when tape backup comes up, but since the 20th century we’ve moved a bit beyond that. Google probably doesn’t have a giant, cavern-spanning datacenter hidden beneath some volcano in the Pacific, containing acres and acres of reel-to-reel tape machines spinning away day and night backing up your data. No matter how much we’d love to see an installation like that, if only to put Google up there with the technology-crazed supervillains of James Bond movies.
In fact, EMC recently launched its Data Domain deduplication technology, which makes tape backups obsolete for everything short of a total meltdown.
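The core idea behind deduplication is simple: split the data into chunks, hash each chunk, and store any given chunk only once, no matter how many files contain it. A minimal sketch in Python (the class, the fixed 4 KB chunk size, and the method names are my own illustration; production systems like Data Domain use variable-length chunking and far more machinery):

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size for illustration


class DedupStore:
    """Toy content-addressed store: identical chunks are stored once."""

    def __init__(self):
        self.chunks = {}  # sha256 digest -> chunk bytes
        self.files = {}   # filename -> ordered list of digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if new
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reassemble the file from its chunk digests.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        # Physical footprint: each unique chunk counted once.
        return sum(len(c) for c in self.chunks.values())
```

Two files with identical content cost the store only one copy of each chunk, which is why nightly full backups of mostly-unchanged mailboxes compress so dramatically.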
Actual tape backups of Google’s cloud-based storage network would take forever to restore from, and possibly even longer to record in the first place, especially given how much information Google stores and how often it changes.
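A back-of-envelope calculation shows why. The figures below are entirely hypothetical, but the shape of the math is the point: restore time is total data divided by aggregate drive throughput, and real tape adds seek and cartridge-swap overhead on top:

```python
def restore_hours(total_gb: float, drives: int, mb_per_s: float) -> float:
    """Hours needed to stream `total_gb` of data off `drives` tape
    drives in parallel at `mb_per_s` each (ignores seek time and
    tape swaps, which make real restores even slower)."""
    seconds = (total_gb * 1024) / (drives * mb_per_s)
    return seconds / 3600


# Even a modest 1 TB restore ties up a single 100 MB/s drive for hours.
print(round(restore_hours(1024, drives=1, mb_per_s=100), 1))  # prints 2.9
```

Doubling the drives halves the time, which is why restoring tens of thousands of mailboxes quickly would demand a behemoth of a tape installation.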
Instead, what they probably use is a kind of virtualized tape storage spanning multiple data centers and a lot of hard drives. Most of those disks likely spend the bulk of their day idle, waking only to spill their contents for a restore or to take on the next round of backups. The result is a large array of disks that remains largely offline except to record new backups or to provide restoration in the event that the primary backups fail. The virtualized tapes are “offline” in the sense that they don’t interact with the primary backups, and they serve to decentralize and blunt the impact of massive crashes at Google’s bigger data centers.
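That behavior can be sketched as a toy model (the class and its methods are my own illustration, not anything Google has described): a disk in a virtual tape library stays offline except for the brief moments it records a backup or serves a restore, so a software bug in the live serving path never touches it.

```python
class VirtualTapeDisk:
    """Toy model of one disk in a virtual tape library: spun down
    ('offline') except while writing a backup or serving a restore."""

    def __init__(self):
        self.online = False
        self.snapshots = []  # list of (label, data) backups, oldest first

    def backup(self, label: str, data: bytes) -> None:
        self.online = True                      # spin up only for the write
        self.snapshots.append((label, bytes(data)))
        self.online = False                     # back offline, out of reach of bugs

    def restore(self, label: str) -> bytes:
        self.online = True                      # spin up only for the read
        try:
            for lbl, data in reversed(self.snapshots):
                if lbl == label:
                    return data
            raise KeyError(label)
        finally:
            self.online = False
```

The design choice the model captures is the isolation window: because the disk is offline between operations, a corrupting bug in the primary system can only damage data at the moment of a backup, not continuously.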
Another reason why Google is probably using virtualized tape (and not actual tape) is how much of the lost data it was able to restore. Real tape is notoriously slow, so without a behemoth setup Google couldn’t pour that much data out of its network quickly, and people who lost everything wouldn’t get most of it back. They would also be missing whatever span of time fell between the last backup and the crash. Few details are available on how much data is missing for the restored users, but it’s likely they’re almost entirely restored and not missing days or weeks of data.
Some media outlets have been using this incident as an example of the failure of the cloud, but I think it’s a brilliant example of how well the cloud can be used for crash recovery. Because of its decentralized nature, when things go wrong in the cloud they tend to affect only a few users, and those users get back on their feet more quickly. If this had happened to a monolithic data center, the damage could have been far more widespread and the recovery would have required a lot more direct work to restore everyone.