As Forbes reports, “Amazon’s Cloud service is having a bad couple weeks. For the second time in as many weeks Amazon’s East Coast cloud crashed during a severe storm that left 1.3 million in the Washington D.C. area without power. The outage brought down numerous high profile web sites hosted on Amazon including Netflix, Instagram, Pinterest, and Heroku. Making things worse was the fact that other cloud services hosted in the area experienced no downtime.”
Amazon has been the most successful cloud computing platform at getting startups up and running, from prototype to full scale. Many examples are out there, including Netflix, Pinterest, Zynga, and Instagram, just to name a few of the top names.
This is a case study in disaster recovery, or rather disaster avoidance, so the banter on Twitter over the weekend caught my attention.
After some digging last night, I cobbled together what happened, why it happened, and how it could have been avoided.
The “Play by Play” of the Amazon outage
“Here’s the play-by-play:
At 11:21 PM EST, Amazon Web Services reported, ‘We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.’ And at 11:31 EST, it added, ‘We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.’ By 11:49 EST, it reported that ‘Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.’ But by 12:20 EST the outage continued: ‘We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.’ At 12:54 AM EST, AWS reported that ‘EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.’”
WHO WAS IMPACTED:
Netflix – Streaming service down for 3 hours
Pinterest – Down for 4 hours
Instagram – Down for 10 hours
Heroku – Running in degraded mode for nearly 24 hours (https://status.heroku.com/incidents/387#update-1172)
WHY THIS MATTERED:
People’s Friday night entertainment obviously came to an unexpected halt. The severity of the reactions on Twitter and among the mainstream and tech media demonstrates that people’s entertainment preferences and habits have shifted from TV to content streaming and constant social interaction. That’s why it’s all the more critical for these sites to maintain consistent uptime.
The commonality here is that all of these companies use AWS for their cloud compute and storage resources, known as Elastic Compute Cloud (EC2) and Elastic Block Store (EBS). The issue is not the ability to recover but the time to recovery, which in this case ranged from hours to a full day. With all of these companies relying heavily (if not solely) on AWS, their time to recover was at the mercy of Amazon’s ability to get its infrastructure back under control.
As cloud insider @Ruv pointed out in the Forbes story “Cloud Computing Forecast: Cloudy with A Chance of Fail,” this is also Amazon’s second major outage in less than a month, and ironically it took place one day after Google’s Compute Engine announcement.
HOW THIS COULD HAVE BEEN AVOIDED:
Pinterest, Heroku, Netflix, Instagram, and the hundreds of other companies that run all-AWS infrastructure could have performed an automatic failover to another cloud provider and kept data continuously accessible to their users. I’m sure that the big boys like IBM, HP, and EMC will be all over this as a case study in data protection and disaster recovery, as will all the new vendors like VMware and Nirvanix who sell solutions to avoid this.
These back-to-back AWS outages are a clear example of why the best practice is to maintain your compute, data, and storage volumes on more than one cloud services provider: that is, to have more than one cloud provider holding multiple copies of your data in several geographically dispersed locations. This will become more of a trend going forward as public cloud scales to fill the web-scale market.
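To make the failover idea concrete, here is a minimal sketch of the selection logic, not a description of how any of the companies above actually run. It assumes each provider exposes some health probe; the `Provider` class, the `pick_provider` function, and the provider names are all hypothetical, and the probes are stubbed with lambdas so the example runs without a network.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Provider:
    name: str                       # hypothetical label for a cloud provider or region
    is_healthy: Callable[[], bool]  # health probe; a real one would query a status endpoint


def pick_provider(providers: Sequence[Provider]) -> Optional[Provider]:
    """Return the first provider, in priority order, whose health probe passes."""
    for provider in providers:
        try:
            if provider.is_healthy():
                return provider
        except Exception:
            # A probe that raises is treated the same as an unhealthy provider.
            continue
    return None  # every provider is down


# Usage: the primary is reported down, so traffic should go to the secondary.
providers = [
    Provider("primary-cloud", lambda: False),
    Provider("secondary-cloud", lambda: True),
]
chosen = pick_provider(providers)
print(chosen.name if chosen else "no provider available")
```

The selection step is the easy part. The failover is only useful if the secondary already holds a current copy of the data, which is why the multiple-copies practice above matters.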
Just as companies maintain a multi-vendor policy for their physical IT infrastructure, they should follow a similar path for their outsourced, virtual cloud infrastructure as well. This no doubt will become part of the emerging consulting ServicesANGLE landscape as firms like Cloudscaling and Enstratus continue to grow their software and services offerings. I’ve said many times on SiliconANGLE.tv #theCUBE that a new service consultancy will fill the space to meet the new SLAs of public (web scale) cloud.
WHAT ARE SOME OTHER OPTIONS TO CONSIDER:
There are many other reputable cloud providers that can be activated during a crisis such as this to provide automatic failover and enable a company to resume business operations. According to GigaOM, Joyent, which was in the same physical location as Amazon’s affected data center (Ashburn, Virginia), didn’t experience any of the problems that Amazon did. Joyent, which recently raised $85M, offers cloud compute services and should definitely be considered as a viable secondary cloud provider for those relying on AWS as a primary.
Also worth considering is IBM, with its SmartCloud Enterprise services for cloud computing and cloud storage. If IBM’s cloud is architected to handle the demands of the biggest Fortune 100 shops out there, it would make for an ideal secondary cloud provider.
One of the interesting features of IBM’s cloud is its ability to provide live, active/active replicas. Through its OEM deal with Nirvanix, IBM is not providing passive copies of your data; they are literally active/active copies. What this means is that if one location in IBM’s cloud is down, a live replica will automatically be spun up and served to the customer. It doesn’t have to be recovered or recompiled, because it’s already queued up and ready to go. Nobody else is offering this capability today, and in an emergency scenario like the one we saw Friday night it could mean the difference between restoring service in minutes versus several hours.
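As a rough illustration of what active/active means in general, the toy sketch below writes every object to all live replicas and serves reads from whichever replica is still up. It is not how IBM or Nirvanix implement the feature; the `Replica` and `ActiveActiveStore` classes are hypothetical, the replicas are in-memory dictionaries, and the sketch deliberately ignores resynchronizing a replica after it comes back.

```python
class Replica:
    """One copy of the data at a single (hypothetical) location."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ActiveActiveStore:
    """Toy store: every write goes to all live replicas, reads use any live one."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def put(self, key, value):
        """Write to every replica that is up; return how many accepted the write."""
        written = 0
        for replica in self.replicas:
            if replica.up:
                replica.data[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no replica available for write")
        return written

    def get(self, key):
        """Serve the read from the first live replica that holds the key."""
        for replica in self.replicas:
            if replica.up and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


# Usage: lose one location and keep serving from the other with no restore step.
east, west = Replica("east"), Replica("west")
store = ActiveActiveStore([east, west])
store.put("video-42", b"bytes")
east.up = False  # simulate the outage
print(store.get("video-42"))
```

Because the second copy is already live, there is nothing to recover before serving the read, which is the difference from a passive backup.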
Other companies to consider alongside the big boys include VMware and Nirvanix. Nirvanix is a fast-growing cloud startup backed by Khosla Ventures and Valhalla Partners. We have been tracking its progress closely over the past few years, and it has been racking up some serious multi-petabyte wins at places like Cerner Healthcare, USC and National Geographic.
Nirvanix was online this weekend highlighting their solution as what could have been in place to help avoid the downtime for folks like Netflix. What Nirvanix has been offering many companies is the ability to have a “shared nothing” private cloud, which means you have your own storage pools, your own dedicated network, your own namespace, your own multi-tenant file directory and your own management console.
The shift to private clouds is another best practice that companies can consider as an alternative to AWS, just like Zynga did. The unique part about Zynga is that their private cloud can federate to Amazon when needed, so they are not entirely reliant upon Amazon—but can access Amazon’s resources whenever required. DRFortress is a Hawaii-based MSP that is leveraging a similar deployment for cloud storage, by keeping a private cloud in Honolulu with resources on demand from the global Nirvanix public cloud.
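The hybrid pattern comes down to a simple placement rule: fill the private cloud first and send only the overflow to a public provider. The sketch below shows that arithmetic under the assumption that demand and capacity can be counted in the same abstract units; `split_demand` is a hypothetical helper, not anything Zynga or DRFortress has published.

```python
from typing import Tuple


def split_demand(demand_units: int, private_capacity: int) -> Tuple[int, int]:
    """Return (units placed on the private cloud, units burst to the public cloud)."""
    if demand_units < 0 or private_capacity < 0:
        raise ValueError("demand and capacity must be non-negative")
    on_private = min(demand_units, private_capacity)
    return on_private, demand_units - on_private


# Usage: a normal day fits on the private cloud, a traffic spike overflows.
print(split_demand(40, 100))
print(split_demand(120, 100))
```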
–Rely on more than one cloud compute and cloud storage vendor for your back-end cloud infrastructure. Joyent, IBM, and Nirvanix are strong candidates that should be evaluated for your business.
–Look into deploying your own private cloud like Cerner, or implement a hybrid strategy that enables you to scale out to external resources when required, like Zynga.
This recent outage still makes me wonder about two things: 1) whether AWS is viable at full scale, and 2) whether Netflix’s future platform should be mixed (like Zynga’s) to hedge against Amazon being a future competitor to Netflix. I wrote about this dynamic earlier in the year – Netflix is funding the enemy.
Ironically, on point #2 above, Amazon Instant Video on Demand was not affected by the recent outage…