Are You Available? Cloud Uptime a Year After the Great Amazon Web Services Outage


“Can we take availability off cloud concerns list?” Krishnan Subramanian, principal analyst at the newly formed Rishidot Research, asked at the beginning of last year. Google had just announced a new SLA that removed exemptions for planned maintenance and was bragging about 99.984% uptime. Then, just a few months later, came the Great Amazon Web Services Outage. Some AWS customers experienced multiple days of downtime. In response, Subramanian wrote on Twitter that perhaps it was time for some introspection on the part of cloud computing advocates, though once the dust had settled it didn’t take long for him to write an impassioned defense of the cloud.

It’s been almost a year since that outage, and in that time we’ve seen another small AWS outage (along with a longer one in Dublin), ongoing availability problems with Microsoft BPOS (the predecessor to Office 365) and a recent embarrassing Microsoft Azure outage. I decided to catch up with Subramanian to talk about cloud availability and what it means to enterprises thinking about cloud computing.

“I see it from two different angles,” Subramanian says. First he points out that there are no statistics indicating that downtime is worse in the cloud than on-premises. Subramanian says that based on his interactions with IT managers in large enterprises, there’s actually more downtime on-premises than in the cloud. “Since it is internal and they are responsible for their destiny, it doesn’t get amplified in the media,” he says.

At the same time, though, Subramanian says customers need to design their cloud applications for failure. “In order to offer compute resources at scale for an affordable price, service providers are going to build their cloud platforms on commodity hardware,” says Subramanian. “With commodity hardware, outages happen. And at such large-scale, outages will be difficult to manage. So, it is important that apps are developed for cloud architecture keeping this basic nature of cloud in mind. ‘Design for failure’ has to be the mantra for cloud developers.”
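To make “design for failure” a little more concrete, here is a minimal sketch in Python of one common pattern: retry an operation, then fail over to a second provider. It is an illustration only, not something Subramanian proposed. The names `call_with_failover`, `primary_cloud` and `secondary_cloud` are hypothetical, and the two provider functions are stand-ins for real service calls.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]],
    attempts_per_provider: int = 2,
) -> Tuple[str, str]:
    """Try each provider in order; return (name, result) from the first success."""
    errors = []
    for name, operation in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                return name, operation()
            except Exception as exc:  # a real client would catch narrower error types
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_cloud() -> str:
    # Hypothetical stand-in for a provider that is currently down.
    raise ConnectionError("region unavailable")


def secondary_cloud() -> str:
    # Hypothetical stand-in for the same service deployed on another provider.
    return "ok"


if __name__ == "__main__":
    print(call_with_failover([("primary", primary_cloud),
                              ("secondary", secondary_cloud)]))
```

The point of the sketch is that the application, not the provider, decides what happens when a call fails. A production version would likely add backoff between attempts and health checks, which are left out here for brevity.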

I ask Subramanian about the growing trend toward platform-as-a-service systems that run on multiple clouds, such as those based on Cloud Foundry (like Stackato), along with CloudBees, Apprenda, Cumulogic and Orangescape. Others, such as AppFog, Nodejitsu and Heroku, plan to offer multi-cloud deployments in the near future. Subramanian says that he has advocated for the federated model of cloud infrastructure for a long time, but worries that it might not always be cost-effective. “Depending on the importance of uptime, customers should design for failure and use a multi-cloud environments underneath.”
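For a rough sense of the trade-off between uptime and cost, a back-of-the-envelope calculation helps. The Python sketch below converts an availability percentage into expected downtime per year and estimates the combined availability of several providers. It rests on a big assumption: that provider outages are fully independent, which real-world outages (shared networks, shared software bugs) often are not. The function names and the 99.9% figure are illustrative, not from any vendor's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability between 0 and 1."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availabilities) -> float:
    """Availability of a service that is up whenever at least one provider is up.

    Assumes outages are independent across providers (an idealization).
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that this provider is down too
    return 1.0 - all_down


if __name__ == "__main__":
    # The 99.984% uptime figure mentioned earlier in this article.
    print(downtime_hours_per_year(0.99984))
    # Two hypothetical providers at 99.9% each.
    two_clouds = combined_availability([0.999, 0.999])
    print(two_clouds, downtime_hours_per_year(two_clouds))
```

Under these assumptions, 99.984% uptime works out to roughly 1.4 hours of downtime a year, and pairing two 99.9% providers cuts expected downtime from nearly nine hours to well under a minute. Whether that is worth paying for a second cloud is exactly the cost question Subramanian raises.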

When asked about the role of cloud standards, Subramanian says that standards will be important but it’s too early to standardize on one particular API now. Eventually standard APIs will make it possible for developers to seamlessly design for failure across multiple clouds at the infrastructure-as-a-service level. In the meantime, PaaS providers are making it seamless to design for failure at the platform level today. Later, the emergence of open source cloud stacks like OpenStack, CloudStack, Eucalyptus, et cetera, may foster the creation of more IaaS providers, increasing competition and driving down costs, he says.

ISPs also play a role in cloud availability. If you can’t connect to the Internet, you can’t reach your public cloud applications. I tell Subramanian about an IT manager I talked to at a small business in the suburbs of New York City who said he can’t move to cloud-hosted applications for critical business processes because the company’s ISP is too unreliable. Subramanian points out that soon most organizations will face pressure on the business side to take advantage of mobility or other opportunities. In his view, mobile may be a bigger driver for cloud adoption than cost savings, and the new possibilities afforded by cloud computing matter more than shaving costs.

“ISP unreliability is a small problem in this post PC era,” Subramanian says. Personally, I’m not so sure. Wireless reception still causes many connectivity problems, and I think the rising cost of mobile bandwidth may constrain the future of the mobile cloud.

As for those addressing availability issues with a hybrid cloud approach, Subramanian says he sees the hybrid cloud as an on-ramp to the future use of public clouds. “There might be some workloads that could stay on premise because of regulatory needs but most of them will go the public cloud path,” he says. “However, it is not happening now for two reasons.” First, he says, many applications are not designed to run in the cloud, and many cloud platforms are still immature. We’ll see more cloud adoption once enterprises begin to migrate away from legacy solutions. “Second, we are still seeing vendor FUD on cloud computing,” he says.

Services Angle

Since many cloud customers are actually buying through the channel via resellers, consultants and ISVs, I ask Subramanian what questions enterprise customers should ask before selecting a cloud provider. Subramanian suggests:

1) If they want a specific SLA, they should make sure that the cloud service provider offers the SLA they need.
2) They need to make sure that the privacy terms offered by the cloud provider match their needs. They should also make sure that the terms don’t take away the ownership of data from end users.
3) If they are in a regulated industry, they should make sure that the cloud provider meets their regulatory requirements.

Beyond that, Subramanian has written a paper on questions to ask your cloud vendor:

Ten Questions to Ask Your Cloud Vendor

Your Angle

What availability concerns do you still have? If you’ve already moved to the cloud, how has your uptime been?

Cloud photo by Bill Wakefield