UPDATED 10:10 EDT / MAY 28 2014

Joyent: Don’t fire the admin who took down your cloud

I’ve made no secret of my concern that the cloud is a basket not nearly ready to hold all the eggs. The latest example of why this is true occurred Tuesday at cloud operator Joyent when, as The Register tells it, “a fat-fingered admin brought down an entire data center’s compute assets.”

That’s probably the worst thing that can happen to a cloud data center that doesn’t involve the loss of customer data. El Reg, as it seems to prefer being called, describes Joyent — the operator in question — as “now home to most mortified sysadmin in the USA.”

“Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted.  Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time,” Joyent wrote.

“We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational.  We will be providing frequent updates until the issue is resolved.”

The outage began at about 3:30pm PT and continued for about an hour.

“While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter,” explained Joyent’s chief technology officer Bryan Cantrill in a post to Hacker News. “As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn’t happen in the future.”

It’s easy to imagine some companies quickly escorting the “mortified sysadmin” out to the parking lot, never to be seen at Joyent again. I hope that isn’t what happened in this case.

Joining the big leagues…of failures

To me, “fat fingers” is a bit of a hero. He helped who-ever-heard-of-Joyent join the big leagues of cloud vendors — Amazon, Rackspace, Microsoft, Google — who have suffered (usually worse) failures.

Because of this mistake, Joyent says it will be looking at its systems from top to bottom to see how a simple operator error could bring down a data center. That clearly is not the sysadmin’s fault if he didn’t go well out of his way to screw up.

And even if he did, such a shutdown just shouldn’t be possible without a lot more authority than a single sysadmin should ever be allowed. If anyone is personally responsible for the outage it is CTO Cantrill, but I’ll give him a bit of a break on this breakdown. Let’s see how he responds and what he tells customers going forward.

On its website, Joyent says it’s hiring. If the company is looking for talent to make its cloud more idiot-proof and resilient, go for it. But if there is a sudden unexpected vacancy due to the departure of one particular sysadmin, I’d be very disappointed.

Lesson learned

If Joyent learns its lesson(s) from this episode and worse problems are prevented in the future, this may be the luckiest mistake the company ever suffered.

The moral of this story — for the company — ought to be “forgive and learn.”

For all cloud customers — not just Joyent’s — the moral is “demand third-party certification” to support your trust in the cloud vendors your company uses. Stuff like this just shouldn’t happen.

photo credit: Adam Foster | Codefor via photopin cc
