UPDATED 12:22 EDT / APRIL 16 2020

Building for resiliency: Gremlin’s chaos engineering exposes vulnerabilities early

The job of a site reliability engineer, also known as an SRE, is typically characterized as bringing engineering principles to infrastructure and operations problems, with a focus on creating reliable and scalable systems.

What most SRE job descriptions fail to add is that the job also requires an ability similar to that of a firefighter who must charge into a burning building and rapidly find the root cause of the blaze.

“With large-scale incidents, you really need to be able to act fast,” said Tammy Butow (pictured, left), principal site reliability engineer at Gremlin Inc. “If you detect an incident faster, then you’ve got a better chance of making the impact lower so you can contain the blast radius. If you have a fire in the saucepan in your kitchen and you put it out, that’s way better than waiting until your entire house is on fire.”

Butow spoke with John Furrier, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio. She was joined by Alberto Farronato (right), senior vice president of marketing at Gremlin, and they discussed the company’s approach toward protecting complex systems through testing and an upcoming free event.

Lessons learned in failover

Gremlin characterizes itself as the world’s first hosted chaos engineering service, one that aims to build a more reliable internet by turning failure into resilience. The goal is to safely experiment on complex systems before they fail and cost enterprises millions in lost business.

“It’s not so much about creating chaos, but managing chaos that is built into our current system and exposing vulnerabilities before they create problems,” Farronato explained. “You need a fundamentally different approach to go and find where your weaknesses are before they happen.”

This practice can be especially helpful within large financial institutions, such as the National Australia Bank, where Butow worked for six years prior to joining Gremlin.

“We would do large-scale disaster recovery,” Butow said. “That’s where you would failover an entire data center to a secret data center in an unknown location. The reason is because you’re checking to make sure that everything operates OK in a nuclear blast, and you had to do that practice every quarter.”

The coronavirus pandemic prompted Gremlin to create a new conference: a free virtual event called Failover Conf, now scheduled for April 21. It is in addition to the company’s annual Chaos conference.

“We quickly pivoted as a company and created a new online event to give everyone in the community the opportunity,” Farronato said. “This is a conference for anybody who is interested in resiliency if you want to know from the best on how to build business continuity across systems, people and processes.”

Here’s the complete video interview, one of many CUBE Conversations from SiliconANGLE and theCUBE:

Photo: SiliconANGLE
