

As a new generation of corporations pursues the efficiencies of cloud computing, it faces a new challenge: running a business in a brand-new environment without the benefit of tried-and-true methods.
“The industry has done a really fabulous job of telling people how to get to cloud, but we’re awful about telling them how to live there,” said Dave Rensin (pictured), director of customer reliability engineering and network capacity at Google Cloud.
Rensin spoke with John Furrier (@furrier) and Jeff Frick (@JeffFrick), co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the recently concluded Google Cloud Next event in San Francisco. They discussed Google site reliability engineering and how the concept is being turned outward to help businesses operate successfully in the cloud. (* Disclosure below.)
In 2004, Google LLC had just gone public, and internal calculations showed that within 10 years the company would need a million systems operators just for its popular search function. In its unorthodox way, Google reimagined its production systems by applying software engineering skills to operations problems and named the method site reliability engineering, or SRE.
“The basic philosophy is simple: give to the machines all the things machines can do, and keep for the humans all the things that require human judgment. That’s how we get to a place where, like, 2,500 SREs run all of Google,” Rensin said.
A primary principle of SRE is to forget about aiming for perfection. “Any system involving people is going to have errors. So any goal you have that assumes perfection, 100 percent uptime, 100 percent customer satisfaction, zero error, that kind of thing, is a lie,” Rensin said. He went on to explain that there is a “magic line,” known as the service level objective (SLO), that marks the boundary between satisfied and unsatisfied customers. Operate below the SLO line and customers are angry; operate above it and resources are being wasted on incremental improvements that customers don’t notice.
“The difference between perfection, 100 percent, and the line you need [the SLO], which is very business-specific, we say treat as a budget,” Rensin said. This “error budget” represents time and money that can be spent on innovation.
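To make the arithmetic concrete, here is a minimal Python sketch of how an error budget might be computed from an availability SLO. The 99.9 percent target, the 30-day window, and the function names are illustrative assumptions, not figures or tooling from Rensin or Google.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window.

    The budget is the gap between perfection (100 percent) and the SLO,
    applied to the total number of minutes in the window.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Budget left after subtracting downtime already incurred.

    A negative result means the service has dropped below its SLO line.
    """
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # Hypothetical service: 99.9 percent SLO over 30 days,
    # with 10 minutes of downtime so far.
    budget = error_budget_minutes(99.9, 30)
    left = budget_remaining(99.9, 30, downtime_minutes=10)
    print(f"Error budget: {budget:.1f} minutes")
    print(f"Remaining:    {left:.1f} minutes")
```

Under these assumed numbers the budget works out to roughly 43 minutes per 30 days. While some of that remains, a team can spend it on riskier releases and experiments; once it is exhausted, the same logic says to prioritize reliability work.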
As director of customer reliability engineering, Rensin takes Google’s internal SRE methodology and turns it outwards to work with businesses of all sizes. Google has published a book on SRE, with an accompanying workbook to help guide companies through implementing SRE in their own operations.
“Our goal is that every firm from five to 50,000 can follow these principles. And they can. We know they can do it, and it’s not as hard as they think,” Rensin concluded.
(* Disclosure: Google Cloud sponsored this segment of theCUBE. Neither Google nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)