What is Site Reliability Engineering (SRE) at Google?

Kalyani Kathar

09 April 2019

SRE stands for Site Reliability Engineering. As the name suggests, SRE is about keeping your site up or, put another way, about building ultra-scalable and highly reliable software systems that stay available to their users.

Site Reliability Engineering (SRE) at Google:

As a multinational, publicly traded organization, Google puts customer satisfaction first and focuses on making its services available to customers whenever they want them. In particular, Google focuses on keeping the site “Google” itself always available and reliable. Normally this goal would be very hard to achieve, but Google treats it as exactly the kind of new challenge it likes to take on.

Google uses SRE to build, deploy, monitor, and maintain some of the largest software systems in the world. SRE is about automating everything that happens between writing the code and the code going into service. This helps keep human error and machine failures from being exposed to users.

Role of site reliability engineer at Google:

Site reliability engineers at Google are responsible for making many things happen. SREs focus on building tools that allow “normal” software engineers to write code that can run on any machine. They also handle the configuration that gets this code running on the relevant machines. When a machine goes down, the tooling moves the load from that machine to another one, transparently. Most importantly, SREs at Google have automated their jobs to such an extent that the manual work is almost non-existent. Very little is done by hand, except perhaps writing some initial configuration files.

Google’s Site Reliability Engineering (SRE) Team:

Google SREs are certainly not just support engineers. In SRE, most operations work is done by automation, and the SREs are there to teach the automation to do new things and to fix it when it goes wrong.

SRE is responsible for important aspects of a service such as availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Google hires people with a particular skill set for the SRE team. Generally, about 50–60% of SREs are Google software engineers, and the remaining 40–50% are candidates who came very close to the Google Software Engineering qualifications. The purpose of this hiring approach is to find people who quickly become bored of performing tasks manually and who have the skill set to write efficient software that replaces their previous manual work.

Principles of Site Reliability Engineering at Google:

Site Reliability Engineering (SRE) has its own principles to follow and those are the basic building blocks.

1. Embracing Risk

Pushing a system’s stability to the maximum is both pointless and counterproductive. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won’t notice extreme availability (like 99.99999%) because the quality of their experience is dominated by less reliable components. A 100% availability requirement severely limits a team’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, which gives them the freedom to continue shipping in the event of a bug. Service owners who are focused on reliability can choose a higher SLO. The SRE discipline quantifies this acceptable risk as an “error budget.” When the error budget is depleted, the focus shifts from feature development to improving reliability.
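
The error budget idea can be made concrete with a little arithmetic. The following is a minimal sketch in Python, assuming an availability SLO expressed as a fraction and a 30-day measurement window; the function names and the window length are illustrative choices, not something Google prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        # A 100% SLO leaves no budget at all: any downtime overspends it.
        return 0.0 if downtime_minutes == 0 else float("-inf")
    return 1.0 - downtime_minutes / budget


def should_pause_feature_work(slo: float, downtime_minutes: float,
                              window_days: int = 30) -> bool:
    """True once the budget is used up and reliability work takes priority."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0.0
```

Under these assumptions, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, while a 100% SLO allows none, which is why such a target leaves no room to ship changes.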

2. Service Level Objectives

The Site Reliability Engineering (SRE) discipline collaboratively decides on a system’s availability targets and on how to measure availability, with input from engineers, product owners, and customers. It can be challenging to have a productive conversation about software development without a consistent way to describe a system’s uptime and availability. Operations teams are constantly putting out fires, some of which turn out to be bugs or issues in developers’ code. However, without a clear measurement of uptime, product teams may not agree that reliability is a problem. This was a main motivating factor for developing the SRE discipline.

SRE ensures that everyone agrees on what to do when availability falls out of specification and how to measure availability. This process includes individual contributors at every level. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are metrics measured over time, such as request latency or throughput in requests per second. They are usually aggregated over time and then converted to a rate, average, or percentile that is compared against a threshold.

SLOs are targets for the cumulative success of SLIs over a window of time agreed upon by stakeholders.

SLIs, SLOs, and SLAs tie back to the DevOps pillar of “measure everything”, which is one of the reasons we say class SRE implements DevOps.
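
To illustrate the relationship between SLIs and SLOs, here is a minimal sketch in Python. It assumes each request is recorded with a latency and a success flag; the `Request` type, the function names, and the sample thresholds are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests in the window that succeeded."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests in the window served faster than the threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    return sum(1 for r in requests if r.latency_ms < threshold_ms) / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is a target that the SLI should meet over the agreed window."""
    return sli >= target


# Example window of four requests (made-up numbers).
window = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=450, ok=True),
    Request(latency_ms=95, ok=False),
]
availability = availability_sli(window)
fast_enough = latency_sli(window, threshold_ms=300)
```

The SLI is the measured fraction; the SLO is the agreed target that `meets_slo` checks it against. A real system would compute these from monitoring data over a much longer window.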

3. Eliminating Toil

Toil is not simply work that one doesn’t like to do. For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be repetitive, manual, automatable, tactical, and devoid of long-term value. Every time an operator needs to touch a system, such as when responding to a page or working a ticket, toil has likely occurred.

The Site Reliability Engineering (SRE) discipline aims to reduce toil by focusing on the “engineering” component of SRE. When SREs find tasks that can be automated, they work to engineer a solution that prevents that toil in the future. Google aims to ensure that at least 50% of each SRE’s time is spent on engineering projects, and SREs individually report their toil in quarterly surveys so that operationally overloaded teams can be identified. Having said that, toil is not always bad. Repetitive and predictable tasks are a great way to onboard a new team member and often produce an immediate sense of satisfaction and accomplishment with low risk and low stress.
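
The 50% guideline lends itself to a simple check. The sketch below, in Python, assumes survey responses arrive as pairs of (toil hours, engineering hours) per SRE, grouped by team; the data layout, team names, and function names are assumptions made for this example and do not describe Google’s actual survey.

```python
from typing import Dict, List, Tuple

# Each survey entry is (toil_hours, engineering_hours) for one SRE.
Survey = Dict[str, List[Tuple[float, float]]]


def toil_fraction(entries: List[Tuple[float, float]]) -> float:
    """Share of a team's reported time that went to toil."""
    toil = sum(t for t, _ in entries)
    total = sum(t + e for t, e in entries)
    if total == 0:
        raise ValueError("no hours reported")
    return toil / total


def overloaded_teams(survey: Survey, max_toil: float = 0.5) -> List[str]:
    """Teams whose toil share exceeds the limit, sorted by name."""
    return sorted(
        team for team, entries in survey.items()
        if entries and toil_fraction(entries) > max_toil
    )


# Made-up survey results for two hypothetical teams.
survey = {
    "storage": [(30, 10), (25, 15)],
    "frontend": [(10, 30), (15, 25)],
}
flagged = overloaded_teams(survey)
```

In this made-up data, the "storage" team reports more toil than engineering work and is flagged, while "frontend" stays under the limit.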

4. Monitoring Distributed Systems

Monitoring a complex application is quite hard. Even with substantial existing infrastructure for instrumentation, collection, display, and alerting, a Google SRE team with 10–12 members typically has one or two members whose primary task is to build and maintain monitoring systems for their service. This number has been decreasing as SRE generalizes and centralizes common monitoring infrastructure, but every SRE team typically has at least one “monitoring person.” To keep noise low and signal high, the elements of your monitoring system that direct to a pager need to be very simple. Rules that generate alerts for human beings should be simple to understand and represent a clear failure.
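
As an illustration of a paging rule that is simple to understand and represents a clear failure, consider the following minimal sketch in Python. It assumes error rates are sampled at regular intervals; the 5% threshold, the three-sample requirement, and the function name are hypothetical values chosen for the example.

```python
from typing import Sequence


def should_page(error_rates: Sequence[float],
                threshold: float = 0.05,
                consecutive: int = 3) -> bool:
    """Page only if the most recent `consecutive` samples all exceed the threshold.

    Requiring several bad samples in a row filters out one-off spikes,
    so that a page corresponds to a sustained, clearly visible failure.
    """
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    recent = error_rates[-consecutive:]
    return len(recent) == consecutive and all(r > threshold for r in recent)
```

For example, `should_page([0.01, 0.06, 0.07, 0.09])` pages because the last three samples are all above 5%, whereas a single spike followed by a healthy sample does not.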

5. The Automation

The primary job of an SRE is to work on automation that improves the system. In that sense, SREs try to automate themselves out of a job. For example, with enough automation the cluster can grow and more features can be introduced without having to grow the size of the team.

6. Release Engineering

Release engineers have a solid understanding of source code management, build configuration languages, compilers, automated build tools, and installers. Their skill set includes deep knowledge of multiple domains such as development, configuration management, testing, system administration, and customer support.

Running reliable services requires reliable release processes. Site Reliability Engineers need to know that the binaries and configurations they use are built in a reproducible, automated way so that releases are repeatable and aren’t unique snowflakes. Changes to any aspect of the release process should be intentional rather than accidental. SREs care about this process all the way from source code to deployment.

7. Simplicity

Software systems are inherently dynamic and unstable. A system can only be perfectly stable if it exists in a vacuum. If we stop changing the codebase, we stop introducing new bugs. If the underlying hardware or libraries never change, neither of these components will introduce bugs. If we freeze the current user base, we’ll never have to scale the system. In fact, a good summary of the SRE approach to managing systems is: “At the end of the day, our job is to keep agility and stability in balance in the system.”
