Site Reliability Engineering (SRE) is a discipline created by Google engineers that replaces the traditional approach to operations with something nimbler. It applies engineering expertise to operations and infrastructure problems, which allows for reliability at scale, quicker deployments, and a well-defined system environment. Here’s an overview of the SRE model and considerations for incorporating it into your development process.
The main objective of SRE teams is to develop highly reliable and scalable software applications or systems. They are accountable for the availability, performance, effectiveness, emergency response, and monitoring of their software. Google Site Reliability Engineers developed the following principles to help SRE teams fulfill their mission:
- Embrace risk
- Utilize Service Level Objectives
- Eliminate toil
- Monitor distributed systems
- Leverage automation and embrace simplicity.
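To make the first two principles a little more concrete: a Service Level Objective implies an "error budget", which is the amount of failure a service can absorb before it misses its target. The sketch below is only an illustration, not Google's implementation. The function names, the request-based way of measuring availability, and the sample numbers are all assumptions made for the example.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Return how many requests may fail in a window without breaching the SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    Both parameters are hypothetical inputs chosen for this illustration.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    # Round to the nearest whole request to avoid floating-point noise.
    return round(total_requests * (1.0 - slo_target))


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target (or an empty window) leaves no room for failure.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over one million requests, 250 failures so far.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A team might treat a shrinking budget as a signal to slow down risky releases, and a healthy one as room to ship faster. That is one way "embracing risk" can become a number rather than a feeling.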
How It Works
Since its inception, one of SRE’s main goals has been to use automation to create self-healing systems. Well-automated systems shrink the gap between the development team (those building things) and the operations team (those hosting and maintaining platforms).
Another key tenet of the SRE approach is that site reliability engineers write code themselves. This is a major change from the traditional operations approach, but it is key to making SRE work. Google relies on metrics to ensure that site reliability engineers spend enough time writing code to update and maintain their automated systems. For example, a site reliability engineer should spend no more than 50% of their time on traditional operations tasks, such as working tickets.
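The 50% guideline above can be tracked with a very small calculation. The following is a minimal sketch, assuming time is logged in hours per category. The category names, the `TOIL_CAP` constant, and the sample week are hypothetical and not taken from any real tracking tool.

```python
from __future__ import annotations

# The cap described above: at most 50% of time on traditional operations work.
TOIL_CAP = 0.5


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Return the share of logged hours that fall into operations ("toil") categories."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # Nothing logged, so nothing to report.
    toil = sum(hours for name, hours in hours_by_category.items() if name in toil_categories)
    return toil / total


def exceeds_toil_cap(hours_by_category: dict[str, float], toil_categories: set[str]) -> bool:
    """Return True when operations work takes more than the allowed share of time."""
    return toil_fraction(hours_by_category, toil_categories) > TOIL_CAP


if __name__ == "__main__":
    # Assumed example week for one engineer (category names are made up).
    week = {"tickets": 18.0, "on_call": 6.0, "coding": 16.0}
    toil = {"tickets", "on_call"}
    print(f"toil share: {toil_fraction(week, toil):.0%}")
    print("over cap" if exceeds_toil_cap(week, toil) else "within cap")
```

In this made-up week, 24 of 40 hours go to operations work, so the engineer is over the cap and the team would want to shift some of that load into automation.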
SREs who write code to create and maintain the platforms their software runs on tend to follow more DevOps best practices. They run code through CI/CD pipelines, execute tests against the changes, and have all of it peer reviewed.
Benefits of SRE
Incorporating aspects of software engineering into the operations and infrastructure functions has numerous benefits, the most notable being more consistent uptime and service resiliency. Other benefits SRE offers include:
- Filling the gap between developers and infrastructure
- Continuously monitoring and analyzing application performance
- Planning and maintaining operational runbooks
- Contributing to the overall product roadmap
- Managing on-call and emergency support
- Ensuring software has useful logging and diagnostics.
Is SRE a Good Fit For You?
There are two essential things to think about when evaluating if SRE is right for your organization.
- The platforms that you host and manage: Do you run a large system where you maintain your own internal platforms, or do you rely heavily on PaaS and SaaS? If you don’t have a large internal footprint, SRE may not be the best choice for you.
- The skillsets of the people who would fill these roles: There will be additional training needed, whether it’s developers learning more about the infrastructure side of the house, or traditional system admins adding development to their roles for the first time.
While there is certainly more to consider, these are a few of the main things to look at when evaluating if SRE would be a good fit for your business. If you have additional questions, we’re here to help!