5 resources to help you get started with SRE
Site reliability engineering (SRE) is an essential part of engineering at Google. It is a mindset as well as a set of practices, metrics, and prescriptive ways to ensure system reliability. But not everyone knows where to start when implementing SRE in their own organization. Here are our top resources at Google Cloud for getting started.
1. Do you have an SRE team yet? How to start and assess your journey
We’re often asked what implementing SRE means in practice, since our customers face challenges quantifying their success when setting up their own SRE practices. In this post, we share a couple of checklists to be used by members of an organization responsible for any high-reliability services. These will be useful when you’re trying to move your team toward an SRE model. Implementing this model at your organization can benefit both your services and teams due to higher service reliability, lower operational cost, and higher-value work for everyone on the team.
2. SRE fundamentals: SLIs, SLAs and SLOs
Core to the definition of SRE is the idea that metrics should be closely tied to business objectives. Thus, a big part of the day-to-day work of SREs is establishing and monitoring these service-level metrics. At Google, we use several essential measurements (SLO, SLA, and SLI) in SRE planning and practice. This post gives you an overview of what each of these acronyms stands for, what it means, and how to incorporate it.
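To make the relationship between these measurements concrete, here is a minimal sketch in Python. It is not taken from the linked post. It assumes the common conventions that an SLI is the ratio of good events to total events, that an SLO is a target for that ratio, and that the error budget is the share of events allowed to fail (1 minus the SLO). The names `SloReport` and `evaluate_slo` are hypothetical, chosen only for illustration. An SLA is a contractual commitment rather than a computed value, so it is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Result of comparing a measured SLI against an SLO target (illustrative type)."""
    sli: float                     # measured ratio of good events to total events
    slo_target: float              # target ratio, e.g. 0.999
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = fully spent, < 0 = overspent
    slo_met: bool


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compute an availability-style SLI and the remaining error budget.

    Assumption: the SLI is good_events / total_events over some measurement window.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events

    # The error budget is the number of bad events the SLO tolerates in this window.
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    remaining = 1 - actual_bad / allowed_bad

    return SloReport(
        sli=sli,
        slo_target=slo_target,
        error_budget_remaining=remaining,
        slo_met=sli >= slo_target,
    )


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
    print(evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999))
```

In this hypothetical window the service meets its 99.9% SLO and has spent roughly half of its error budget, which is the kind of signal a team might use when deciding how much release risk to take on.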
3. How SRE teams are organized, and how to get started
You know what SREs do and understand which best practices should be implemented at various levels of SRE maturity. Now you’re ready to take the next step by setting up your own SRE team. In this post, we’ll cover how different implementations of SRE teams establish boundaries to achieve their goals. We describe six different implementations that we’ve experienced, and what we have observed to be their most important pros and cons.
4. Meeting reliability challenges with SRE principles
Through years of work using SRE principles, we’ve found there are a few common challenges that teams face, and some important ways to meet or avoid those challenges. Learn what we at Google think are the three top sources of production stress and how we recommend addressing them.
5. Transitioning a typical engineering ops team into an SRE powerhouse
Perpetually adding engineers to ops teams to meet customer growth doesn’t scale. Google’s SRE principles can help, bringing software engineering solutions to operational problems. In this post, we’ll take a look at how we transformed our global network ops team by abandoning traditional network engineering orthodoxy and replacing it with SRE. You’ll learn how Google’s production networking team tackled this problem and consider how you might incorporate SRE principles in your own organization.
Lots more to read
Can’t wait to read more about SRE? We wrote an entire book on SRE to help you get started (actually, we’ve written more than one). You can also find all our DevOps and SRE blog content or follow our columns on Customer Reliability Engineering.