GCP – With SRE, failing to plan is planning to fail
People sometimes think that implementing Site Reliability Engineering (or DevOps for that matter) will magically make everything better. Just sprinkle a little bit of SRE fairy dust on your organization and your services will be more reliable and more profitable, and your IT, product and engineering teams will be happy.
It’s easy to see why people think this way. Some of the world’s most reliable and scalable services run with the help of an SRE team, Google being the prime example.
For almost two decades, I’ve lived and breathed running production systems at large scale. I had to think about tradeoffs, reliability, and costs, and implement a variety of architectures with different constraints and requirements, all while getting paged in the middle of the night. More recently, I’ve had the privilege to leverage that experience and knowledge to help Google Cloud customers modernize their infrastructure and applications, including implementing an SRE practice. While every implementation looks different from organization to organization, there are common lessons that will impact the success of your deployment.
When problems do arise, it’s usually not because of technical challenges. A stalled SRE culture is usually a business process failure—goals weren’t properly defined up front and stakeholders weren’t properly engaged. After watching this play out repeatedly, I’ve developed some advice for technology leaders about how to implement a successful SRE practice.
Before you start
Your SRE journey should start well before you read your first manual, or put in your first call to an SRE advisor. As a technology leader within your organization, your first job is to answer a few key questions and gather some basic facts.
What problem are you trying to solve?
Most organizations will readily admit they’re not perfect. Perhaps you need to reduce toil, be more innovative, or release software faster. SRE, as a framework for operating large scale systems reliably, can certainly help with those goals. To do that, it’s important to understand your motivations and what gaps or needs exist in your organization.
Ask yourself what the organization is trying to achieve from the transformation. What worries the organization about reliability? For SRE to be successful and efficient, it is crucial to start with the pain. Starting by identifying what you are trying to solve will not just help you solve it; it will help your organization be more focused, align the relevant parties to a common goal, and make it easier to gain decision-makers’ buy-in (and much more).
Once you understand the problem you are trying to solve, you need to know when you have “solved” it (i.e., how you will define success). Setting goals is critical; otherwise, how will you know if you have improved? We’ll discuss how to set up metrics to help in this self-evaluation in a later post.
Who are the key decision-makers in the organization?
Even though implementing SRE principles involves engineering at its core, it’s actually more of a transformation process than a technological challenge. As such, it will likely require procedural and cultural changes.
As with any business transformation, you need to identify the relevant decision-makers up front. Who those people are depends on the organization, but they usually include stakeholders from product, operations, and engineering leadership. These functions may go by different names in different organizations, and may even be split across multiple organizations. Identifying those decision-makers can be especially difficult in a siloed organization. It is important to take the time to reach out to different groups to identify the key stakeholders and influencers (it will save you a lot of time later on). Make sure that you are casting a wide enough net: it is important to get input from different groups with different requirements (e.g., security).
At the same time, try to be flexible. It’s okay if your list of decision-makers gets updated and fine-tuned during the process. As in other engineering domains, the goal is to start simple and iterate.
Get buy-in and build trust
Once you’ve identified the relevant decision-makers, make sure you have support from your colleagues and the rest of the organization’s leaders. Creating an empowered culture is critical for implementing the core principles of SRE: a learning culture that accepts failures, facilitates blamelessness, and creates psychological safety, all while prioritizing gradual changes and automation.
In my experience, you cannot drive real change in an organization without widespread support and buy-in from leadership and decision-makers, and that’s especially true for SRE. Implementing SRE, similar to DevOps, requires collaboration between different functions in the organization (product, operations and development). In most organizations, those functions fall under separate leadership chains, each with its own goals and processes. If you’re going to align those goals and processes, leadership needs to prioritize the change. At the same time, driving cultural change from the bottom up can be more challenging and take longer than top-down mandates, and in some cultures will be impossible. In short, leading by example and enabling the people in the organization are critical for driving change and fostering the ‘right’ culture.
Remember: it’s a marathon, not a sprint
The journey to SRE combines several challenges, from both technical and human (culture, process, etc.) perspectives, and those are intertwined. To be successful, leadership needs to prioritize organizational changes, allocating resources for engineering excellence (quality and reliability) and fostering cultural principles like reducing silos, blamelessness and accepting failure as normal.
Align expectations! All parties involved in an SRE implementation, from product and engineering to leadership, will need to recognize that change takes time and effort and, in the short term, resources. Daunting as it may be, SRE’s goal is to solve hard problems and build for a better tomorrow.
Interested in going deeper into SRE principles? Check out this Coursera course for leaders, Developing a Google SRE Culture. And stay tuned for my next post, where I outline some tactical considerations for teams that are early in their SRE journey, from identifying the right teams to start with to enablement and building community.