GCP – Meeting reliability challenges with SRE principles
You’ve built a beautiful, reliable service, and your users love it. But after the initial rush of launch wears off, it dawns on you that this service not only needs to be run, it needs to be run by you! At Google, we follow site reliability engineering (SRE) principles to keep services running and users happy. Through years of applying those principles, we’ve found a few common challenges that teams face, and some important ways to meet or avoid them. We’re sharing some of those tips here.
In our experience, the three big sources of production stress are:
- Toil
- Bad monitoring
- Immature incident handling procedures
Here’s more about each of those, and some ways to address them.
1. Avoid toil
Toil is any kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. This doesn’t mean toil has no business value; it does mean we have better ways to solve it than just manually addressing it every time.
Toil is pernicious. Without constant vigilance, it can grow out of control until your entire team is consumed by it. Like weeds in a garden, there will always be some amount of toil, but your team should regularly assess how much is acceptable and actively manage it. Project planners need to make room for “toil-killer” projects on an ongoing basis.
Some examples of toil are:
- Ticket spam: an abundance of tickets that may or may not need action, but that need human eyes to triage (e.g., notifications about running out of quota).
- A service change request that requires a code change to be checked in. This is fine if you have five customers; if you have 100 customers, manually creating a code change for each request becomes toil.
- Manually applying small production changes (e.g., changing a command line, pushing a config, clicking a button) in response to varying service conditions. This is fine if it’s required only once a month, but becomes toil if it needs to happen daily.
- Regular customer questions on the same handful of topics. Could better documentation or self-service dashboards help?
This doesn’t mean that every non-coding task is toil. For example, non-toil things include debugging a complex on-call issue that reveals a previously unknown bug, or consulting with large, important customers about their unique service requirements. Remember, toil is repetitive work that is devoid of enduring value.
How do you know which toilsome activities to target first? A rule of thumb is to prioritize those that scale unmanageably with the service. For example:
- I need to do X more frequently as my service gains features.
- Y happens more often as the service grows.
- The number of pages scales with the service’s resource footprint.
And in general, prioritize automation of frequently occurring toil over complex toil.
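To make that prioritization concrete, here is a minimal sketch (the toil sources, frequencies, and growth factors below are entirely hypothetical) of ranking toil by how much engineer time it will consume as the service scales, so the worst offenders get automated first:

```python
from dataclasses import dataclass

@dataclass
class ToilSource:
    name: str
    events_per_week: float    # how often the task occurs today
    minutes_per_event: float  # hands-on time per occurrence
    growth_factor: float      # rough estimate of how much this scales as the service grows

    def weekly_cost(self) -> float:
        """Engineer-minutes spent on this toil source per week today."""
        return self.events_per_week * self.minutes_per_event

    def projected_cost(self) -> float:
        """Projected weekly cost once the service grows; used to rank automation targets."""
        return self.weekly_cost() * self.growth_factor

# Hypothetical toil inventory, for illustration only.
sources = [
    ToilSource("quota ticket triage", events_per_week=40, minutes_per_event=5, growth_factor=3.0),
    ToilSource("per-customer config change", events_per_week=10, minutes_per_event=15, growth_factor=5.0),
    ToilSource("monthly cert rotation", events_per_week=0.25, minutes_per_event=30, growth_factor=1.0),
]

# Automate the toil that will cost the most as the service scales.
for s in sorted(sources, key=lambda s: s.projected_cost(), reverse=True):
    print(f"{s.name}: ~{s.weekly_cost():.0f} min/week now, ~{s.projected_cost():.0f} min/week projected")
```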
2. Eliminate bad monitoring
All good monitoring is alike; every instance of bad monitoring is bad in its own way. Setting up monitoring that works well can help you get ahead of problems and solve issues faster. Good monitoring alerts on actionable problems. Bad monitoring is often toilsome, and some of the ways it can go awry are:
- Unactionable alerts (i.e., spam)
- High pager or ticket volume
- Customers asking for the same thing repeatedly
- Impenetrable, cluttered dashboards
- Service-level indicators (SLIs) or service-level objectives (SLOs) that don’t actually reflect customers’ suffering. For example, users might complain that login fails while your SLO dashboard incorrectly shows that everything is working as intended. Your service shouldn’t rely on customer complaints to reveal when things are broken. (See the sketch after this list.)
- Poor documentation and useless playbooks.
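To illustrate the SLI/SLO point above, here is a minimal sketch (the request counts and the 99.9% target are made up) of computing an availability SLI from observed traffic and checking it against an SLO and its error budget, instead of waiting for customer complaints:

```python
# Hypothetical request counters, e.g. pulled from your monitoring system for the SLO window.
total_requests = 1_000_000
good_requests = 999_100  # requests that met this SLI's definition of "good"

SLO_TARGET = 0.999  # 99.9% availability objective (hypothetical)

# SLI: the fraction of requests that were good.
sli = good_requests / total_requests

# Error budget: how many bad requests the SLO allows, and how much of that budget is left.
allowed_bad = (1 - SLO_TARGET) * total_requests
actual_bad = total_requests - good_requests
budget_remaining = 1 - actual_bad / allowed_bad

print(f"SLI: {sli:.4%}  (SLO: {SLO_TARGET:.1%})")
if sli < SLO_TARGET:
    print("SLO violated: page a human; this is actionable.")
elif budget_remaining < 0.2:
    print(f"Only {budget_remaining:.0%} of the error budget remains: slow down risky changes.")
else:
    print("Within SLO: no page needed.")
```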
Discover sources of toil related to bad monitoring by:
- Keeping all tickets in the same place
- Tracking ticket resolution
- Identifying common sources of notifications and requests (see the sketch below)
- Ensuring operational load does not exceed 50%, as prescribed in the SRE Book
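One way to put that list into practice: keep every ticket in one place, then count which sources generate the most of them and how many stay unresolved. A minimal sketch, assuming tickets can be exported as simple records with a source field (a hypothetical format, not any particular ticketing system’s API):

```python
from collections import Counter

# Hypothetical ticket export; in practice this would come from your ticketing system.
tickets = [
    {"id": 101, "source": "quota-notifier", "resolved": True},
    {"id": 102, "source": "quota-notifier", "resolved": False},
    {"id": 103, "source": "customer-question", "resolved": True},
    {"id": 104, "source": "quota-notifier", "resolved": True},
    {"id": 105, "source": "disk-full-alert", "resolved": False},
]

# The noisiest sources are the best candidates for automation, better alerting, or better docs.
by_source = Counter(t["source"] for t in tickets)
for source, count in by_source.most_common():
    print(f"{source}: {count} tickets")

# Track resolution too: a growing pile of unresolved tickets is a sign of operational overload.
unresolved = sum(1 for t in tickets if not t["resolved"])
print(f"{unresolved} of {len(tickets)} tickets still open")
```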
3. Establish healthy incident management
No matter what service you’ve built, it’s only a matter of time before it suffers a severe outage. Before that happens, it’s important to establish good practices to lessen the confusion in the heat of outage handling. Here are some steps to follow so you’re in good shape ahead of an outage.
Practice incident management principles
Incident management teaches you how to organize an emergency response by establishing a hierarchical structure with clear roles, tasks, and communication channels. It establishes a standard, consistent way to handle emergencies and organize an effective response.
Make humans findable
In an urgent situation, the last thing you want is to scramble around trying to find the right human to talk to. Help yourselves by doing the following:
- Create your own team-specific urgent-situation mailing list. It should include all tech leads and managers, and perhaps all engineers if that makes sense.
- Write a short document that lists subject matter experts who can be reached in an emergency. This makes it easier and faster to find the right humans for troubleshooting.
- Make it easy to find out who is on-call for a given service, whether by maintaining an up-to-date document or by writing a simple tool (see the sketch after this list).
- At Google, we have a team of senior SREs called the Incident Response Team (IRT), who are called in to help coordinate, mitigate, and/or resolve major service outages. Establishing such a team is optional, but it may prove useful if you have outages spanning multiple services.
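For the “simple tool” mentioned in the list above, even something this small helps. The sketch below assumes a hand-maintained weekly rotation; the service names and people are placeholders, and real teams would usually pull this data from their paging or calendar system:

```python
from datetime import date
from typing import Optional

# Hypothetical on-call rotations: service -> ordered list of people, rotating weekly.
ROTATIONS = {
    "frontend": ["alice", "bob", "carol"],
    "storage": ["dana", "eve"],
}

def oncall_for(service: str, today: Optional[date] = None) -> str:
    """Return who is on-call for `service` during the current ISO week."""
    today = today or date.today()
    rotation = ROTATIONS[service]
    week = today.isocalendar()[1]
    return rotation[week % len(rotation)]

if __name__ == "__main__":
    for service in ROTATIONS:
        print(f"{service}: {oncall_for(service)} is on-call this week")
```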
Establish communication channels
One of the first things to do when investigating an outage is to establish communication channels in your team’s incident handling procedures. Some recommendations are:
- Agree on a single messaging platform, whether Internet Relay Chat, Google Chat, Slack, or something else.
- Start a shared document where collaborators can take notes during outage diagnosis. This document will be useful later for the postmortem. Limit permissions on it to prevent leaking personally identifiable information (PII).
- Remember that PII doesn’t belong in the messaging platform, in alert text, or in company-wide accessible notes. If you need to share PII during outage troubleshooting, restrict permissions by using your bug tracking system, Google Docs, etc.
Establish escalation paths
It’s 2am. You’re jolted awake by a page. Rubbing the sleep from your eyes, you fumble around the dizzying array of multi-colored dashboards, and realize you need advice. What do you do?
Don’t be afraid to escalate! It’s OK to ask for help. It’s not good to sit on a problem until it gets even worse—well-functioning teams rally around and support each other.
Your team will need to define its own escalation path. Here is an example of what it might look like:
- If you are not the on-call, find your service’s on-call person.
- If the on-call is unresponsive or needs help, find your team lead (TL) or manager. If you are the TL or manager, make sure your team knows it’s OK to contact you outside of business hours for emergencies (unless you have good reasons not to).
- If a dependency is failing, find that team’s on-call person.
- If you need more help, page your service’s panic list.
- (Optional) If people within your team can’t figure out what’s wrong, or you need help coordinating with multiple teams, page the IRT if you have one.
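If it helps, the escalation path can also be written down as data, so that documentation, chat bots, and paging tools all agree on the order. A minimal sketch with placeholder contacts (not a prescribed format):

```python
# Hypothetical escalation path, in the order it should be tried; contacts are placeholders.
ESCALATION_PATH = [
    ("service on-call", "my-service-oncall@example.com"),
    ("team lead / manager", "my-service-leads@example.com"),
    ("dependency on-call", "dependency-oncall@example.com"),
    ("service panic list", "my-service-panic@example.com"),
    ("incident response team (if you have one)", "irt@example.com"),
]

def next_contact(step: int) -> str:
    """Return who to contact at a given escalation step (0-based), capped at the last entry."""
    role, contact = ESCALATION_PATH[min(step, len(ESCALATION_PATH) - 1)]
    return f"{role}: {contact}"

# Start with the service on-call and work down the list as needed.
print(next_contact(0))
```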
Write blameless postmortems
After an issue has been resolved, a postmortem is essential. Establish a postmortem review process so that your team can learn from past mistakes together, ask questions, and keep each other honest that follow-up items are addressed appropriately.
The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well-understood, and that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.
All postmortems at Google are blameless postmortems. A blameless postmortem assumes that everyone involved had good intentions and responded to the best of their ability with the information they had. This means the postmortem focuses on identifying the causes of the incident without pointing fingers at any individual or team for bad or inappropriate behavior.
Recognize your helpers
It takes a village to run a production service reliably, and SRE is a team effort. Every time you’re tempted to write “thank you very much for doing X” in a private chat, consider writing the same text in an email and CCing that person’s manager. It takes the same amount of time for you and brings the added benefit of giving your helper something they can point to and be proud of.
May your queries flow and the pager be silent! Learn more in the SRE Book and the SRE Workbook.
Thanks to additional contributions from Chris Heiser and Shylaja Nukala.