SRE at Google: Our complete list of CRE life lessons
In 2016 we announced a new discipline at Google, Customer Reliability Engineering, an offshoot of Site Reliability Engineering (SRE). Our goal with CRE was (and still is) to create a shared operational fate between Google and our Google Cloud customers, to give you more control over the critical applications you’re entrusting to us. Since then, here on the Google Cloud blog, we’ve published a wealth of resources to help you take the best practices we’ve learned from SRE teams at Google and apply them in your own environments.
Below, in one convenient location, is the complete list of CRE life lessons posts we’ve published over the past five years.
Common pitfalls
Know thy enemy: How to prioritize and communicate risks
How to avoid a self-inflicted DDoS Attack
Using load shedding to survive a success disaster
Service-level metrics
Available . . . or not? That is the question
Consequences of SLO violations
Applying the escalation policy
Defining SLOs for services with dependencies
Learning—and teaching—the art of service-level objectives
Using deemed SLIs to measure customer reliability
Releases
Reliable releases and rollbacks
How release canaries can save your bacon
SRE support
Why should your app get SRE support?
How SREs find the landmines in a service
Making the most of an SRE service takeover
Dark launches
What is a dark launch, and what does it do for me?
The practicalities of dark launching
Postmortems
Getting the most out of shared postmortems
Error budgets
Good housekeeping for error budgets
Understanding error budget overspend
Production incidents
Shrinking the impact of production incidents using SRE principles
Shrinking the time to mitigate production incidents
We still have plenty more articles to come, so keep your eye on our DevOps & SRE channel. You can also check out sre.google or read our SRE books online.