Loon’s Production Engineering/SRE team was founded by Google SRE alumni, so it is no surprise that the team instituted a culture of blameless postmortems that became a key feature of Loon’s approach to incident response. Blameless postmortems originated as an aerospace practice in the mid-20th century, so it was particularly fitting that they came full circle to be used at a company that melded cutting-edge aerospace work with the development of a communications platform and the world’s first stratospheric temporospatial software-defined network. The use of postmortems became a standardizing factor across Loon’s teams, from avionics and manufacturing, to flight operations, to software platforms and network service. This blog post discusses how Loon moved from a heterogeneous approach to postmortems to a standardized practice shared across the organization, a shift that helped the company move from R&D to commercial service in 2020.
Background
Postmortems
Many industries have adopted postmortems; they are fairly common in high-risk fields where mistakes can be fatal or extremely expensive. Postmortems are also widespread in industries and projects where bad processes or assumptions can incur expensive project development costs and avoiding repeat mistakes is a priority. Individual industries and organizations often develop their own postmortem standards or templates so that postmortems are easier to create and digest across teams.
Blameless postmortems likely originated in the healthcare and aerospace industries in the mid-20th century. Because of the high cost of failure, these industries needed to create a culture of transparency and continuous improvement that could only come from openly discussing failure. As the original SRE book states, blameless postmortems are key to “an environment where every ‘mistake’ is seen as an opportunity to strengthen the system.”
The goal of a postmortem is to document an incident or event in order to foster learning from it, both among the affected teams and beyond. The postmortem usually includes a timeline of what happened, the solutions implemented, the incident’s impact, the investigation into root causes, and changes or follow-ups to stop it from happening again. To facilitate learning, SRE’s postmortem format includes both what went well (acknowledging the successes that should be maintained and expanded) and what went poorly and needs to be changed. In this way, postmortem action items are key to prioritizing work that ensures the same failures don’t happen again.
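The sections listed above can be pictured as a simple record. The following is a minimal sketch in Python, not Loon’s or Google’s actual template: the class names, field names, helper methods, and the example incident are all hypothetical and chosen only to mirror the sections named in the previous paragraph.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    when: datetime
    what: str


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    # Hypothetical field names mirroring the sections described in the text.
    title: str
    impact: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    solutions: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def open_action_items(self) -> list[ActionItem]:
        """Return the follow-ups that have not been completed yet."""
        return [item for item in self.action_items if not item.done]

    def missing_sections(self) -> list[str]:
        """Return the names of list sections that are still empty."""
        sections = {
            "timeline": self.timeline,
            "solutions": self.solutions,
            "root_causes": self.root_causes,
            "went_well": self.went_well,
            "went_poorly": self.went_poorly,
            "action_items": self.action_items,
        }
        return [name for name, content in sections.items() if not content]


# Illustrative data only; this incident is invented for the example.
pm = Postmortem(
    title="Example: 20-minute service disruption",
    impact="Users in the service region lost connectivity for 20 minutes.",
)
pm.timeline.append(TimelineEntry(datetime(2020, 1, 1, 12, 0), "Alert fired"))
pm.root_causes.append("Hypothetical: antenna pointing error")
pm.action_items.append(ActionItem("Add pointing-error alert", owner="prod-team"))
pm.action_items.append(ActionItem("Update runbook", owner="flight-ops", done=True))
```

A draft like this makes gaps visible before review: `pm.missing_sections()` lists the sections nobody has filled in, and `pm.open_action_items()` lists the follow-ups that still need an owner’s attention.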
Loon
Loon aimed to supply internet access to unserved and underserved populations around the world by providing connectivity via stratospheric balloons. These high-altitude “flying cell towers” covered a much wider footprint than a terrestrial tower, and could be deployed (and repositioned) into the most remote corners of the earth without expensive overland transportation and installation. As the first company to attempt anything like this, Loon dealt with a number of systems that were complex, challenging, or novel: superpressure balloons designed to stay aloft for hundreds of days, wind-dependent steering, a software-defined network consisting of constantly moving nodes, and extremes of temperature and weather at 20 km above Earth’s surface.
Prod Team
The initial high-risk operations of Loon’s mission were avionic: could we launch and steer balloons carrying a networking payload long enough to reach and serve the targeted region? As such, the earliest failure reports within Loon (which weren’t officially called “postmortems” at the time) mostly involved balloon construction or flight, and drew on the experience of team members who had worked in the Avionics, Reliability Engineering, and/or Flight Safety fields. As Loon’s systems evolved and matured, they started to require operational reliability, as well. Just before graduating from a purely R&D project in Google’s “moonshot factory” incubator X to a company with commercial goals, Loon started building a Site Reliability Engineering (SRE) team known internally as Prod Team.
In order to effectively offer internet connectivity to users, Loon had to solve network serving failures with the same rigor as hardware failures. Prod Team took the lead on a number of practices to improve network reliability. The Prod Team had three primary goals:
Ensure that the fleet’s automation, management, and safety-critical systems were built and operated to meet the high safety bar of the aviation industry.
Lead the integration of the communications services (e.g., LTE) end to end.
Own the mission of fielding and providing a reliable commercial service (Loon Library) in the real world.
Postmortems at Loon
The Early Days
Postmortems were one tool for reaching Prod Team’s (SRE’s) goals. Prod Team often interacted with SREs in other infrastructure support teams that the Loon service connected to, such as the team developing the Evolved Packet Core (EPC), our telco partner counterparts, and teams that handle edge network connectivity. Postmortems provided a common tool for sharing incident information across all these teams, and could even span multiple companies when upstream problems impacted customers.
At Loon, postmortems served the following goals:
Document and transcribe the events, actions, and remedies related to an incident.
Provide a feedback loop to rectify problems.
Indicate where to build better safeguards and alerts.
Break down silos between teams in order to facilitate cross-functional knowledge sharing and accelerate development.
Identify macro themes and blind spots over the longer term.
The combination of aerospace and high tech brought two strong practices of writing postmortems, but also the challenge of how to own, investigate, or follow up on problems that crossed those boundaries, or when it wasn’t clear where the system fault lay.
Loon’s teams across hardware, software, and operations orgs used postmortems, as was standard practice in their fields for incident response. The Flight Operations Team, which handled the day-to-day operations of steering launched balloons, captured in-flight issues in a tracking system. The tracking system was part of the anomaly resolution system devised to identify and resolve root cause problems. Seeking to complement the anomaly resolution system, the Flight Operations Team incorporated the SRE software team’s postmortem format for incidents that needed further investigation: for example, failure to avoid a storm system, deviations from the simulated (expected) flight path that led to an incident, and flight operator actions that directly or indirectly caused an incident. Given that most incidents spanned multiple teams (e.g., when automation failed to catch an incorrect command sent by a flight operator, which resulted in a hardware failure), utilizing a consistent postmortem format across teams simplified collaboration.
The Aviation and Systems Safety Team, which focused on safety related to the flight system and flight process, also brought their own tradition and best practices of postmortems. Their motto, “Own our Safety”, brought a commitment to continually improving safety performance and building a positive safety culture across the company. This was one of the strengths of Loon’s culture: all the organizations were aligned not just on our audacious vision to “connect people everywhere”, but also on doing so safely and effectively. However, because industry standards for postmortems and how to handle different types of problems varied across teams, there was some divergence in process. We proactively encouraged teams to share postmortems between teams, between orgs, and across the company so that anyone could provide feedback and insight into an incident. In that way, anyone at Loon could contribute to a postmortem, see how an incident was handled, and learn about the breadth of challenges that Loon was solving.
Challenges
While everyone agreed that postmortems were an important practice, in a fast-moving start-up culture, it was a struggle to comprehensively follow through on action items. This probably comes as no surprise to developers in similar environments: when the platform or services that require investment are rapidly changing or being replaced, it’s hard to spend resources on not repeating the same mistakes. Ideally, we would have prioritized postmortems that focused on best practices and learnings that were applicable to multiple generations of the platform, but those weren’t easy to identify at the time of each incident.
Even though the company was not especially large, the novelty of Loon’s platform and interconnectedness of its operations made it difficult to determine which team was responsible for writing a postmortem and investigating root causes. For example, a 20-minute service disruption on the ground might be caused by a loss of connectivity from the balloon to the backhaul network, a pointing error with the antennae on the payload, insufficient battery levels, or wind that temporarily blew the balloon out of range. Actual causes could be quite nuanced, and often were attributable to interactions between multiple sub-systems. Thus, we had a chicken-and-egg problem: which team should start the postmortem and investigation, and when should they hand off the postmortem to the teams that likely owned the faulty system or process? Not all teams had a culture of postmortems, so the process could stall depending on the system where the root cause originated. For that reason, Loon’s Prod Team/SREs advocated for a company-wide blameless postmortem culture.
Much of how Loon used postmortems, especially in software development and Prod Team, was in line with SRE industry standards. In the early days of Loon, however, there were no service level objectives or agreements (SLO/As). As Loon was an R&D project, we wrote postmortems when a test network failed to boot after launch, or when performance didn’t meet the team’s predictions, rather than for “service outages”. Later on, when Loon supplied commercial service in disaster relief areas in Peru and Kenya, the Prod Team could more clearly identify the types of user-facing incidents that required postmortems due to failure to meet SLAs.
Improving and Standardizing Loon’s Postmortem Processes
Moving Loon from an R&D model to the model of reliability and safety necessary for a commercial offering required more than simply performing postmortems. Sharing the postmortems openly and widely across Loon was critical to building a culture of continuous improvement and addressing root causes.
To increase cross-team awareness of incidents, in 2019 we instituted a Postmortem Working Group. In addition to reading and discussing recent postmortems from across the company, the goals of the working group were to make it easier to write postmortems, promote the practice of writing postmortems, increase sharing across teams, and discuss the findings of these incidents in order to learn the patterns of failure. Its founding goal was to “Cultivate a postmortem culture in Loon to encourage thoughtful risk taking, to take advantage of mistakes, and to provide structure to support improvement over time.” While the volume of postmortems could ebb and flow across weeks and months, over multiple years of commercial service we expected to be able to identify macro-trends that needed to be addressed with the cooperation of multiple teams.
In addition to the Postmortem Working Group, we also created a postmortem mailing list and a repository of all postmortems, and presented a “Lunch & Learn” on blameless postmortems (see example slide below). Prod Team and several other teams’ meetings had a standing agenda item to review postmortems of interest from across the company, and we sent a semi-annual email celebrating Loon’s “best-of” recent incidents: the most interesting or educational outages.