GCP – Preparing for peak holiday shopping in 2020: War rooms go virtual
As retail gets ready for an unprecedented holiday 2020, it’s not just shoppers who are looking to go digital-first. Retail war rooms – traditionally a single large room where core IT and business teams gather together to ensure systems keep running, websites don’t crash and items stay in stock – are also looking different due to COVID-19.
For many retailers, holiday war rooms in 2020 are going to be scattered across multiple living rooms, couches, garages, and kitchens as people work remotely. Managing this type of large-scale, high-visibility event 100% virtually (remotely, digitally, it all means the same thing) is a first for many.
This year many of our customers have elected to use our Black Friday/Cyber Monday (BFCM) white glove service. We are working with leading retailers such as Macy’s, The Home Depot, and Tokopedia as they take their war rooms 100% virtual.
Implementing virtual war rooms has been a crucial part of our ability to respond to increased site traffic and sales, allowing us to respond quickly and keep our more than 100 million monthly active users happy.
The good news is this has been done before. Google has been running virtual war rooms for many years when conducting large product launches, incident responses, and our own Black Friday/Cyber Monday activities. We’ve created guidance for preparing, running, and evaluating a virtual war room for Black Friday/Cyber Monday and an extended holiday season. We want customers’ response to such an important peak event to be as fast and efficient as it has been in past years. These best practices will help teams navigate this season’s uncertainty in consumer behavior, and the corresponding system demands, to provide continuous uptime and an exceptional customer experience.
Step 1: Preparing for the event
Gather important information
Start preparing to manage what is often the largest and most important event of the year for your business by ensuring that all information that may be necessary during the event is easily available, clearly documented, and quickly accessible by all members of the war room. Remember that any communications may incur delays – it won’t be possible to simply walk over to your teammate’s desk and ask them a question.
Communication
First, determine the exact communication tools and approaches you will use both during normal event management and when you shift to emergency or incident response. Specify both group- and team-wide communication expectations (e.g. chat channels, conference bridges) and how folks will be able to communicate one-on-one should direct escalation or clarification be necessary. These expectations should be as clear and straightforward as possible so that there is no confusion, especially if you have to manage an incident during an already stressful time. Consider backup plans for each – what will you do if your selected chat platform experiences an outage, for example?
One specific recommendation is to standardize date/time formats in all communication – this is especially important if your team is distributed across multiple time zones. Communication should be as unambiguous as possible, and having to clarify that you were referring to your local time zone rather than the next oncaller’s when describing an event you’re handing over adds confusion and possible delays in response.
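As a minimal sketch of what a standardized format could look like, the snippet below converts a timezone-aware local time into a single UTC string before it is shared. The choice of UTC with an ISO 8601 layout, and the helper name `to_war_room_time`, are assumptions for illustration; any format works as long as the whole team agrees on it.

```python
from datetime import datetime, timedelta, timezone


def to_war_room_time(local_dt: datetime) -> str:
    """Render a timezone-aware datetime as an ISO 8601 string in UTC."""
    if local_dt.tzinfo is None:
        # A naive datetime is exactly the ambiguity we are trying to avoid.
        raise ValueError("naive datetime: attach a time zone before sharing")
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical example: an on-caller in UTC-8 notes an event at 21:30 local time.
pacific = timezone(timedelta(hours=-8))
reported = datetime(2020, 11, 27, 21, 30, tzinfo=pacific)
print(to_war_room_time(reported))  # 2020-11-28T05:30:00Z
```

Rejecting naive datetimes outright is a deliberate choice here: it forces whoever writes the timestamp to state their time zone once, rather than leaving the next on-caller to guess.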
Another critical component of communication is enabling people to get the information they need without having to ask others. To that end, consider using or creating a dedicated status page or Google Group that shows the overall health of the systems involved in the event and links to additional details, such as relevant monitoring and/or logging consoles. The objective is to allow those who need to know what’s happening to get that information at a glance, without additional communication. A key recommendation is to designate a specific and known owner of this page who is responsible for updating it on a predetermined schedule.
Expectations
Next, ensure that there is a clear definition of staffing, roles, and expectations that includes both normal and emergency contact methods. Create a list of team members who will be involved in the event and how they may be reached directly (typically on their mobile phone or via pager) should the need arise. If you’ll be using a rotation system, document it clearly and create a prescriptive plan for how hand-offs of both normal operations and escalations will be handled during the event. In either case, be clear about each team’s or individual’s role in the event and about when the emergency method of contact should be used versus the normal one. It will be very helpful to create an explicit “chain of escalation” document, if you don’t have one already. This way, the right level of attention is directed at a problem should one arise, and people don’t experience overload and burnout during an event that will likely demand their attention over a prolonged period of time.
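One way to make a “chain of escalation” unambiguous is to write it down as data. The sketch below is illustrative only: the roles, contact strings, and wait times are hypothetical placeholders, and `who_to_page` is an assumed helper, not part of any paging product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    role: str
    normal_contact: str     # used for routine questions, e.g. a chat handle
    emergency_contact: str  # used only for pages, e.g. a pager or mobile number
    wait_minutes: int       # how long to wait for an acknowledgement before moving on


# Hypothetical chain; replace with your own roles, contacts, and timings.
CHAIN = [
    EscalationStep("primary on-call", "chat:@primary", "pager:primary", 5),
    EscalationStep("secondary on-call", "chat:@secondary", "pager:secondary", 10),
    EscalationStep("decision maker", "chat:@lead", "phone:lead", 15),
]


def who_to_page(chain, minutes_unacknowledged):
    """Return the step that should be paged after this many unacknowledged minutes."""
    elapsed = 0
    for step in chain:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step
    # Past the end of the chain: keep paging the last step.
    return chain[-1]


print(who_to_page(CHAIN, 0).role)   # primary on-call
print(who_to_page(CHAIN, 7).role)   # secondary on-call
print(who_to_page(CHAIN, 40).role)  # decision maker
```

Keeping the normal and emergency contact side by side for each role mirrors the distinction made above about when each method should be used.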
This is also a great time to create an expected timeline for the event. As clearly as possible, document when the event will start, what activities will take place during the event itself, and when the event will end.
Finally, consider creating a plan for handling common outage modes you may experience. Ensure your monitoring is ready to detect them and that you have a plan to respond. For example, confirm that the right people are available (e.g. what if you need to spend money quickly to bring up more capacity?) and ready to approve such decisions quickly if needed.
Engagement
In the past, you may have run these events in a dedicated physical space and possibly provided food, entertainment, or other means to keep the team engaged. How will you continue to keep people engaged during their shifts in a virtual environment? Think about sending the team gift cards or treat baskets as a surprise to boost morale when going through this experience virtually.
Do A Test Run
The best way to ensure preparedness is to run through simulations that will let you see how your virtual processes work under pressure. This will help you gauge their effectiveness in solving a situation when problems arise and allow you to handle anything that may come your way.
To prepare for such an exercise, determine the exact scope of what you’d like to test and accomplish. If you’re looking to specifically exercise those aspects of your war room that have changed to virtual, you’re likely going to focus on how information is exchanged in a distributed team. Consider testing your communication tools – both primary and secondary – by using them for normal communication and escalation situations. This should help you determine whether the team has the tools configured appropriately and easily available to them, whether there are any usability or accessibility issues you need to address prior to the event, and whether your expectations of how communication takes place during the event are clear.
Consider running an exercise to validate your timeline of events – both under “normal” operating conditions and during an incident, emergency, or escalation. The former focuses on confirming that the timeline you have created is realistic, that your expectations are clear and well understood, and that the team is able to act on their assigned responsibilities. The latter can be thought of as a Wheel of Misfortune tabletop exercise, where your objective is to practice your incident management and response techniques.
Finally, you may choose to prepare for the event by running a “live” test – either using a DiRT-style or chaos engineering approach and introducing actual failures into your production systems or by running a large-scale load test against a non-production environment. In either case, you will want to treat the test as practice for the actual event and use all of the information you’ve collected in the previous section to respond.
Post Mortem of Preparations and Tests
After preparations and testing have finished, evaluate what went well, what can be improved, and how you can strengthen the war room process itself. This is important to ensure your ability to adapt and keep the event running under any circumstances. However, do not simply focus on those things you need to do to prepare for this year’s event – also try to capture what you can improve long term to be in a better position for future events.
Use the learnings from the tests to improve your plan and address any issues you discover as quickly as possible. Prioritize action items from the post mortem in your engineering work planning leading up to the event, paying special attention to issues of communication and information flow, as those can have a critical impact on the ability of your team to manage this event remotely.
Step 2: During the event
With preparations now complete, it is time for the big event. Thanks to the extensive planning that has already happened, the goal is for things to go smoothly. However, it is important to remember the three areas where remote collaboration changes how a virtual war room works: communication, activity logging, and escalation management.
Communication
The importance of communication during a virtual war room cannot be overstated. A disciplined approach to preparation and following established rules may mean a difference of hours in outage resolution.
Throughout the entire event make sure to have a single chat room that is at the core of your communication strategy. Be prepared, should an actual outage occur, to start additional chat rooms focused on specific issues. For example you might find that a dedicated chat room for the technical team is of great value.
Appoint a single person to be the communications lead. Google’s incident management training mandates that a communications lead is appointed during large outages. This is the person everyone goes to with questions and who provides all outgoing updates, allowing the rest of the team to focus on their specific roles. As stated previously, the communications lead may wish to keep a single Current Status of Event page updated so that anyone can know, at a glance, what’s happening.
Finally, be especially vigilant about transferring information during shift handovers. With an up-to-date status page and logs, this may be trivial. However, always get an explicit acknowledgement from the party taking over the shift, especially when transferring roles like the communications lead and decision maker. The contact list created during the preparation phase should include every team member due to come on call during the virtual war room. Teams handing over should be prepared to perform handover duties, which could include telling war room members in the chat who is about to come on call and whom they replace.
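The “explicit acknowledgement” rule can be expressed as a small state check: a role does not change owner until the incoming person confirms. This is a sketch under assumed names (`RoleHandover`, the people, and the notes are hypothetical), meant to illustrate the rule rather than prescribe a tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoleHandover:
    """One role changing hands at a shift boundary."""
    role: str
    outgoing: str
    incoming: str
    notes: str
    acknowledged_by: Optional[str] = None

    def acknowledge(self, person: str) -> None:
        # Only the incoming person can accept the role.
        if person != self.incoming:
            raise ValueError(f"only {self.incoming} can acknowledge this handover")
        self.acknowledged_by = person

    @property
    def current_owner(self) -> str:
        # The outgoing person keeps the role until the handover is acknowledged.
        return self.incoming if self.acknowledged_by else self.outgoing

    def announcement(self) -> str:
        # Message to post in the main war room chat.
        return f"{self.incoming} is coming on call as {self.role}, replacing {self.outgoing}."


handover = RoleHandover(
    role="communications lead",
    outgoing="alex",
    incoming="sam",
    notes="Status page updated; no open incidents.",
)
print(handover.announcement())
print(handover.current_owner)  # alex, because sam has not acknowledged yet
handover.acknowledge("sam")
print(handover.current_owner)  # sam
```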
Logging
To make it easy to reconstruct what happened when you later write a retrospective or post mortem, try to keep a log of everything that happens during the event. Make sure your chat rooms have history turned on. Nominate dedicated note takers, but encourage everyone to keep a log of actions taken and events they’ve noticed. (Google Forms can be an easy solution here. Set up the simplest possible form with a single text field, and make sure it records the timestamp. Encourage everyone to enter information; you can deduplicate later.)
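The “deduplicate later” step might look like the sketch below, which merges timestamped entries from several people into one timeline. The `LogEntry` shape and the sample entries are hypothetical, and the matching rule (identical text after normalizing case and whitespace) is a deliberately naive assumption; real notes will usually need a human pass as well.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    author: str
    text: str


def build_timeline(entries):
    """Sort entries by time and keep the earliest copy of each duplicate note."""
    seen = set()
    timeline = []
    for entry in sorted(entries, key=lambda e: e.timestamp):
        # Normalize case and whitespace so trivially different copies match.
        key = " ".join(entry.text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        timeline.append(entry)
    return timeline


def utc(hour, minute):
    # Hypothetical event day, used only for the sample data below.
    return datetime(2020, 11, 27, hour, minute, tzinfo=timezone.utc)


entries = [
    LogEntry(utc(14, 5), "sam", "Checkout latency alert fired"),
    LogEntry(utc(14, 2), "alex", "Traffic at 2x baseline"),
    LogEntry(utc(14, 6), "pat", "checkout latency  alert fired"),
]
for entry in build_timeline(entries):
    print(entry.timestamp.strftime("%H:%MZ"), entry.author, entry.text)
```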
Make sure to set a cadence for updating the status. Even if nothing interesting happens, post an update anyway.
Escalation
Be prepared to handle expected and unexpected emergencies. Make sure you always have a single dedicated decision maker that makes the call on what should happen next. If multiple people feel empowered to make unilateral decisions and production changes at the same time, you are much more likely to exacerbate the situation and prolong the outage.
Dealing with an outage is an important area to master in itself, whether in person or remote. Good starting points for learning how to handle incidents include the Managing Incidents chapter in the SRE book and the follow-up Incident Response chapter in the SRE Workbook.
Step 3: Post event
After the event concludes, you should conduct a post mortem of the entire process. The three pieces of information you want to collect are: what went well, what went wrong, and where you got lucky.
Note that through all three of these sections, you want to keep the investigation blameless. Avoid statements like “X did something”, and instead use “thing was done”. If you want to make sure there is an audit trail, you can add a link to the code or an audit log, but the goal of this document is to highlight system issues and successes, not point to a person.
This post mortem should focus on the virtual war room itself. We recommend that teams write two post mortems: one about the event (e.g. we made a million dollars!) and one about the virtual war room operations. When filling out the three sections, consider some of the following prompts:
- How did communications go? Did everyone know what was happening and when?
- If there was an outage, did it follow the normal flow?
- Did everyone have the correct permissions?
- Did everyone know what to do and when?
- Were conversations spread across lots of different mediums, or were they all in one space?
- Did we communicate with our vendors well?
- Did the war room run for the right length of time?
- Did we learn things that we could apply to our normal operations?
Make sure everyone involved in the event and war room has a chance to contribute to the post mortem, but give it a single owner. After folks have had a chance to comment on and expand it, publish it to the whole company so everyone can learn from how you ran your virtual war room!
If you want to learn more about writing post mortems, check out the following resources:
The approach above might look daunting, but by following it with the right methodology and organizational mindset you can execute a successful holiday season and lay the groundwork for a responsive and secure virtual war room. And remember, the Google Cloud team is here to help. To learn more about getting started on Black Friday/Cyber Monday, any other upcoming event preparations, or general best practices to manage risk, reach out to your Technical Account Manager or contact a Google Cloud account team.
A special thanks to Yuri Grinshteyn, Site Reliability Engineer / CRE; Nat Welch, Site Reliability Engineer / CRE; Ahsan Khan, Program Manager; Dan Tulovsky, Site Reliability Engineer / CRE; Fabian Elliott, Technical Account Manager, for their contributions to this blog post.