Incident Management with Adrienne Walcer

Duration: 39m57s

Release date: 24/05/2022

Adrienne Walcer discusses how to approach and organize incident management efforts throughout the production lifecycle.

Visit https://sre.google/prodcast for transcripts and links to further reading.

Transcript

Hi everyone, welcome back to the SRE Prodcast. This is a limited series where we're discussing SRE concepts with experts around Google. I'm Viv and I'm hosting with MP today.
Hi MP. We have a special guest today, Adrienne. So hello and introduce yourself.
Sure. Hey y'all. I'm Adrienne Walcer. I'm a nine-year Googler and a technical program manager in site reliability engineering.
I'm also the program lead for incident management at Google. You may have seen me before through a "wrote a book" thing.
I'm still self-conscious about it because I'm pretty sure my mom has just downloaded it several thousand times. She's lying to me about it.
It's called Anatomy of an Incident. That is with O'Reilly Publishing.
And I also had the pleasure of hanging out at the last USENIX LISA conference for all of you large-scale distributed systems nerds.
Wow, what a weird and specific thing to be nerdy about. Yeah. How are you all today?
We're great. We're glad to have you here. And I am glad you know so much about incident management because that is what we are here to talk about today.
Nice.
So in our previous episode, we talked a little bit about just like generally being on call.
And now we're kind of at the point in the process where like, hey, it's whatever your favorite day of the week is, and you're at home like chilling or if you're me taking a nap.
And your pager goes off. So I guess the big question is like then what or I don't know if you want to back up a little bit.
But where do we start when we talk about incident management?
I've noticed that there are two primary schools of thought around incident management.
The first school of thought is that you manage an incident and then it's over.
You move on. An incident, in this view, is a single point-in-time event that occurs that you need to get through in order to go back to doing whatever you were previously doing.
The second school of thought is that incident management is a practice that you do every single day with every single piece of engineering that you touch.
Incident management is a continuous cycle that will exist throughout the life cycle of your system.
I am a firm believer in number two.
So even though your pager is going off at one moment,
I believe pretty firmly that you are managing the incident from the second that you wrote the root cause into your technical stack.
I like it.
Definitely good to back up then.
If you are always in the cycle of incident management, then how do you know where to start when you want to kind of get into best practices?
Great question, Viv.
So there's the difference between your perception of a system and how the system is actually functioning.
Usually we first get involved in the incident management process when a signal has been raised to indicate that our system is no longer in the expected state of homeostasis.
Something has changed and an alert has gone off.
A user sent you a nasty email saying, what the heck?
I lost all of my angry birds or something.
You get an indicator that something isn't normal.
Ideally, these are automated, but that's a problem for later us.
And it's this indication that your system is in some kind of hazard state, some kind of outage state, some kind of incident state,
that's usually when we start engaging with the incident management process, because something has tipped us off that, oh, we need to act, and we kind of need to act now.
But realistically, we build our incidents and our outages into systems as we engineer them.
Every time we create a dependency, we create the possibility that that dependency won't be available.
Every time we create a config, we create the possibility that that config is going to be out of sync with what your system really needs.
So on a continuous basis, it's the cycle of triggers and a matrix of root causes.
And these root causes are going to be the underlying challenges or issues that your system might have.
And these triggers are environmental or system context shifts that enable those underlying system hazards to become problems, to become incidents, to set off those alerts and set off those pages.
And what those really are are indicators that something isn't the way it usually is.
This doesn't always mean that an incident is necessary and that it's time to raise the alarm and wake everyone up.
No, it means that it's time to apply a little bit of thinking and diagnosis and triage to that system, and identify whether this marker that is no longer moving in a normal direction
is indicative of something really problematic, or whether it's just an environmental shift that your system knows how to handle.
And then the incident management process from there is working through responding to that incident, if that incident exists, helping to mitigate whatever challenges are thrown your way, helping your system recover,
become more stable, grow further, and then eventually another incident will happen again.
So it's a bit of a cycle and we move through these phases of planning and preparation, of incident occurrence, of response, of mitigation and of recovery.
And usually these recovery actions, these things that we take to build a more stable system are the same things that we should be doing in order to better prepare and train and be ready for incidents.
Because those hazards, they already exist.
The state of building something means there is uncertainty as to that thing's future.
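To make that cycle concrete, here is a minimal Python sketch of the phases just named (planning and preparation, incident occurrence, response, mitigation, recovery); the Phase enum and advance helper are purely illustrative, not part of any real tooling.

```python
from enum import Enum


class Phase(Enum):
    """Phases of the incident management cycle described above."""
    PREPARATION = "planning and preparation"
    OCCURRENCE = "incident occurrence"
    RESPONSE = "response"
    MITIGATION = "mitigation"
    RECOVERY = "recovery"


# The cycle loops: recovery work feeds back into preparing for the next incident.
NEXT_PHASE = {
    Phase.PREPARATION: Phase.OCCURRENCE,
    Phase.OCCURRENCE: Phase.RESPONSE,
    Phase.RESPONSE: Phase.MITIGATION,
    Phase.MITIGATION: Phase.RECOVERY,
    Phase.RECOVERY: Phase.PREPARATION,
}


def advance(phase: Phase) -> Phase:
    """Step to the next phase of the cycle."""
    return NEXT_PHASE[phase]


if __name__ == "__main__":
    phase = Phase.PREPARATION
    for _ in range(6):  # walk a little more than one full loop
        print(phase.value)
        phase = advance(phase)
```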
It sounds like you're using hazard as a bit of a term of art. Could you define that for us?
Yes.
So when I'm thinking root cause and trigger or hazard, so a hazard can exist in a system for an indefinite period of time.
The system environment needs to shift somehow to turn that hazard into an outage.
And essentially root cause kind of equals the hazard.
In reality, a root cause is usually a very complex matrix of factors of timing of user behaviors.
But it's, it's saying root causal matrix is kind of a mouthful.
So when you say hazard state, that means that your system is in a state in which something is vulnerable to error.
And that's kind of cool.
But that system environment needs to shift to transition that hazard, which can exist in a stagnant, non-moving state for a while.
Something needs to happen for that to become a problem.
I mean, think of it a little more broadly.
A gas leak in a house can go on for hours, but it's not until an electrical switch flips or a burner lights up before that gas leak turns into an outage or a problem or an incident scenario.
There needs to be some kind of environmental shift.
And that environmental shift, usually we refer to that as the trigger.
So it's related, but it's a separate thing.
And these two work in concert in order to create hazardous or dangerous scenarios.
And when you think about prevention efforts, sometimes it makes sense to work on one of these underlying hazards within a system and make it more safe.
And sometimes that's a little bit more difficult to do.
But sometimes it makes sense to build stronger prevention around these trigger conditions, essentially building a safer environment for the system to operate in.
So you could get more of what you want, which is safety, which is security, which is reliability.
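To put the hazard-and-trigger vocabulary into code, here is a minimal, illustrative Python sketch; the Hazard and Trigger classes and the becomes_incident helper are hypothetical, not anything from Google's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hazard:
    """A latent weakness that can sit dormant in a system indefinitely."""
    description: str


@dataclass
class Trigger:
    """An environmental or system-context shift that can activate a hazard."""
    description: str


def becomes_incident(hazard: Hazard, trigger: Optional[Trigger]) -> bool:
    """A hazard only turns into an incident once some trigger occurs.

    Real root causes are a whole matrix of factors; this binary check only
    illustrates the hazard/trigger vocabulary from the conversation.
    """
    return trigger is not None


if __name__ == "__main__":
    gas_leak = Hazard("gas leak in the house")
    print(becomes_incident(gas_leak, None))                      # False: hazard stays dormant
    print(becomes_incident(gas_leak, Trigger("burner lights")))  # True: trigger fires
```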
I know it's super nerdy.
Super nerdy.
But these are some of the language bits that we use in the field of systems engineering to talk about broad systems.
They aren't inherently software engineering terms, but you know, SRE is a combination of systems engineering and software engineering.
So it makes sense to borrow from both fields of study.
And there's a lot of really interesting literature and nerdiness, general nerdiness, around the creation of hazard states and fault analysis within systems.
And all of it creates really juicy learning tidbits that I try to get everyone to consume.
And sometimes it's like getting kids to eat their broccoli, unless you have one of those weird kids that loves broccoli, in which case, jealous.
But sometimes folks see the value of some of this deep nerddom.
And I'm hoping that a lot of folks will take this as what it is.
It's a language that can be used in order to describe incidents and incident management and start some of these more critical important conversations either on your team and your company, maybe around your house.
I don't know what you do in your free time.
Going back to the stages of an incident that we were talking about, there's mitigation and recovery and kind of that improvement side of the story.
This framing of hazards that I'm now hearing about is brand new to me.
And now I'm thinking about on the recovery side that you can either try to eliminate a hazard, but you can also try to manage a hazard.
You don't necessarily need to, at least, that's my initial intuitive response is you don't necessarily need to eliminate hazards.
You can manage them.
Is that intuition true?
Yeah, I mean, hazards exist everywhere.
And a lot of them are totally out of our control.
Like things we can't inherently control, or user behaviors, whether or not your app or video goes viral because of a shout-out on a local news show.
We can't control the speed of light, or the size of the earth, which is woefully inconvenient sometimes.
But all of these things create essentially in some weird way hazard conditions that we need to work around in order to optimize our system for what we think, or maybe what we know, or maybe what we hope our users need.
And here's a really bad analogy.
Being outdoors, kind of hazardous.
There's bad weather.
Maybe there's animals.
Maybe there's bears.
You get bugs.
Just weird stuff happens outdoors.
I'm an engineer, Viv.
I'm indoors.
Don't laugh at me.
No, come on.
We're all indoors here.
I just, I think it's funny because the way you said that made it sound like the outdoors is just like this perilous, like, nightmare.
Don't step outside.
You know, the mosquitoes will get you.
I do believe the Google Australia sites are hiring, if you're interested in a thousand things like the ones you heard earlier.
But outdoors, many hazard conditions, so many hazard conditions.
But how do we work on these set of hazard conditions in order to keep our human body safe?
We live in houses or in buildings or in tents or yurts.
I don't know what people do.
But that's essentially a mitigation from the series of hazard conditions that exist for living in the world.
I know so deep, right?
I am a puddle.
But that, it brings up some really interesting things because incidents happen.
They're weird.
They're confusing.
Maybe we like them.
Maybe we don't.
But we need to figure out what is the best work that your team can do after an incident.
And by thinking about things in terms of this language of like hazards and triggers, you could figure out where your team or where your org or whatever can exert the most control in order to make the most efficacious changes on that system and on that system environment.
You know, your, your incident has been mitigated.
Your system is stable again.
Does it matter that you understand the underlying problems?

Because you want to prepare for the next problems and you don't want the same problems that you just dealt with to come back again.
Something that our SRE vice president Ben Treynor has said on numerous occasions is that he only wants new incidents.
You know, if we've seen something before, we don't want to see it again.
We're not into the replay or repeat button on the record player.
Wow, record players. I am old.
Doing the work to understand what is shifting within your system and how it works, that'll help keep you from having those same repeat issues over and over again.
That makes sense.
So, you mentioning that you don't want repeat incidents reminded me of something. I think we might have jumped a little bit over this, but I'm curious if you have thoughts on it specifically.
So you said like, okay, yeah, like your incident is mitigated.
Sometimes it's easier said than done, right, especially in the moment.
I guess I'm curious before we even get to the, you know, how do we want to move forward?
How do you make sure this doesn't happen again?
You know, etc.
Whatever steps come afterwards, when you're there in that moment, if you're looking at something that is totally new, you triggered a hazard you didn't realize was around.
What can you do?
I don't know.
What is your advice in this case?
Or, you know, give us another nice outdoor analogy; I'm a little lost on the analogy right now.
No, no, I feel ya. Like, I think the most terrifying part of the incident management lifecycle is the worry that you're not going to be able to do anything.
It's that moment when an alert or your pager goes off and there's those few seconds of like, oh gosh, is this a thing I can deal with?
And then there's always that like, is this a thing I can't deal with?
Because those are the moments where I always have that instinctive feeling of like, well, time to go and cry to my manager and hope that they have sufficient magical wizardry.
They can handle the big evil problems.
So an incident happens, you're staring at something you don't know how to deal with.
I have two things that I love for people to think about at this point in time.
Number one is figure out immediately or as quickly as possible if you're having any user impact.
Because realistically, if your system is sufficiently robust, a little bit of outage might not be super noticeable.
But if your system is quite sensitive to its current balances and configurations, if it can't take a lot of really sudden fluctuations,
what you want to make sure of is that your end user is having an acceptable time interacting with your system.
So for there, we want to do things called mitigations.
And Jennifer Mace gave a great overview of mitigations and mitigative activities.
I believe she appeared recently on the SRE Prodcast.
She did indeed, she's our first guest.
Awesome.
So if you see immediate user impacts, like I love to prioritize having some kind of band-aid to bring about a little bit of stability such that you have time in order to figure out what's really happening within your system.
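As a rough sketch of that "check user impact first, band-aid second" ordering, here is a small illustrative Python example; GENERIC_MITIGATIONS, first_response, and apply are hypothetical stand-ins, not real incident tooling.

```python
from typing import Callable

# Hypothetical generic mitigations, roughly ordered by how quickly they buy time.
GENERIC_MITIGATIONS = [
    "roll back the most recent change",
    "drain traffic away from the unhealthy location",
    "shed or throttle low-priority load",
]


def first_response(user_impact_detected: bool, apply: Callable[[str], None]) -> None:
    """Check for user impact first; if users are hurting, reach for a band-aid
    mitigation before digging into root causes. `apply` stands in for whatever
    tooling actually performs the action; it is not a real API."""
    if not user_impact_detected:
        print("No visible user impact; keep diagnosing before acting.")
        return
    # Users are affected: buy time with the first applicable generic mitigation.
    apply(GENERIC_MITIGATIONS[0])


if __name__ == "__main__":
    first_response(user_impact_detected=True, apply=lambda m: print("Applying:", m))
```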
And depending upon the size of the problem and how many folks you're going to need in order to resolve that problem,
that's where some of the art of incident management comes in.
So here at Google, we use a cool variant on the FEMA incident management system.
We call it IMAG, Incident Management at Google.
And we believe that an incident is an issue that has been escalated and requires a kind of immediate, continuous, organized response to address it.
This means you're going to need to organize your team in order to make progress on this issue.
And this means putting together some kind of organizational structure such that you can make progress in parallel with your teammates.
But we can get more into that later.
I said that there are two things I like to think about when a pager immediately goes off.
So the second thing that I like to think about is knowing who is accountable when an incident is happening.
When pagers go off, it's really simple.
Whoever's pager is going off is somehow responsible or accountable for that incident.
But when you think about broader things, it can get really confusing who owns what or what piece of organizational muscle should be activated in order to work on and make progress on a given scenario.
So figuring out in advance who is accountable, who owns the thing, who works on things when something happens.
Those are the things that can buy you a lot of really valuable time during an incident.
Because that deer in headlights response of, oh gosh, who does the thing?
When you multiply that by the number of people on your team, the number of people having that same response, what initially starts out as seconds in figuring out who's accountable, can turn into minutes, can turn into hours, can sometimes turn into days of figuring out what sub team or what person is accountable for resolving an incident or an issue.
And that's where you can really lose organizational velocity in addressing your users' needs.
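A minimal sketch of the "figure out accountability in advance" idea, assuming a hypothetical ownership registry; the component names, contact aliases, and accountable_team helper are all made up for illustration.

```python
# Hypothetical ownership registry: deciding accountability in advance is what
# buys back those seconds, minutes, or hours during a real incident.
OWNERS = {
    "checkout-frontend": "shopping-oncall@",
    "payments-api": "payments-oncall@",
}


def accountable_team(component: str) -> str:
    """Look up who is accountable for a component, with a catch-all escalation
    contact so nobody has to guess mid-incident."""
    return OWNERS.get(component, "escalations@")


if __name__ == "__main__":
    print(accountable_team("payments-api"))    # payments-oncall@
    print(accountable_team("mystery-binary"))  # escalations@
```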
And it's incidents like those where everyone's trying to figure out who's responsible, where the person who ends up looking like a hero, and I normally don't condone any kind of heroism, is the person who's just willing to raise their hand and move first.
Because when really messy things happen, any kind of leadership is helpful.
You just need someone who's willing to take the first step.
And particularly at Google, we've got a massive organization, like 100,000-plus employees. Figuring out who's in charge of things and who makes a move when bad stuff happens?
That's a continuous and ongoing challenge.
That's also been some of the places in which we've been most successful during major incidents.
During a couple of really recent big blow-ups, you think of the Log4Shell vulnerability in Java, where you just get a little bit of remote code execution.
It's cool, we all do it.
Major security vulnerability, like threatening the whole Internet.
One of the cool things that Google was able to do was we were able to mobilize really quickly.
It took a matter of hours before we had some teams on pretty much every product area working to address these issues and bring things to closure.
And I almost have no idea how it happened, even though I was working on it.
But the ability of a company to mobilize quickly, figure out who's accountable and make some concise steps forward, that's a strength.
That's a huge strength towards business continuity.
And if you and your team ever have that deer in headlights kind of moment, number one, totally okay.
Everyone does it. Everyone does it.
It's cool.
But take a little bit of time and diagnose those challenges.
What are you feeling when you first see an incident?
Is there anything that you wish you were better prepared on?
Is there anything from previous incidents that now scares you to work on them?
Because incident management is about building good habits all across that life cycle.
And if you can take some of these best practices and turn them into habits, you won't be relying upon looking up a playbook to figure out how to resolve something.
You'll have that internal intuitive understanding of what those next steps should be and how we can communicate and work together to resolve an incident.
And that means you will have more uptime and everyone will be happy and will hold hands under a rainbow.
Even though the rainbow is outside, just to clarify.
The rainbow is outside. It is dangerous there.
We have also built windows in order to see the rainbow while being inside.
Because innovation is key for incident management.
Oh my gosh, I'm going to make myself puke.
Oh, it's okay.
Something you said stuck with me a little bit: getting slowed down by determining ownership of an issue. In my mind, I'm more familiar with that in the recovery stages, where it's been mitigated and then it kind of gets lost in the morass of everything else going on.
But I wanted to think more in the moment where I have heard it phrased a few different ways.
My manager likes to use the phrase, you have the keys to the car.
I've heard other folks phrase it as temporary director-level authority for, like, that SRE that has the pager.
You're the one with the pager, you own it.
This is your problem until it's over.
Yep.
But when you get messy distributed stacks where problems can like slip through the cracks or something can happen across multiple systems, you need to determine who's actually in charge.
Yeah.
Yeah.
Oh, that's a, that's a messy one.
But that's where categorizing different types of incident responders can be kind of helpful.
If you have a big enough and messy enough technical stack, like we do here at Google.
Oh yeah, it's the messiest.
I love it.
But we have essentially two types of incident responders at Google.
We have component responders.
And these are incident responders on call for one component or system within Google's overall technical infrastructure.
And then we'll also have systems of systems responders.
And these are folks that are on call to support incidents that might span multiple component systems, incidents that fall between system boundaries.
Or sometimes they're just the folks around when anything gets messy.
I mean, the reason that we hire teammates is because often the scope of a technical stack grows beyond one person's capacity to understand and maintain state.
So we split up that technical stack such that multiple component responders can provide coverage on a single component of the whole stack.
For example, you look at our ads team: we have folks on call specifically for video ads, specifically for YouTube ads, specifically for mobile ads, specifically for, what do we got, Google Ads, AdX, we've got all the ads.
And there's somebody different who's usually holding the pager for each.
And these are going to be examples of component responders.
We maintain kind of a limited scope primary responder to resolve some issues.
But there we run the risk of remaining ignorant of production issues that span multiple components or fall between system boundaries.
Or we also risk not providing individual component responders with sufficient support if an issue proves to be beyond their expertise.
And it's important when somebody is going on call that they don't feel alone.
Oh man, what an experience that must be.
So adding a second line of defense that's a little more holistically focused can provide some real advantages.
So I talked about a couple of different potential component responders for our ads team.
But we also have an overarching ads incident response team.
And these are a group of folks that just have a lot of tenure working on ads products.
And if something arises that goes across multiple components that seeps between multiple ad systems, they're there to provide overall coverage and structure for the whole ads product.
And we do this at a couple of levels at Google, but by differentiating between individual component folks as well as systems-of-systems folks, you're able to scale your incident response and kind of build an organization really well designed for the technical enterprise that you've built.
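Here is an illustrative Python sketch of that split between component responders and systems-of-systems responders; the rotation names and the route_page helper are hypothetical, not Google's actual paging setup.

```python
from typing import Iterable

# Hypothetical rotations: one per ads component, plus an overarching
# systems-of-systems rotation for issues that span or fall between components.
COMPONENT_ONCALL = {
    "video-ads": "video-ads-oncall@",
    "youtube-ads": "youtube-ads-oncall@",
    "mobile-ads": "mobile-ads-oncall@",
}
SYSTEMS_OF_SYSTEMS_ONCALL = "ads-incident-response@"


def route_page(affected_components: Iterable[str]) -> str:
    """Page the single component responder when exactly one known component is
    implicated; otherwise escalate to the systems-of-systems responders."""
    known = [c for c in affected_components if c in COMPONENT_ONCALL]
    if len(known) == 1:
        return COMPONENT_ONCALL[known[0]]
    return SYSTEMS_OF_SYSTEMS_ONCALL


if __name__ == "__main__":
    print(route_page(["mobile-ads"]))               # one component responder
    print(route_page(["mobile-ads", "video-ads"]))  # spans components
    print(route_page(["something-unmapped"]))       # falls between boundaries
```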
I can spill more details if you need, but I don't know.
I'm just thinking about mapping that to my own experience, where I almost feel like my team's a little in between, because it's all one system, but it has a lot of subsystems all maintained by different developer teams.
So it's kind of like each binary is an entirely different development team, and they don't necessarily really talk to each other a lot, but it's all one giant thing.
And then we have the one to two dozen binaries that we're responsible for across this whole chunk of systems.
And then we talked to external teams.
Oh, that sounds messy.
So a lot of our job being on call is figuring out what is actually broken.
Because usually the thing that annoys us isn't necessarily the thing that actually is having the misbehavior.
And then, okay, who can actually fix it? Can I fix it? Do I need a developer to fix it? Do I need Spanner SRE to fix it?
Yeah.
And I mean, the idea of component is going to be flexible based upon the size of your stack, the size of what you're handling.
So it sounds like you're sitting on top of a really big component.
And yes, the cool thing is that you build a lot of experience in troubleshooting different types of systems.
So you become a more skilled generalist; you learn to be a little bit more holistically focused.
You learn engineering, as in how all of the things work together.
You also get to learn some cool stuff about like triage and how to organize others in messy scenarios.
You learn how to command complex situations and diagnose systemic behaviors.
So your team is in a position to give you some real career advantages.
But at the same time, I would imagine it would feel a little bit invalidating if you're told like you should be expert on all the things.
Because it's not possible.
It's not remotely possible for my team at all.
I mean, but it's possible for you because you are brilliant.
I say that with 100% genuineness. No one gives themselves enough credit.
But it is, it's tough.
And in those types of scenarios, you have to be able to review your team's operations:
comparing notes on what's causing problems, what's going outside the quality control thresholds,
and looking over stuff like postmortems and practices in bulk.
It's going to be more important to identify patterns from across your whole team of all the problems that you're seeing.
Because you're not just looking for behaviors within one more isolated system.
You're looking for patterns of behavior across multiple systems.
Across multiple systems that sit together and talk together with different dev teams involved.
And being able to coalesce all of those nuances that you've now witnessed into really concrete ideas of what's the best thing to work on now.
That's a position of power, but it's also a lot of work.
Which is why it's important really not to burn your teams out.
Incident response is exhausting.
Holding a pager is exhausting.
And because it's such a human expensive activity, it makes a lot of sense to do it as sparingly as possible.
And really work on prevention, work on preparedness such that incident response isn't as human expensive as it can feel.
Because sitting at the helm of 12 systems, MP, I got to ask, is a single on call shift tiring?
Depends on the day.
We're kind of spiky.
We tend to like, we'll have a week or a week and a half for the pager.
We'll just be really silent and nothing will happen.
And then some hazard will start finally manifesting.
And then it will just wreak havoc for a little while.
And then things will quiet down again.
I feel that. Launches never go the way we want them to.
But it's tiring work and it's, you know, it makes sense to do it as carefully as possible.
Yeah, absolutely.
Yeah, lots of complications.
Another thing I was thinking about, not to jump back too far, but when MP was talking about all the complexity and we were talking about different types of incident responders:
you also, and I'm just saying this from experience because it's happened to me recently, you also get these scenarios where maybe multiple teams get paged at the same time and it becomes like a bigger thing.
How do you coordinate that?
Or, you know, in my case recently, somebody paged me for something they got paged for and then I paged somebody else.
It can also get complicated too, not just in the general expertise before and after, but while you're there, as you mentioned, having organization and having control of like who's going to do what in the moment is important.
It can be really tricky.
It can be.
Yeah, I don't know what the leading advice on that is, but.
This is where having a really solid incident management protocol that all of your team kind of knows about equally can be really helpful.
So for Google's incident management protocol, we think about, like, the three Cs of managing incidents, which are command, control, and communications.
Command is making decisions and keeping the team or sub-team focused on the same goals. Control is knowing what is going on, coordinating people, and being continuously aware.
And communications is taking notes, being clear, and ensuring that everybody has the same context.
Being able to fall back on to the same set of things can be really, really helpful.
When you're pulling people in, it means that you don't need to explain who is doing what and how; you just see the role title and you understand, and you know how to move through things.
So our version of the FEMA incident command system has defined roles like incident commander, scribe, and communications lead, and by using a shared and clearly defined process,
we build really positive emergency response habits, including maintaining active state, a clear chain of command, and just an overall reduction of stress.
Everyone understands who to go to in an incident and how to hand off.
So essentially, a chess player can't drop a bishop on a mahjong table and expect that everyone knows what to do with it.
You know, in urgent situations, it's important that all players are playing the same game.
So by using a common incident management protocol, you can page in a bunch more people and you're all going to understand where to go, what to look at, how to talk to each other, who's generally in charge of what.
And it's in part the role of the incident manager to make sure that this information stays in a rigid, concrete kind of state as folks move in and out of the incident.
Maintaining that protocol, maintaining that incident management structure, will help to drive clarity through the escalated state.
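As a loose illustration of keeping roles and shared state explicit under a common protocol, here is a hypothetical Python sketch; the Incident dataclass, its fields, and the hand_off helper are illustrative, not IMAG tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    """Minimal record of an incident's command structure and shared state,
    loosely following the roles named above. Field names are illustrative,
    not real IMAG tooling."""
    summary: str
    incident_commander: str    # command: makes decisions, keeps the team focused
    scribe: str                # communications: keeps notes and shared context
    communications_lead: str   # communications: updates dependent stakeholders
    log: List[str] = field(default_factory=list)

    def note(self, entry: str) -> None:
        """Append to the shared log so everybody keeps the same context."""
        self.log.append(entry)

    def hand_off(self, role: str, new_owner: str) -> None:
        """Control: explicit, logged hand-offs as people rotate in and out."""
        old_owner = getattr(self, role)
        setattr(self, role, new_owner)
        self.note(f"{role} handed off from {old_owner} to {new_owner}")


if __name__ == "__main__":
    inc = Incident("elevated checkout errors", "viv", "mp", "adrienne")
    inc.note("Mitigation applied: rolled back release 42.")
    inc.hand_off("incident_commander", "next-shift-oncall")
    print("\n".join(inc.log))
```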
And when things get really big and really messy, and they totally do, there often isn't enough mental space for one incident responder to work out the appropriate mitigations, coordinate implementing those mitigations,
communicate to everybody that's dependent upon the system, and manage expectations.
Bringing in more people is helpful. No one can do all that on their own.
So having like a structure that you can fall into, maybe like some standardized tools or plans that you leverage in order to track how this protocol is going.
These are some of the tools that are going to allow you to move smoothly as an organized unit through really, really messy times.
So when you get that, like, deer-in-headlights response, sometimes even a sloppy first step is to be like, all right, I'm the incident commander.
I know how to do incident commander. What are the things that I remember to do?
Okay, let's start by broadly checking things out for context.
That's an incident commander role.
And hopefully you'll be less awkward than I am when explaining it, but it gives you a little bit of muscle and a little bit of habit to fall back on.
You know, it's playing a game that you're very used to that you know the rules of.
And in that messy scenario, taking the first step forward is going to be the most important step.
Yeah, thanks for diving into that a little bit.
So if there's one lesson or concept that you could instill in the hearts of all of our incident responders, all of our on callers, our pager holders, our SREs, what would that be?
What's your number one takeaway for incident management that everyone should try to internalize?
Everyone should try to internalize.
Incident response is a really human expensive activity.
Multiple people need to be involved in driving an incident from initial alerting to resolution.
And the act of incident response is to put a mitigation on either that root cause, that hazard, that problem scenario while it's happening in order to buy some time to make judgments about priorities.
And it's disruptive, it's tiring, it wrecks all of your normal plans.
But by thinking about incident management as a broad life cycle by working on preparedness, by working on recovery, these are some of the things that over time can increase the amount of time between incidents, maybe reduce the frequency of incidents and build a more robust system.
So I guess my number one takeaway is do as little incident response as possible.
Focus on great engineering, building really sound products that can handle a wide variety of user behaviors.
Build in reliability from the ground up.
Use it sparingly.
Avoid burning out your team and start doing that engineering work needed to fix longer running issues or risks.
You have a lot of tools at your disposal, so make it happen.
Thank you.
I will definitely keep it in mind and I hope everyone else will too.
So it was really great to have you on the podcast today.
Thank you again so much for being here and sharing all of your advice and tips.
And as you mentioned, you do have a book out, Anatomy of an Incident.
And if I'm correct, you wrote that with Ayelet, right?
Yes, and she's the coolest.
And it's available on the Google SRE site, under resources.
And if you consider downloading it, my mom is going to be really proud of me.
Please do; I'll download all of it while pretending to be your mother.
But yes, everyone, please do, and I mention it because Ayelet will be our next and final guest on the podcast.
So join us to wrap up with postmortems.
So, thank you again.
And we'll catch you next time with postmortems, to close out this first season of the podcast.
And I hope I don't have to use anything I learned today while I'm on call this week.
Oh, yes. But stay inside. The outdoors is scary.
Thank you so much.
Yes, and may your pager stay silent and the bears stay very, very far away.
Google SRE Prodcast

SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!