On-Call Rotations with Andrew Widdowson (APW)

Duration: 43m58s

Release date: May 17, 2022

Andrew Widdowson (APW) shares strategies for successful on-call rotations.

Visit https://sre.google/prodcast for transcripts and links to further reading.

Hello and welcome to episode 7 of the Google SRE podcast, or as we affectionately refer to it, the Prodcast.
I'm your host for today, MP. And here with me is Viv.
Hi.
Here with us today, we have someone who I think the vast majority of SREs at Google have had the opportunity to either hear speak or receive teaching from, and who is generally a really well-known name around SRE at Google.
And he's here today to talk to us about on call. Andrew, why don't you go ahead and introduce yourself?
Hey, thanks so much for the warm welcome. I'm Andrew Widdowson, and most people at Google know me by my username, which is APW.
So feel free to call me Andrew or APW.
I've been an SRE at Google for coming up on 15 years now. Wow.
I've enjoyed the entire time. I spent over a decade working on our Google search systems: front end, back end, infrastructure.
You name it. I had looked at it.
Then I spent a considerable amount of time about four, maybe five years working on SRE education through the SRE EDU program.
And these days, I sling a bunch of data in a slightly less than typical SRE context.
I work on the global resilience of our physical Google offices and a bunch of data analysis about how our employees are doing.
So there you have it. It's been a wild ride, but I'm glad to talk to you today.
Yeah. So on call is something I think a lot of people in operations and systems administration are pretty familiar with.
Mm-hmm.
Why would you say it's SRE's job?
It's always kind of a little bit of a funny thing that we're usually the point people for these emergencies for these huge complicated systems.
But we're also not usually the experts in how any particular piece of these systems works.
Yeah.
Well, let me set the stage a little bit.
So Google is a massively large company these days, but it has always been the case at Google that there have been more software engineers, in a general sense, working on products than there have been SREs.
We are, if you will, naturally scarce.
So while I agree that in many cases for some of our most public facing, high revenue, etc., high risk sorts of products that SRE should be on call or co on call for a service.
Yes.
The other part that I just want to mention that's a part of our leverage, a part of our scarcity and a part of our selectivity is that the overwhelming majority of what's called the microservices at Google do not have SRE on call for them.
So we pick and choose not our battles, but our responsibilities.
And you can think of the fact that at least at Google, if an SRE is on call or an SRE team is on call for a product, that means that there's a certain extra standard of reliability being afforded to it.
But we try not to hoard that for ourselves.
So even in many of those on call rotations, we are co on call with our developers.
Just wanted to clarify that off the top, right?
But why should an SRE be on call or an SRE team be collectively on call for a service?
I think it's because we spend an extra, not necessarily disproportionate, but a necessarily proportionate amount of our time thinking about the art and science of reliability rather than just doing the things that make it reliable.
So we are both on call and we're thinking about how to make on call better.
Whereas the mindshare of many other software developers who may be themselves self on call for a service don't necessarily have that mindshare or that amount of time allocation to do that.
So I don't know, I think we're meant to be exemplars of on call and we will both do a great job at on call and, like I said, make it better.
It gives you a really visceral sense of the health of your systems when you're the one carrying the pager for them.
Absolutely.
I want to be careful about how I say this because sometimes our internal phrasing can sound a little snarky.
But I think it's important that our developer colleagues who don't first identify as site reliability engineers.
I also identify as a developer mind you, but I identify as an SRE first.
Our developer first counterparts need to be able to quote feel the pain of the service, the operational aspects of it in order to be motivated to make it better.
Because otherwise it's just throwing stuff over a wall, right?
You know, here take this person who isn't really an SRE.
In my mind, SRE is about having the setup with your developers, the mutual respect to say we're going to operate this together.
However, we may call you in for the big things because we know that you are a calming force and that you are especially, your brain is especially predisposed to thinking through the on call issues.
So, you know, you're not in trouble if an SRE shows up.
The cavalry is here.
You know, the extra help has arrived if an SRE is here.
I like it.
So speaking of feeling the pain of your service going wrong, right, which I think you just mentioned.
I think sometimes there's a perception that on call is just like really painful.
It's like, oh, like I have to carry the pager, like all these things are going to be terrible.
Like it is like this burden that is part of my job versus like it's a perk of my job.
I guess there's those different ways to think about it.
What makes on call a positive, if it is a positive, and how can it be a positive if it's not?
Oh, for sure.
The following might sound a little bit exaggerated, but I firmly believe it.
And that is that on call is you get to be in the driver's seat.
If you want a poor race car analogy, which I'll neglect to fully flesh out, you get to be in the driver's seat for a product for your slice of time.
And so yes, if it turns out that you're... okay, bad analogies abound.
If you're in a driving, if you're in a race car, but it's constantly backfiring, that's, that's not going to lead to a fun lap around the racetrack.
But if you're there and you're able to see the machine sing, right, if you're able to get the car to really hum and you use it efficiently, you know, like you come off the track saying, I can't wait to do that again.
So for me, I look at on call as it's a temporary responsibility that I enter into, along with everyone else who's in my on call rotation to make sure that we have the biggest non event of my shift.
So, but that being said, let's go straight into the painful on call shift.
So, again, the following is a cultural aspect of how Google treats its production operations.
And it doesn't necessarily mean that this is right for everyone, but we have, as an homage or an honor to one of our almost founding members of the SRE org, Ben Treynor Sloss, Ben Sloss.
We've named like a fatigue limit, which we've incorrectly called the Treynor limit; his last name is Sloss, we should call it the Sloss limit, whatever.
We have this limit named after our grand poobah, which says, if you get more than a certain number of incidents per shift, then that's a sign that your SLAs may need to be adjusted, probably do need to be adjusted.
I mean, it's not the case that, oh, I got paged 10 times today, and that's over a threshold of, let's call it, three, but this has only ever happened once, so we should take radical action.
No, of course not.
But if you look over some smooth window of time and you realize that the amount of time you spend doing follow through on your incidents, on your outages, which there's always a certain amount of follow through rate, helping to co author post mortems or seeing a fix through to production.
If that is over a certain threshold, you will end up in some sort of cumulative failure, like a stack of dominoes falling over where you're like, oh, but I can't write a post mortem because I'm getting paged again.
Oh, I guess I can't write the post mortem.
That sort of thing is really bad.
So we look at least at Google with the thresholds that we've set.
And again, what even is an incident?
The measuring and the math that we particularly use is less interesting than the philosophy behind it, which is that the SRE, or really any production oriented on caller, has a certain mental fatigue limit that we not only don't want to get close to, but want to very carefully defend against ever getting close to on a sustained basis, so that people can say, I'm not depleted by my on call shift.
Instead, I'm intrigued or refreshed best case, or nothing happened.
I was bored with my on call shift.
That's okay too.
So anyway, that's kind of my take on making sure that you have a sort of balanced thing and having the managerial and cultural support to say this on call rotation isn't sized right for the number of incidents or the SLA that it has or the number of people involved, etc.
That's an important aspect of making sure that on call is the sort of thing you don't run away from.
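The idea of judging page load over a smoothed window, rather than reacting to one bad shift, can be sketched roughly as follows. This is only an illustration: the function name, the window of four shifts, and the threshold of three are assumptions (three being the "let's call it three" figure above), not Google's actual measurement.

```python
from collections import deque
from statistics import mean


def over_fatigue_limit(incidents_per_shift, threshold=3.0, window=4):
    """For each shift, report whether the trailing-window average exceeds the threshold.

    threshold and window are hypothetical values chosen for illustration.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` shifts
    flags = []
    for count in incidents_per_shift:
        recent.append(count)
        # Only judge once a full window exists, so one noisy shift
        # cannot trigger "radical action" by itself.
        flags.append(len(recent) == window and mean(recent) > threshold)
    return flags


one_bad_day = over_fatigue_limit([0, 0, 10, 0, 0])
sustained_load = over_fatigue_limit([4, 5, 3, 6, 2])
```

Under these assumptions, a single ten-page shift surrounded by quiet ones never raises the flag, while a run of shifts averaging more than three incidents does.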
Yes.
So something I've noticed that is not a standard across teams, but is a common practice is to actually have both an on call and an on duty rotation.
How does this relate to that limit?
Sure.
So just to be clear, let's let's define what we mean by on call and on duty generally.
So I think on call is you are responsible for the vitality of the uptime, the responsiveness of the service during your shift.
And on duty means some sort of probably small quanta, but maybe high quantity of work that needs to be done crank turning answering tickets, answering support, whatever type stuff.
And sometimes, mechanically, some teams will make the on caller also be the on duty person because there are large swaths of time where you're maybe not getting paged for your service.
Like the goal here is not to get you exactly one page per 12 hours or something.
It's to be less than, which could be zero incidents, pages, etc. per shift.
So that's just an optimistic overloading.
But I really do want to call out that I think the skill required and the level of thinking exercised, not the level, the type of thinking exercised for on call and for for on duty could be any number of things.
But we'll just say it's very different thinking.
One is more like I'm going to do a repetitive thing or I'm going to follow some decision tree repeatedly and just make a bunch of copies or whatever it is that you do when you're on duty.
But for on call, it's much more about, as our lead of SRE education, Jen Petoff, says, being stewards of the scientific method in a pressure cooker, at least while you're being paged.
So the only thing I would guard against, if you were to ask, would be making sure that on duty does not drain the on caller so that when they do get paged, say towards the end of their shift, but while they're still on call, and they go, man, I just, I just did a bajillion thousand tickets and now I'm getting paged.
Oh, my head hurts.
Like that's that's not setting a team up for success.
So some teams, like I said, optimistically overload on call to also include, like while you're here, do on duty, but others keep it separate.
And I really, it comes down to how capable are you and your team of being simultaneously on call, even if almost quote unquote, nothing happens, and also doing some other element of day job work.
So like, I particularly like coding while I'm on call.
And then I'll just make sure that, you know, I write copious amounts of structural documentation as I go.
I often kind of code comments first.
So I've set myself up to be kind of fast: I can resume my coding quickly because I always bookmark where I was, if you will.
For some people, they're like, man, I can't get anything done when I'm on call.
So just sure, give me the on duty.
I don't know why I'm using that voice for man, I'm so overloaded, but it really does.
It depends on the sort of work, right?
You know, you can have on callers who are doubling as software engineers, you can have on callers who are doubling as technical program managers.
And so they may have different skill sets in what they do when they are not being paged.
So if you can have, say, two people splitting the work of like on call on duty, right?
You know, you're saying it could be one person, it could be two.
Could you have two people who are on call for the pager at the same time?
Like, what is the people distribution going to look like?
And are there some guidelines on setting up your rotations?
Absolutely.
So let me first answer the, can you split being on call at the same time?
And then how would you construct maybe the demographics and distributions of people involved in an on call rotation?
So keep me honest, let's bookmark that.
In general, I consider on call to be both a symbol and a responsibility.
So in the responsibility part, it's like, oh no, we're spewing 404s on our front end or 500s or something.
And you got to go figure out why it is and make it better, faster, more like it never happened.
But it's also kind of a, like I said, a symbol, a lightning rod, if you will.
So if you are the on caller, you are kind of being given a token that says you are the decider for the reliability of this product.
At Google, we have a saying, which is: when you're on call for a Google product, you have temporary equivalent authority to a vice president.
So, and that's, that's really true.
Like, you may need to make the decision; a part of being on call is deciding, and you may not be the most knowledgeable person in the decision space, but you are the most present and most stateful decider until such time as someone can augment you at your request.
So this is an indirect way of getting to your question.
What if you split the work?
Okay, so it's my personal preference to have on call and on duty be completely separate because I think it gives more agency to the on caller, the single on caller to say,
I choose to do whatever I do when I'm not being paged, whatever is best for my work and day job, as opposed to I'm being told, please do tickets or this or that.
But if you were to split on call by say having two on callers, regardless of how on duty is split, I think the ability to respond to an incident may be slightly improved, possibly.
But the authority agency visibility of who is the on caller, I need to talk to the on caller is reduced because you now have two people.
You have maybe heard of phrases like too many cooks in the kitchen, or if a problem is everyone's problem, then it is no one's problem.
Or the classic, if you don't get what you want from one parent, you ask the other.
So needless to say, I am, for the most part, in favor of a single person being the primary on caller.
And if you have a second person for extra support for relieving for breaks for you need to go to the grocery store.
I also recommend a secondary, but exactly that a primary and a secondary rather than two co primaries and no secondary.
That's my personal preference.
I think it solves for efficacy of communication and it also solves for agency of the on caller.
Yeah, when I think about it, the thing that would throw me off the most about trying to have two primary on callers is, well, how you're splitting who's taking what page, and doing that in some kind of an equitable way.
But even then, once you start, like if you're just kind of going back and forth, just alternating, there's a visibility loss there potentially that as one page can be the consequence of the thing that happened 15 minutes ago and isn't actually a separate problem.
Yes.
The other argument I would make here is, if you were to be so unlucky as to have two problems coming in, in short distance from each other.
And it turns out that they are in fact completely unrelated or mostly orthogonal.
Having your secondary remain mentally fresh so that when you get paged again and you look at the thing, you go, oh, that's not related to this at all.
Dear secondary person, would you please take this?
Then this also reduces the cognitive burden for each of you.
I think that's kind of related to what you're saying.
Mm hmm.
Yeah, definitely.
So I think the second part of your question, if I'm hearing it right, was about how would you design parts of on call or would you like to throw that question back to me?
I just want to make sure I understand.
Sure.
Yeah, I was just asking about the rotation since we were talking about how you might staff it, other parts of the rotation.
So maybe there's more opinions on staffing, but also how long is your rotation?
What does it cover?
I don't know.
I know I'm throwing more questions at you in response to us bookmarking a question for later.
This is great.
I welcome all the questions.
So hear me out, fair listener, when I say the following.
Because again, same disclaimer as before, this is just an overview of ways that Google has chosen to solve things and size things.
And so if, if you hear something in the following that you feel like could never work for your company or team, that's okay.
But hear me out on some of the different tradeoffs and things that we've evolved into.
So for one thing, we tend to pretty heavily staff our SRE teams.
So we have an internal minimum standard of having six or maybe seven, I've lost track, people on each of two sides of an ocean, as it were; two geographically very different sites must each have at least six or seven people in them.
So that would be cumulatively 12 to 14 people at minimum in order to have an officially SRE funded team and on call rotation.
And so that has implications for all of the rest of the math that we're about to do.
You might say, I'm a team of three, this doesn't make any sense for me.
That's totally fine.
But hear me out on some of the additional timing that we do.
It is on an SRE team by SRE team, well, for that matter, any on call team basis to determine how much time an on caller spends per shift, how many shifts you have per day or per week, etc.
But of course, and this is an important thing to flag, it is the responsibility of anyone making an on call rotation to comply with local laws, local employment laws, regarding work hours spent and so forth.
And so when I say the following, I mean this for a majority of teams, but there are always exceptions based on where some of our on callers are based.
Some teams choose to have a, let's call it seven unique days a week, like a one day of on call, then further split into say 12 hours.
So like 12/12.
And what this might mean is let's say that you have different people on your team and I'll call them ABCDE.
Those are all people.
It might be that on day one, you have 12 hours of A and 12 hours of B.
And then on the next day, you have 12 hours of C and 12 hours of D where A and C are in one continent and B and D are on another.
So that's, that's a case of a dual homed on call rotation.
That is 12 hours within each 24 hours.
Okay, so that would mean you could have conceivably up to 14 different people a week.
Of course, you know, some people may choose to coalesce and say, I'm going to take Monday, Wednesday, Thursday, or, you know, I want to take Friday, Saturday, Sunday.
They could choose to do that by horse trading with their colleagues.
But the idea is you roll the dice well in advance, and you could have a completely new person for tomorrow's 12 hours in, let's say, California time than you did in yesterday's 12 hours of California shift or whatever.
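The daily, dual homed 12/12 arrangement described above can be sketched as a simple round robin. This is a minimal illustration only: the function name and the round-robin assignment are assumptions, and real rotations also involve shift trading and local employment law.

```python
from itertools import cycle


def dual_homed_week(site_one, site_two, days=7):
    """Pair one on caller from each site per day; each covers 12 of the 24 hours.

    Hypothetical helper: people are assigned round robin within their own site.
    """
    first, second = cycle(site_one), cycle(site_two)
    return [(day, next(first), next(second)) for day in range(1, days + 1)]


# The A/B/C/D example: A and C on one continent, B and D on another.
two_days = dual_homed_week(["A", "C"], ["B", "D"], days=2)

# With seven people per site, a week uses 14 different on callers.
week = dual_homed_week(list("ABCDEFG"), list("HIJKLMN"))
people = {name for _, one, two in week for name in (one, two)}
```

Here `two_days` pairs A with B on day one and C with D on day two, and the full week with seven people per site touches 14 distinct people, matching the upper bound mentioned above.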
Some other teams have much more of a congealed or consecutive sort of basis.
So they'll say, okay, well, we're still going to do 12 and 12 or maybe we move the divider between the two on call teams because one is in a slightly different time zone where it's a little bit harder to do this.
So we're going to have some sort of a mercy shift, which is like we do 10 and 14 because of whatever the case may be; you know, sometimes you don't control what nation your second on call team is hired in, because it was a matter of your company's staffing priorities, let's say.
So regardless of whether it's 12 and 12 in a day or it's X and Y that sum up to 24, maybe you do formally say, okay, we're going to have the same person be on call from North America during their daytime plus or minus for seven days at a run.
So that's much more of a different end of the rails sort of set up.
So you say, I'm doing an on call shift for seven days, 12 hours a day, and I have a colleague who's also doing seven and 12.
And there are also variations somewhere in the middle between these two that we've seen as well, which is, for example, not doing daily, not doing weekly, but doing something like either over the weekend plus Friday or Monday.
So like, let's call it Friday, Saturday, Sunday, and then having an entirely during the work week, you know, Tuesday, Wednesday, Thursday sort of set up.
And by the way, I say that with a North American sort of view on the work week, right?
You can imagine modifications for cultural norms in certain nations, right, in certain countries, specifically around maybe days of Sabbath or employment law, etc.
But there are other variations as well.
So like some people say, we prefer on our team to know that we always have the same hours of the day that we're going to be on call for.
So give me whole days.
So like Friday, Saturday, Sunday was my previous example versus Monday, Tuesday, Wednesday, Thursday.
Or some people say, well, actually, there's a benefit to having like a midday handoff just before a weekend or just after a weekend, because we want to acknowledge the fact that sometimes change in our systems.
In fact, very oftentimes change in our systems, which could be reliability impacting changes, occurs due to, well, humans, and humans are diurnal creatures who are awake or asleep and that have, let's face it, very different behaviors during workdays versus maybe weekend or rest days.
And so some people say, let's do a Monday to Friday, but splitting it halfway through what would otherwise be a shift so that I get half of Monday through half of Friday and someone else gets half of Friday through half of Monday in each of our respective 12 hour shifts.
Honestly, I think the difference between, like, a halfsies Monday/Friday split versus a Friday plus the weekend and workdays minus Friday split is minimal.
I think it may be over optimization, but honestly, if there's a thing that resonates with your team or with your org and they would prefer to do that, give them that choice, you know, so long as you have systems that allow you to carefully trade shifts for people so that they can further micro optimize for mutual benefit amongst pairs of people who want to do each other favors, you're going to be okay.
The last thing I'd advise for you to do however is just to say, there is a robot that declares when people are on call and it's all going to happen and you can't change and deal with it.
You have to acknowledge that there are humans all throughout all of these processes and we want to optimize for their happiness and their sustainability to want to come back to the on call rotation.
Part of when I came to Google, I was actually part of a single site on call rotation.
Yep.
That had a 24 hour pager holding.
I can't remember exactly how we split the weeks, but it would be like multiple consecutive 24 hour periods.
Sure.
That we'd be holding the pager for.
And I'm sure there are organizations out there that don't have the ability to have dual-sited teams.
Absolutely.
So what would your recommendation for them be?
Well, to the extent possible, I think the most important thing we need to keep in mind in any on call design, regardless of whether it's split within a day or not, 24 hours a day or 12 hours a day, is that our humans, our on callers, are our most precious resource, and their freshness, their vitality, never mind the service's vitality, is most important.
So for example, one day I came into work when I was, it was like my third or fourth week on the job, and I came in to find and of course this is going to sound like some Silicon Valley stereotype.
Please bear with me for a second.
I found the engineer that I knew the most my mentor sleeping in a beanbag.
There's the Silicon Valley stereotype, but take from it what you will.
Well, when he woke up later, I asked him, I said, like, is that a thing that's allowed?
Like, can we sleep in the beanbags?
And he said, well, I got paged late last night.
And so I still came into work today because I'm on call a little bit later today, but I was doing the company a favor last night and staying up later to make sure that I saw this through.
And I know that if I get paged, it'll page me in this case, in what would otherwise be my business hours daytime.
But I'm going to take a bit of a nap so that I can be much more fresh in case I'm not paged until later today.
I want to do better at that page.
So yes, be easy on yourself was the directive that he gave me.
And I remember that to this day.
But maybe the general lesson to take away from this is if you're going to be doing an all day and all night on call rotation.
I don't think it is sustainable, personally, to be paged multiple times in the middle of the night for multiple nights if the type of work you are doing is not shift work.
Like, I know certain classes of engineering work are like, oh, I'm going to roll on to the night shift and I'm going to roll off.
That's a different story.
But if you are a quote unquote daylight hours worker, 40 hours a week, whatever the case is, but you are also on call, the only thing I would ask of a management structure in that is to have compassion for the family.
The fact that if people are woken up in the middle of the night multiple nights, they're not going to be at their best for later times.
So what might be a compromise here, I would maybe suggest that maybe there's a policy like if you get paged in the middle of the night on multiple consecutive nights that there's a mercy substitution, like you can avail of your colleagues to see if someone will take over the rest of your shift,
knowing full well that most of the time, if things are sized right, crossing your fingers, this won't happen; but in the case that it does, we let you tap out, is essentially what I'm saying.
That would be my suggestion.
But again, given that that is not how things work at Google, I'm merely speculating, and I wish everyone the best of luck in figuring out how to size, shard and trade their on call rotations.
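The speculative "mercy substitution" policy above could be expressed as a small check like the one below. It is a sketch under stated assumptions: the function name, the boolean-per-night input, and the default of two consecutive nights are all hypothetical, and the speaker notes this is not how Google works.

```python
def eligible_for_mercy_swap(night_pages, consecutive_nights=2):
    """Return True if the on caller was paged overnight on enough nights in a row.

    night_pages: one boolean per night of the shift, True if paged that night.
    consecutive_nights: hypothetical policy value, not taken from the source.
    """
    streak = 0
    for paged in night_pages:
        # A quiet night resets the streak; a paged night extends it.
        streak = streak + 1 if paged else 0
        if streak >= consecutive_nights:
            return True
    return False


scattered = eligible_for_mercy_swap([True, False, True])
back_to_back = eligible_for_mercy_swap([False, True, True])
```

With these defaults, pages on scattered nights do not trigger a swap, while pages on two nights running do.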
I really like that you said that the people come first.
Yes.
I'm just using myself as a reference point, but you know, when I first started with on call, I was really nervous about being on the rotation.
You don't want to let the team down.
You want to make sure you're getting to everything.
And I think like it is a good reminder that you have to make sure you are in your best spot, which ultimately will help the rotation too.
Exactly.
It's not like we're like welcome to SRE.
We're going to burn your amygdala out from stressing you out for multiple days on end, and you're going to start smelling colors.
That's not what we're here for.
It's the mythos or the kind of the reputation that we put around SRE at the risk of over mythologizing SRE, which is itself dangerous.
But what we generally tell people when they join the SRE org at Google is welcome to this specialty role.
It's going to require you to grow as an engineer.
If you were a software engineer, now you need to learn some more operational things.
If you were more operational oriented, you're going to learn more software things.
You will become a hybrid role, but not only will that make you a better engineer, we believe, but also it means that we will use you more selectively for the better good of Google products.
And so with that, we're going to make greater investments in you and in your psychological safety, and of course, we're going to treat you well in all other respects.
But not only that, you won't be on your own; you'll be with a team of people who will be your on call colleagues and who share the responsibility of making your product better.
Because essentially, everyone likes to be told that they're special, but we try to embody that in how we treat people.
Because we know that there will be very hard, difficult times, and "I don't know the answer" moments.
And so we try to let people know that you are the best person for the job at that time.
Decisions will be made and systems will be stabilized.
But we're not going to wear you out or leave you stranded on call just because of that.
And looking at the staffing questions, when it's time to stand up a new on call rotation: in the Google structure, let's say the engineering organization has decided that they think it's worth it, that it's a good investment, to have SREs for this service.
And there now needs to be a new on call rotation that did not previously exist.
How do you get from that state, where the developers are the keepers of the knowledge and of all the information about how the system works, to also having these SREs who are ready to respond to an emergency?
Yes.
So there are a lot of different ways to start, or bootstrap if you will, an on call rotation for a product.
And especially if you're going from an on call rotation that was staffed by the developers, where there were no SREs on call, to one where the developers may still take part in the on call rotation, but the majority of the time is covered by SREs.
There are a couple of different ways to do this that will lead to more sustainable, smarter, better, faster results, so to speak.
So let me paint you a picture of a design with a number of things that you can do.
Pick from this, whatever recipe, and I think it will be better than just throwing someone into on call and handing them the pager.
So maybe, with the talent pool that you have, maybe you already have a set of SREs or SRE teams, and you've identified the more senior people in those teams.
But they're working on completely different products in your company.
As they say, you can draw on prior experience; it's not just pro advice in the specific domain, it's like saying, don't put yourself in the hot liquid or you'll get burned, that kind of thing.
If you can, try to seed a new SRE engagement with one or more senior, already experienced SREs, even if they are not in the same product domain; like, hypothetically, let's say we needed to go on call for Google Maps for the first time.
So who are the people you bring in to start the rotation?
They've had experience with on call and with reliability.
So if you have that, I would start with that.
The next important suggestion I would have is that you don't have an infinite supply of experienced SRE veterans to staff your new rotations.
And so, if not your second or third, then your fourth or fifth person is someone who is completely new to SRE.
But you value them because they're a smart thinker, they're an engineer in the making, or they're an engineer in their own experience, but they're an SRE in the making.
When you have people who are learning the product and the SRE role simultaneously, it's important, in my opinion, that you have a very good sense of safety between you and the others.
Ce que je veux dire c'est l'éducation,
et de pouvoir marcher et poster vous-même
comme un SRE,
de pouvoir émuler la pensée,
la pensée scientifique qui va être un SRE.
Tout ça est important.
And the last thing you want to do is raise someone in a vacuum.
I know we've used the phrase "cargo cult engineering" in engineering; I think that's a very weighty phrase, but many of you know what I mean.
We don't want people reinventing SRE in a vacuum, 12 hours ahead of the rest of their colleagues.
Because, at the end of the day, thinking about this from a psychological point of view, SREs are engineers who are also rapid inferrers.
Engineers, you know, they try to measure a couple of times, cut once or twice, build something better, all the things engineers do.
But SREs specifically are the ones you escalate to because they are, in the good sense, rapid decision makers.
And so it's not just learning how to work out why we're serving 500s at the front end; these aren't only technical challenges.
There are behavioral norms at stake when it comes to things like, "We count on our colleagues," or "I can't step away from work because someone's out, so what happens if they get paged and we still have to keep things running?"
That's a very long way of saying: I think it's very important that you have an education component, drawing on everyone in the existing group, to bring up the next set of SREs that you've hired and move the group forward for everyone's benefit.
Not only will you accelerate the journey for the newcomer, but there will also be an increase in trust among the rest of the team.
Ideally, you don't have a classroom style, a lecture style of "SREs, no, you can't do it that way."
Certainly not.
But you want people to be able to see your most senior people succeed or fail at a theoretical exercise several times, before they take the controls themselves and before they go solo on call.
So, when I say that, my recipe for staffing an SRE on-call rotation is clear.
The first person should be very experienced.
And the point I'm making with this explanation is that the second should be someone who is relatively senior, but is also good at teaching the craft to someone else and at helping to seed more of the cultural norms across your team.
And then the third and beyond can be senior talent, if you have it, or junior talent, if you want to build it.
But however you slice it, you need to be sure that you have people who are good at norming the team as you grow.
There are other considerations in play, too, once you have a 24-hour rotation, or a 12-12-hour split, plus or minus.
How do you make sure that you stay in sync, that the sites feel balanced and not dominated by one another, or that decisions are made at the same time as the developers, or something like that?
There's a lot to that.
But in any case, in staffing the rotation, I look for the super-seniors as exemplars of the craft, then the teachers, who are often also exemplars of the craft, and then a mix of people from different kinds of backgrounds for everything else.
And, by the way, I'd encourage you to tell yourself that it's OK if an SRE comes from a quite different product background.
If anything, as our biologist friends would say, it comes down to hybrid vigor.
So it's fine if you have a front-end-oriented person and a back-end-oriented person working on the same SRE team while the new SRE service is a storage layer that is neither front end nor back end.
That's actually one of the best cases for eventual robustness.
Could you go into a little more detail about creating an opportunity for everyone on the team to build their own confidence in their teammates' abilities?
Yes, and doing it through a theoretical exercise, rather than on the real product, yes.
Yes, can you talk a bit more about what that might look like for a team?
That's actually one of my favorite things to talk about.
I'm very happy to be here.
Part of being in this tribe, if you will, of SRE is the oral culture.
Part of that is the lessons you've learned from your predecessors and from your colleagues.
And so one of the best options for building good muscle memory in diagnosing problems, in applying the scientific method, and in being on call, is to practice the craft in a low-stakes or no-stakes environment.
In some circles, especially tabletop-gaming circles, you have game masters who guide you through a multiplayer adventure.
You're in a maze of 500 error codes and Prometheus consoles.
What do you do next?
"I go left."
OK, what do you see there?
If you've ever had your own multi-user game, it's that kind of environment.
Other professionals, in security for example, also use the term "tabletop exercise."
It's like, let's see what happens if a certain country invades another country.
What do we do?
Who do you call?
That kind of thing.
At Google, we have what we call the Wheel of Misfortune, which is just a way of saying that we do a tabletop.
And what we do is, many teams have a weekly or biweekly meeting that they run, whether they call it Wheel of Misfortune or something else.
It's a tabletop where the game master has been thinking, since the last game, about an interesting story and a diagnostic path that they want to socially norm for the rest of their team.
They want to inoculate them against the "I've never thought of that before" problem.
So they'll come up with trigger conditions, actions that have to happen in order to save the universe, or to save a product from certain ruin.
So maybe the whole team gets together in a big room with a whiteboard, or a computer that has a synthetic copy of your service, a stack.
They'll say, "OK, so today it's Sally.
OK, Sally, welcome; you're our brave on-caller for today.
The backup on-caller today is me.
You've been paged because you got the following message," and it's something authentic to how you get alerted and to your company's monitoring systems.
And maybe you have screenshots of something real, or maybe you make them up.
Maybe it's all verbal.
So the person says, "OK, Sally, you've been paged for 500s in Asia-Pacific.
Where do you go to validate that?
What do you click on, and what do you see?"
And maybe Sally says, "OK, so I'd open our Prometheus dashboard, and I'd filter it so that I'm only looking at the Asia-Pacific region, and I'd look at what the clusters are doing.
What do I see, Andrew?"
And as the game master, I'd say, "You see some very cyclical traffic over the last 8 days.
Latency looks normal.
But there's this little spike here in the very specific cluster that we brought up in India."
"Oh, tell me more."
And so there's a conversation, and now everyone in the room is hearing this.
And I think there are very different sorts of cognitive processes going on.
So, for example, a very senior person is going to be like, "Oh, Andrew gave us the good old 500 errors and an APAC trick; I think this is from an issue we had 3 years ago, and he's doing a rerun."
But maybe there's a junior person, and they're going to say, "Oh, I wasn't thinking about it the same way Sally is.
That's cool.
I learned something from where Sally went."
That sort of thing.
Now, don't forget that the volunteer, this Sally person we're talking about, it's OK if they get stumped.
Because that also goes to the norms of the tribe.
If you don't know the answer, it's OK.
It's not, "OK, now endure the ritual embarrassment and get trounced by the team."
Rather, it's, "OK, let's talk about it."
If someone else has a suggestion, a good game master is very good at making sure it isn't always the same ten-year veterans answering the questions.
You're instilling this rigorous scientific thinking in your team.
Bonus points if your game master does something forward-looking.
For example, the team is going to be going on call for a new feature.
OK, it's the Maps team.
There's a new routing feature where you'll be able to get directions by skateboard, by walking, or by bike, let's say.
Maybe one person on the team has worked with the developers on that feature.
And they say, "The following scenario happens 3 months in the future, when we're on call for the back end."
And people say, "I don't know what that back end is."
"Great, we'll learn together."
It's a good way to build trust, memory, more neurons for rapid inference in the decision making that you'll face when you're on call.
And it sets you up well for the future.
I'll stress this, and I'd recommend it: you don't have to have played role-playing games as a kid to participate well in, or to proctor, a tabletop.
You may have other interests, more communal opportunities that you could bring into your work, and you're fascinated by how other people learn or how you want to learn.
Absolutely.
I love that, too, and I just want to mention that at Google we have SRE EDU, which does a form of this.
It's, I think, a training program for new SREs, and you get to do one before you're thrown in the water.
Exactly.
I co-founded that group, and it was very important in our design that we wanted people to have a trial run for the fireworks, so that they aren't afraid the first time.
When you join Google, without giving away too much of the secret sauce, I'll say that one cool thing is that on day 2 or 3 on the job, you get to be on call for a fully hypothetical service.
There are no real users; it's robots doing the clicking.
And you learn Google's technologies and how they break, and why this page is on the table.
And before you know it, once you've been given a lecture on ... and Google's production environment, you're ... part of ... the network that your stack runs on.
And that's cool.
So by the end of the week, you've saved the universe four or five times, with no real consequences, because this universe is synthetic.
But it feels a little bit real, and it gives you opportunities to ... without being embarrassed.
That's important, and it should be normal, because you have the incentive ...
... broadcast, is now reaching large audiences.
And we'll continue the conversation.
Right.
If people have questions about what we talked about today, come find me on LinkedIn or something like that.
And I'll be happy to answer questions.
But it's fascinating to me to see how the art and science of SRE have evolved, at Google and everywhere.
It's not a cliché when I say that I'm really excited to see what happens next.
I'll be there, and I'll try to be mindful of what a pleasure it has been.
Right.
As I think we all are.
So, again, ...

GoogleSREProdcast

SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!