
Life of An SRE Episode 1: Tom Cranitch and Megan Yin
Duration: 27m14s
Release date: 12/09/2023
How does one become an SRE? And what’s the career like? In this episode, Tom and Megan discuss their path to SRE.
Hello and welcome to the very first episode of season 2 of the Google SRE Prodcast, or as
we affectionately refer to it, the Prodcast. I'm your host MP back from season 1, and on
this season, we will be exploring the life of an SRE. And to do so, we will be speaking
with individual contributor SREs and members of SRE management from all across Google and
all levels of tenure. If you are interested in becoming an SRE, growing your career as an SRE,
or just want a little bit of a peek into what makes SRE culture here at Google tick,
this is going to be the season for you. And I do have one thing to touch on before we dive
into our first interview. Viv, my co-host from the first season, has decided to jump into the world
of responsible AI here at Google, and is no longer with us in SRE. We are very grateful for her
contributions to the podcast, and we wish her the absolute best of luck in her new role. In light
of that, and in keeping with our theme this season of trying to share as many voices as we can,
I will be joined by five other SREs across the season as my co-hosts. So here for episode 1 is
my very first co-host, who I am very excited to introduce. Go ahead.
Hello world. Sorry, I've always wanted to do that. My name is Pamela Vong. I also go by Pam.
And currently I am a senior site reliability engineer in the Geo team. I have been at Google
and been an SRE for almost one and a half years now. So I'm still very fresh and new to all things
SRE. But prior to joining Google and SRE, I have been mostly a software engineer for the past almost
15 years on a variety of technologies, and I feel a little embarrassed about not knowing what site
reliability engineering even was two years ago from the time of this recording. And at that time,
when I was learning about it, I wish that there was a podcast for me to listen to, to get the gist of
what this whole area of engineering is like and what the people are like and the culture is like,
because I listen to a lot of tech podcasts and I've been a huge fan. And that's how I usually get
my news and information and how I learn about everything that I do. So I was super excited in the
beginning of 2022, when the podcast got released, and I binge listened to every episode, and was also
equally excited to hear when MP was looking for volunteers for season two. So here I am. And thank
you for having me.
So grateful to have you as both a listener and cohost. So for our first episode, we are
going to be starting with some early career SREs here at Google, and to hear that very fresh-eyed
experience of SRE. So let's go ahead and have our guests introduce themselves.
Hi, I'm Tom. I also work with Pam on the Geo team, but over here in Sydney, Australia. Nice and warm at the moment. I've been at
Google for about 10 months so far. And I joined as a new grad.
Hi, I'm Megan, and similar to Tom, I'm also about 10 months into my career at Google. I joined out of
college, and I'm working with MP on the Play SRE team as a software engineer.
Thank you both for joining us.
Megan, let's start with you. Could you tell us about your experience prior to joining Google?
Yeah, for sure. So prior to joining Google, I was a student at Cornell University. I had done a couple of
software engineering internships prior to joining Google. When I was in the interview rounds with Google, my
recruiter did suggest SRE to me as a potential avenue to go down. And I, like you, was not very
familiar with SRE at all. And I had done some reading. I think there's an SRE book out there for
public consumption about SRE. And I was pretty interested in what SRE is. It's very different from my
experience in software engineering. So I decided to pursue that. And ultimately, this was also how I ended
up in the San Francisco office.
So Tom, tell us a bit about your experience prior to joining Google.
So I did my undergrad in math and computer science up in Brisbane at the University of Queensland.
Through uni, I did some work at the Defence Science and Technology Group, which is the research division of the
Department of Defence. I was doing formal methods research there. So looking at program safety. So it was a bit of a
change coming into SRE, I think, coming from a more like formal research environment. I sort of got to the last
year of my undergrad degree. I was going to do an honours year, which for folks in America, I guess is sort of
like a masters. It's a transition we have to get into a PhD. And I decided in my last year that I wasn't keen to do
another year of study. And so I started applying to places and I got an offer from Google.
Did you know about what SRE was prior to applying to Google?
Not really. I had a similar experience to both of you guys. I applied to a job at Fitbit actually. There was a new Fitbit
team in Sydney. My recruiter then asked me if I was happy for any job at Google. The obvious answer to that was yes.
So I went through the interview process and I got to the end, and during team match my recruiter floated the idea of SRE and
similar sort of experience. I looked into it more and I really liked the systems focus of it, the large scale
system design, particularly I found that really interesting. I got an offer from a SWE team in Sydney as well as
Geo SRE. And like after speaking to both of the managers, I just found the SRE work was more aligned with what I wanted to do long term.
Megan, how did you feel about SRE when you learned about it and what made you choose SRE?
I was pretty keen on choosing SRE because it's very different from software engineering. I really liked that in SRE you're exposed to a large surface area versus a developer.
You probably get really deep knowledge in like maybe one binary. But as an SRE, you can kind of see like the whole forest that is, in my case, the Play Store.
Also as an SRE, you're kind of solving different problems all the time versus I thought that as a developer, the work was very repetitive and that you're kind of doing design, build, test, repeat.
And that was kind of what was interesting to me about SRE.
Was that design, build, test, repeat, the experience you had during your internship?
Yeah, I did have that experience sort of through my internships. I kind of felt like I was just kind of like building APIs and kind of just going through the cycle of testing them.
And that was also sort of what I experienced through my like software engineering coursework in college.
What was it like for you both to transition from academics to your professional career?
What's the most challenging thing you've had to come across?
For me, the most challenging part, I think, was just like the pure scale.
Like uni work and the research work I did, it was like, not necessarily super well scoped, but it was work that you could do sort of in the period of weeks or maybe a month or two.
And you could really get your head around things relatively quickly.
A uni course was scoped to a semester and in that time you're meant to like learn all the content, do the assessment, all that kind of stuff.
And I found joining Google, like one of the first things my manager said to me was, I don't expect anything from you in the first six months.
And universally, I think that experience is something that most people get at Google.
Just like the pure scope of everything here and how long it takes to get your head around the systems that you work on.
I think particularly in SRE, there's so many systems.
Geo has something like hundreds of binaries, right?
And then also the developer tooling and everything, like everything at Google is new.
And just the amount of time it took me to get my head around.
I mean, I don't have my head around it still, but like to get to the point where I had my head around it enough to be productive.
That was really new for me.
I just want to also echo what Tom said.
A lot of it is just being able to thrive in ambiguity, which is like a huge phrase that we use within SRE.
Because SRE has such a large surface area, there is a lot of ambiguity in that and there is a lot to learn.
And just like what Tom said, the first six months is a lot of learning versus like in college, like six months, you're probably done with the whole course by then.
So yeah, transitioning to that and just being able to constantly learn while also still producing was a huge transition.
You've both had an interesting transition into your careers also for another reason.
You came into Google during the time Google was returning to office.
How did that impact your onboarding?
I'm actually like super grateful that I was able to onboard in the more return-to-office time.
I definitely really appreciated being able to see my co-workers in the office and just being able to kind of like go up to their desk and be kind of like, "I'm running into this issue. Could you like help me out with that?"
My internships were all pretty much virtual.
And it's really great being with co-workers in the office.
And especially when I was shadowing on call, which is a huge part of the SRE role at Google, it was really great to see my co-workers triage an incident in real time and just be able to go up to their desk rather than having to set up a meeting to kind of see how they were handling the problem.
Yeah, I would definitely echo that.
I joined a few months before RTO, our RTO in Sydney was a little bit later than in the US.
And coming into the office initially, it was kind of eerie, right?
Like, there was no one around except for my team.
I was quite lucky that I think that my team was in the office.
I really enjoyed coming into the office, and just having those in-person interactions I found so valuable.
And just, I think part of it as well is just like overhearing what other people are talking about.
You get so much context from the conversations that people are having around you, I think.
And definitely echo, just being able to turn to someone and ask a question, instead of having to try and perfectly word this message you're sending to them because you've met this person once online
and you don't want them to take your message the wrong way and all that kind of stuff.
Just being able to turn to someone and ask a question.
And then if it turns into a bigger thing, you can just go to a meeting room and have the conversation.
But not having to balance those online interactions with people you don't really know.
It's super valuable.
Yeah, those magical hallway conversations are huge.
You've mentioned ramping up to go on call.
Megan, do you want to share what it's been like doing on call versus the project work you get to see?
Yeah, for sure.
So like I said, being on call is a huge part of being an SRE at Google.
I've seen that we tend to budget a lot of time towards being on call and following up with on call.
And I would say it took me about six months to get ramped up for that.
I got ramped up for that by doing some practice exercises and also just shadowing and reverse shadowing.
The difference there is that if I reverse shadow, I'm in the hot seat and someone who's more experienced is there to support me.
So yeah, I was doing that for about six months and just getting an understanding for the system and the critical flows.
And part of what was really huge about that is learning which pages should you definitely jump for, for example, if they're revenue impacting, then those are probably things that you should probably escalate and learning that it's OK to escalate and to involve other people was definitely a huge learning for me.
Yeah, I had a similar sort of experience.
We have a relatively quiet pager, for better or for worse.
So for me, my ramp was a bit faster.
I got to the point where I was shadowing and I was shadowing for a little bit and we didn't get any pages and my manager was sort of just like, you can go on call.
I felt a little bit like being pushed in the deep end.
But there's the safety net, which I think is hard to trust.
But like my manager, obviously, being here a bit longer knows that there's that safety net there.
And it's like, it's safe to have someone with not heaps of experience being on call because there's these escalation paths that were there to save you.
But yeah, to ramp up for on call, we did a lot of ticket work.
So like slow burn pages go into our ticket queue.
And then as a team, every week, we sit down and spend about an hour looking through them.
And that was really helpful for getting me ready for on call.
Because we don't get a lot of pages like these are the closest things we have a lot of the time to real incidents.
And they are, it's a productive exercise as well, because these are things that are affecting production.
Like they're real slow burn alerts that need to be fixed by someone anyway.
That was definitely like super helpful for getting me on call.
For both of you, what was it like adjusting to the real time aspect of having to be on call for a production service?
Because you think about academic work or even normal software development work, you're not really in that position where five minutes, 10 minutes, 15 minutes, an hour really matters.
But then you're on call for these large production systems and those five minutes, 10 minutes can be really important.
And I know I've been supporting production systems for seven years now.
Like I'm kind of desensitized to it.
So what's that like for both of you having to make that adjustment into urgent response?
Yeah, I actually really enjoy it.
We had SRE EDU, I did it, I think, in my third week or so.
And there's these breakage scenarios, right?
And there's this fake service and things break and you have to try and fix it.
And I found it really enjoyable.
And I've continued to find pages to be quite enjoyable.
I think partly like I enjoyed exams, which is like the closest thing you have to these time pressure scenarios at uni.
And I found that that time pressure I quite enjoy.
So that aspect I actually haven't hated.
One thing I do find a bit stressful is for us, at least my experience has been that a lot of my pages are manual pages.
So something has broken for someone in a service that is related to us or something has broken in our service.
And we haven't been alerted, but some other team's experiencing pain because of one of our services.
And they turn to SRE as sort of like production experts.
That I found really scary.
There's this person who's waiting on you for their service, which is broken.
And there's like pressure of someone waiting on you.
I actually found that worse than these automatic pages where even though the revenue impact there may be worse and the user pain may be worse.
The fact that you know that one of your colleagues is waiting on you to reply to them, I found that almost more scary.
Particularly as a new grad, this person who is far more experienced, most of the time, is waiting on me with five months' experience to sort of get them answers.
Coming from my perspective, my team, I guess, has a little bit of a heavier on call.
We do get quite a few pages that are more revenue impacting.
And I remember the first time I dealt with a revenue impacting large outage was when I was reverse shadowing actually my tech lead.
And that was incredibly nerve wracking to me to see so many people kind of get involved.
And to also be cognizant of the fact that Google was also losing money for every minute we were going through this.
However, my tech lead said something very interesting to me.
He said that SRE is for the people who tend to run towards fires instead of away from them.
And that is, I do feel, entirely true.
And it's really one of the more rewarding aspects of this job.
I would say that real time aspect, like you said, MP, it's really interesting to see how these breakages kind of affect people, but also know how you can triage them and fix them.
Another big learning for me was, like I mentioned before, just learning which pages and incidents are worth getting stressed over.
And obviously, like the ones that affect our critical flows, or are revenue impacting, are definitely ones that you should be a little bit more stressed about, but more one-off pages for tasks that are dying or something like that,
You can maybe take your time with.
And that is huge.
And also, just like I said, knowing that you can always escalate and people are always happy to get involved, especially if money is on the table.
I think one of the things that is fairly well known within SRE, but it's not really even spoken about that much, is how much authority is delegated to the on-caller.
And there's this really wild dynamic that I'm sure both of you have started getting used to that you can have managers and directors, like looking at you for guidance for the answer.
And you might have just been on the rotation for a few months at that point.
And you have to respond to that.
You have to move everyone forward and be that point person.
I remember having this conversation, we had someone senior from the US come and visit us a few months ago.
I had been on call for a month or two at that point.
I remember having a very similar conversation with them.
It's scary.
The whole company puts so much faith and trust in you as this person who has been at Google for like five or six months.
Google Maps, right?
Like how we have however many users.
And the whole service is relying, or I guess, we have four SRE shards.
So there's four on-calls for Google Maps in SRE.
Basically, the whole service is sort of resting on these four people.
And your portion of that, the services you're on call for, which are all super important.
The whole company is trusting that you will keep them alive.
And like, if not, like, obviously, there's escalation paths and stuff, but I still find that shocking.
Just like the amount of faith and trust everyone puts in SRE on-callers in general.
Yeah, I would say like something I like to think about SRE is sort of like the guardians of Google production.
It's a way I like to think about it.
And yeah, it's an incredible honor to have this much responsibility placed on you.
But, you know, with power comes great responsibility or I forgot what the quote was.
Spider-man quote.
That's also the sudoers.
The first time you type sudo on a system: with great power comes great responsibility.
Yeah, so I do feel that every time like somebody sends me a code change to approve that,
that code change could definitely come back to page me if I don't put the necessary review into it.
But it's also a super forgiving culture, where things do break, and we also have the great support that the SRE community has.
We all know that we have so much responsibility placed on us, but we're also super collaborative
and it's always OK to escalate if you don't know the answer.
I would definitely second that.
Like, as soon as there's an incident in the office, the number of people that surround the computer,
you walk past another team and you can tell if there's an incident going on
because there's suddenly eight people surrounding this, like one monitor.
And like, everyone's there to help each other.
And like, as soon as we have a page in our team, like,
everyone's on their computer looking at dashboards and like sort of sharing.
And it's super helpful for like us as well.
Like, the person who's not on call because you get practice dealing with pages.
But like, this like collaborative thing is definitely it's been like super rewarding.
I think that's definitely one of the highlights that drew me into saying yes to SRE.
The whole aspect of this larger team that is going to stand behind you and help you,
but not blame you if something goes wrong.
Going through the SRE book, the whole culture of blameless postmortems,
I think really spoke out and resonated with this empathetic culture
that exists in this side of engineering that I don't think I have seen
in all the other aspects of my engineering career before.
Yeah, the blameless postmortem culture is something that sounds great on paper.
And then you see it, it's like, I think it's really hard to trust
until you see it in practice.
And like, you see it over and over again, that this is actually so deeply embedded
in SRE culture at Google, I think across all of SRE culture at Google.
I think it does probably come out of SRE.
And you do see it reinforced over and over again.
I think it takes a while to trust that there is this blameless culture.
Yeah, exactly.
I think the super good approach to it is to always blame the system
because if any one person could have taken down production,
then it's a fault of the system, not that individual person.
I mean, maybe we need more safeguards in place.
Maybe we need to review some of our production policies,
but that way you can kind of make it so that this doesn't happen
in the future rather than placing the blame on any one person.
I think you start to learn over time as well
that it's not very productive blaming anyone.
I trust everyone that I work with.
I trust that they're not doing anything malicious.
And I trust that they made the best decision that they could at the time
in the same way that they're going to trust that I made the best decision
I could at the time.
And blaming people isn't productive, because there are always going to be changes going out to production.
But if you can find faults in the system, those are concrete things we can fix to improve it.
And even if a person was at fault, you want to take the human decision out of it.
Oh yeah, 100%.
Humans are naturally fallible, and from time to time we make mistakes.
And I would hope that the system should be able to catch that.
I'm curious about something Tom said.
Tom, you mentioned learning to trust the blameless postmortem culture.
I'm wondering if you have any other thoughts on that you'd like to share.
Yeah, I think maybe this comes from uni.
There's this culture of individual work and of grades.
And you're assessed on everything that you do.
And so, coming into a work environment, project work is like this too, but particularly on call, it's this kind of team effort to keep everything running.
And no one is going to give you a grade on how you do.
Even if you're the single on-call, it's really a group effort to keep everything running.
And in the same way, I think, with the escalation paths, it takes a long time.
It took a while for me to trust that no one's going to blame you if you escalate or if you page someone.
Geo has a team that we can escalate to if we feel that we can't handle an incident.
And no one's going to blame you if you escalate.
That's what their rotation is for.
And I think it took a while for me to trust that there's this blameless culture with these escalation paths, where no one's going to blame you if you escalate.
Yeah, exactly.
I believe the philosophy is always to not be afraid to escalate, and to escalate sooner rather than later.
And no one's going to blame you if you don't know how to solve something.
It's very collaborative.
I think escalation is also one of the best ways to learn how to handle pages.
You have someone who is more experienced than you that you're working with.
And it's a great way to learn how they handle pages and become a bit better of an on-caller.
Have you needed to make any adjustments outside of work just to accommodate the stress of supporting live production?
I do think Google has a good work-life balance.
I think I go out a fair bit.
And I would say that on the days I'm on call, my shift is usually 11 a.m. to 11 p.m.
And those days, when I'm on call, I can't go out as much.
But I've come to appreciate those days, even when they're on weekends, because it gives me a moment to stay home, relax, and a lot of things like that.
I've found myself exercising quite a bit more this year.
And I've found that really helpful.
I think it's a mix of general work stress.
Google has excellent work-life balance, but on-call times are a bit stressful.
And I've found exercising really helps me deal with that.
The forced downtime for on call, I've found helpful.
I can still go and do things and take my laptop with me.
But I just have to be a bit calmer on those days and not do anything too intense.
That forced downtime has been really helpful.
It's been great.
What are you most looking forward to as you grow your SRE careers?
I'm fortunate enough to work with super talented and super tenured engineers on my team.
It was a little daunting at first to learn that a lot of my co-workers have been at Google for five years and have a lot of ownership and a lot of knowledge of production.
It's true that, as an SRE, there's a ton to learn, and it takes a lot of time to do that.
But I'm really looking forward to all the growth that I have in the years ahead as an SRE at Google, and just the incredible growth that can happen.
I have a lot of role models in my team members, and I look forward to learning from them.
It's a great honor, and I look forward to filling their shoes one day.
I would definitely echo the growth.
The amount of growth that I've had has been a bit wild.
I thought you'd learn fewer things when you go into the workforce.
But I haven't found that to be the case.
I'm looking forward to what's coming.
In the future, Geo is redesigning its system, and SRE has a big hand in that.
And I'm really excited to work on that kind of shaping the future of our systems.
That's the kind of work that I'm really excited to work on.
Thank you all for joining us.
It was great.
It was a pleasure, all of you.
Thanks, Pam.
Thanks, MP.
Thanks for having me.
Thanks.
GoogleSREProdcast
SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!
Next episode: Life of An SRE with Shannon Brady and Theo Klein