
SRE, a Basis of Influence with Amy Tobey & Vlad Ukis
Duration: 41m2s
Release date: 04/09/2024
In this season of Google Prodcast, current and former SREs, both within and outside of Google, chat with hosts Steve McGhee and Jordan Greenberg to discuss software systems designed and built by SREs.
For "episode zero", guests Amy Tobey and Vlad Ukis will set the stage for the season with a lively discussion about what Software Engineering means to Site Reliability Engineering.
This season is produced by lead producer Sunny Hsiao and sound engineer Paul Guglielmino, with contributions from Salim Virji and Florian Rathgeber. Special thanks to MP English and Jenn Petoff.
Welcome to Season 3 of the Prodcast, Google's podcast about site reliability engineering and
production software. I'm your host Steve McGhee. This season we're going to focus on designing
and building software in SRE. Our guests come from a variety of roles both inside and outside
of Google. Happy listening and remember, hope is not a strategy.
Hello everyone and welcome to Google's SRE podcast or as we like to call it, the Prodcast.
This is the first episode or as we like to say the zeroth episode in season 3.
Today we have Vlad and Amy who are going to help us out a little bit and we're going to start off
our season. This is going to be the third season. We're going to be focusing on software engineering
in SRE. And the topic we're going to kind of focus on today is going to be what happens when
software engineers solve production problems and what's the importance of the software engineering
mindset within SRE. And that by no means says that software engineering is the only thing or
like you should only be writing code 100% of the time. Certainly not. In fact, I think we'll kind
of get into that a bit. But before we do that, why don't we have our guests introduce themselves.
So let's go alphabetically because that's the best. Everyone likes it. And Amy.
Yeah, hi folks. I'm Amy Tobey. I have been doing tech in what we call SRE for about 25 years.
So before that name came around, that's kind of what I was doing. Very quickly out of college.
Today I work at Equinix as a senior principal engineer where I meddle in product work and SRE work
and software work just all over the place.
Awesome. Thanks Amy. And Vlad.
Hi, hello. Thanks for having me on the podcast, or on the Prodcast. My name is Vlad, and
I work for a big company called Siemens Healthineers, which is the healthcare arm of Siemens.
And there I am responsible for the so-called teamplay digital health platform,
which is a software service platform for healthcare applications and digital services.
And this is where SRE comes in very handy because you need to operate the platform somehow.
And this is where we heavily use the SRE methodology. And I'm looking forward to
talking about it with all of you on the podcast.
Well, thank you both for coming today. Just to get started, I think we want to start actually
with the kind of the inverse. So specifically, we have a question here for Vlad, which was,
how do operations kind of serve as the core part of SRE, which might feed into software
engineering later on? Like, what do we even mean by operations to begin with?
Right. So in my view, SRE is a methodology for running digital services reliably at scale.
It's about running the services. So we are firmly in the operations arena.
But the curious thing about SRE is that it applies a lot of software engineering methodologies
in order to achieve that reliable operations at scale. So it requires you, for example,
to implement some sort of SRE infrastructure that enables development teams to operate their
services with acceptable cognitive load. And it then requires the software engineers
to actually use that infrastructure in order to do the operations of the services.
So it kind of requires the operations folks to implement something as
software engineers to be used by software engineers, and then requires the software engineers
to go on call for their services for an agreed period of time of the day.
So kind of working as operations engineers. And I think this is what makes it interesting.
And this is where we see the software engineering kind of making its way into SRE.
So that's on the one hand, I'd say, but also on the other hand, if you take a broader view of
what SRE is trying to achieve, it's then actually reliable and scalable services.
And that means you need to kind of weave in that SRE thinking from the beginning to end
of the service lifecycle. So that means you need to apply software engineering things like,
you know, what kinds of stability patterns do we implement now into these services?
You know, where do we put circuit breakers? Where do we need bulkheads and so on?
So you need to kind of have some solid background in software engineering in order to
do this across the life cycle.
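To make the stability patterns Vlad mentions a little more concrete, here is a minimal sketch of a circuit breaker. It is an illustration only and was not discussed in the episode: the class name, the thresholds, and the injectable clock are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load onto a sick dependency.
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a flaky dependency, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.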
Totally agree. Amy, how do you feel about Vlad's definition so far?
I like most of it. I tend to take a little bit more expansive and kind of
sideways view of it, because what I've seen a lot of in the field is a lot of people look
at SRE and they overemphasize the software part. And then we talk to our folks,
our friends at Google, and they're like, no, no, it's really about what a software engineer
would do in an operational context. And I think they're both wrong, because when I look
at the effective SRE that I've done out in the real world, and I'm just going to say
Google's kind of not like the real world, it's this magical fairy place where people get to have
fun and have cool tools and good software tools. It's a real place. But like what I'm
getting at is that the real essence of what we're working on is these feedback cycles
kind of across the software engineering lifecycle. So starting at the very beginning,
we want to get involved in product management and in kind of product conception and start to
insert reliability and security, because it's always on the way for me, requirements into the
product process. And then all the way at the end, if there is an end, well, there is an end when
we actually shut the service down, but like maybe that long middle where we spend most of our time,
where we're constantly looking at where can we improve the feedback signals, and we implement
that through a bunch of different ways, like SLOs, going on call, all of the DevOpsy kind of stuff
that we try to integrate, CI/CD. It really comes down to like how do I tighten those feedback
cycles between what my customer is experiencing, what my software engineers are doing, and what
my leadership believes we're all doing.
I think that last bit is actually a great one as well,
which is what does the leadership think we're doing. One kind of take on this historically
has been like the idea of not accepting 100% reliability as even a reasonable goal.
The jokey way I like to describe it is that if you're asking for something that's unreasonable,
you're just telling people to lie to you. And this is what happens with teams that are
given unexpected or sorry, unachievable goals like 100% availability or reliability in some way.
It's true, right? Like you look at a 100% stat and you peer at it long enough and you stare through
it and go look at the actual metrics, somewhere, some statistical significance is being dropped.
Yeah, or it's 100% except for X, Y, and Z, because we're just leaving those out,
because I mean, come on, you wouldn't want to leave those in.
Because they're just not convenient.
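The "100% except for X, Y, and Z" effect can be shown with a few lines of arithmetic. This is a sketch on made-up traffic, not anything measured or described in the episode; the `Request` type and its `excluded` flag are hypothetical names for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool
    excluded: bool = False  # hypothetical flag: "we're just leaving those out"


def availability(requests, honor_exclusions):
    """Fraction of counted requests that succeeded, or None if nothing was counted."""
    counted = [r for r in requests if not (honor_exclusions and r.excluded)]
    if not counted:
        return None
    return sum(r.ok for r in counted) / len(counted)


# Made-up traffic: 990 good requests, 10 failures that are all tagged as excluded.
traffic = [Request(ok=True)] * 990 + [Request(ok=False, excluded=True)] * 10

reported = availability(traffic, honor_exclusions=True)
experienced = availability(traffic, honor_exclusions=False)
print(f"reported:    {reported:.1%}")     # reported:    100.0%
print(f"experienced: {experienced:.1%}")  # experienced: 99.0%
```

The dashboard number and the customer's number differ only because of which requests were allowed into the denominator.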
Yeah, that's right. Have either of you encountered a situation where something like this is going on,
where the metrics look great and when you dig into them further, it turns out something funny
is going on? If that has happened, what do you do? What's the thing here? Is it just talking to the teams in
general or talking to the leadership teams and saying, hey, don't do that? Or is there something
more subtle here?
Well, I think that happens all the time with the definition of SLOs. So we go
to a team and they're totally new to the party. So they've never done any SRE and you start
advising them, okay, so maybe we would select some SLIs that would apply to a service and then
inevitably they come up with something like availability, right? That's kind of important
for any service. And then you work with them and then they kind of come up with an initial set of
SLOs and they think, okay, so our service is so and so and so available.
And then what happens is that, well, you feed all this into the SRE infrastructure and then
a couple of days later, everything is red because all the SLOs have been broken,
right? And you kind of come to them and say, okay, now have you looked at the SLOs? First of all,
no, because they're not trained yet to do this, but then you show them, okay, so all the SLOs
have been broken. What does that mean? Well, actually, the customer hasn't called up and didn't
say that something is wrong, right? So there you've got already the cognitive dissonance,
right? So the SLOs have been defined, they are all red, but the customer actually didn't
experience it as being so painful, right? And that kind of happens all the time until you get
the team to a point where those things converge, right? So the definition converges with the user
experience and when it's red, then it actually means the user experience got broken. And when
it's green, it actually means the user experience is okay. And once you get them to a
point that this is an ongoing refinement process, then this is where you know, okay, you are
getting somewhere.
Totally. I have run into that. In fact, there's still a dashboard I'm
working slowly to eliminate that shows a number of services' SLOs,
measured by an instrument that I don't particularly find very reliable,
because it doesn't represent the customer experience. And there's no technical problem here,
which is actually what's really interesting about it, right? Like, technically the system works as
designed, it is instrumenting as designed, it is reporting metrics that are actually somewhat
representative of what this system is doing, but have almost no bearing on the customer experience,
that's the first problem, right? But the way I approach this is not to go and start poking
at the technical stuff. I tried that for years. And basically, like every time I poke at it,
they're like, well, everything's working exactly as we designed it. Well, yeah,
'cause you designed the wrong thing. Right? But the thing is, it's been in place for years.
And so this is, this is my world, right? It's been in place for a long time.
And so every time I try to approach it, be like, I think we could do this better. Everybody's like,
oh no, it's been there the whole time I've worked here, it's fine.
So this is really more where we start to get really deep into leadership and having to go around
to the product managers and not just the people who own those particular,
like the monitoring system there, but like each of the product teams, product managers,
and point out to them how broken this is, and then go back through their incidents and go like,
this one, this one, and this one, and this one, and this one, and this one,
all caught you by surprise because your monitoring doesn't actually work,
even though you go look at this dashboard, right? But that still, you would think,
would be like, you would show somebody this fairly obvious to us truth,
and then they go, wow, I got to fix that. Except there's so many other things going on that largely,
they're just like, oh, okay, yeah, that's really important, but I'm going to go work on
this other much more important thing. And so it's still there, and we're still chipping away at it,
right? And we've got most of those people starting to do higher level monitoring so that they kind
of come through this journey where instead of me telling them that it's wrong, basically,
we have to take them down this longer road where through like instrumenting with OpenTelemetry
and using more modern tools, eventually these folks will come around and go, oh, yeah,
that's really bad. And then they'll tear it down, but we're just not there yet.
This sort of reminds me of like, this is a kind of a high level,
hand wavy definition, but like the difference between like programming and like professional
product software engineering, you know, you can technically do programming just fine,
and like, it'll do what you say, like, but if you're not designing the right thing at the right
time for the right people and getting feedback and doing the whole thing and like working with
your peers and like collaborating, and like, there's a bigger, bigger thing over there,
which is like the, you know, capital S, capital E Software Engineering.
We put a W in there sometimes for some reason, kind of funny.
I tell people all the time, right? I write code for free.
Yeah.
I get paid to do JIRA.
So there's a trend going around these days around platform engineering.
And sometimes we talk about like internal product management.
One way that I've heard SRE described is as just another set of influences,
or influencers, maybe, to that internal product or to that platform, in terms of like,
you know, representing the reliability or the security of the things that are being
shipped through that internal platform.
Do you guys have any takes on that or experience in kind of thinking about reliability
through this internal platform concept, or does it just kind of happen that way?
Well, I would say that the SRE infrastructure that you're building,
ideally, would feed into the bigger platform that you design for your engineering organization,
would kind of feed naturally into what you've got.
So it might start as a separate thing, but overall, it's, it would be great to kind of
get to a point where the platform that allows you to do pipelines also allows you to set
SLOs and feed them into the infrastructure and get the breaches reported
to you all in one kind of more or less seamless experience from the engineering point of view.
I think this is kind of a place where, where you need to get to.
And I think in order to get there, you need to have product thinking for the platform.
Right.
And that means you need to have product owners.
You might call them kind of technical product owners for the thing.
Otherwise, there is no kind of product thinking, and you are not going to end up
having a product, but probably kind of, you know, a bunch of stuff that is engineering led,
but doesn't look like product.
Totally.
I think platform engineering is a little bit of a mistake and it's all Google's fault.
And it's because one of the things that they left out of the Google book,
or the original SRE book that I think drives everybody outside of Google crazy,
especially me, is like the leadership structure that's in place at Google
that enables SREs to speak up and be heard.
Right.
So that this is something I learned over the years from various folks
I've known that have worked there, but there are very high level leaders,
I think all the way up to the senior vice president level, who represent reliability.
Right.
And most businesses have that for security, but they don't have it for reliability.
That's pretty unique to Google and maybe a few other places.
And so I think a lot of why platform engineering is arising is because,
well, one, it lets us take a lot of these concerns in what we're building
and actually build a platform where we can just encode them into the platform.
So the developers don't even really have to think about it.
But the reason that we're driven to do it that way
is because without that power base, right,
like I'm going to talk about just power dynamics here, right,
without some kind of platform with some kind of buy-in and gatekeeping,
really just gatekeeping, it is super duper hard to be heard as an SRE.
Because you're just this voice out in the wild, right.
And like I was saying before, it's mostly the leadership
that are like trying to ship products and keep the business going
and make their P&L go up and to the right.
They're just not thinking about it and they don't want to hear about it.
They really don't, right.
They don't want me showing up on their door and going,
hey buddy, you've got a huge architectural risk right there.
And that thing's going to catch on fire one of these days.
You're going to have a bad time.
I could fix it now.
It could cost you half a million bucks.
Right.
And they go, ah, I'm just going to roll dice, right.
Because I don't have any power.
I can show them the risk and they can make a business decision on that risk.
But humans are notoriously bad at this.
And so I think that's really where platform engineering is truly an evolution
of SRE and DevOps and all of the things that came before it.
But the real pivot that it's bringing to the table
is that installation of somebody in power
who has a way to say to the leadership team,
like no, you're actually going to do reliability.
And that's not an option.
Yeah.
I mean, part of modern or not even necessarily modern,
but as we were saying, professional software engineering is prioritization.
Being able to say we're going to build this thing and not that thing.
And when priority inversion happens,
it's often because of a lack of information
or because of a lack of sometimes a champion.
So having a seat at the table during that prioritization
is super important.
And I agree.
That was completely missing from the book.
I think that was even left out on purpose
because we didn't want to create any,
like, "you have to be like Google" vibes.
But maybe at least just saying it would have been helpful
in that structure, in that book that people saw.
Let's keep going.
So Amy, how about when you have operators and system admins
like that haven't done software engineering in the past?
Have you had any good experiences in leveling them up and training
and all these kinds of things like A, does it work?
B, how difficult is it?
C, D, E up to you.
How do you feel about this?
Yeah, yeah, yeah.
So A, it does work.
B, it's super duper hard.
It's the hardest problem in business.
Like how do I uplevel a workforce?
In any domain, go and ask some business leader
what's the hardest problem for them.
Like how do I uplevel my workforce?
So just out of the gate,
it's the typical move that we've all seen
and like I'll just repeat it so we're on the same page.
Is a lot of places looked at the Google book.
They probably didn't read it, to be honest.
They flipped through it and they went,
look, this might help me.
I like reliability.
And they renamed their sysadmin team to SRE.
Oldest story since SRE was invented.
And when that happens, it has this effect.
It does actually do something.
So to sit here and say like renaming the team does nothing,
that's a lie.
Because we've re-oriented their focus a little bit
right out of the gate.
So you start to see teams,
I've seen this probably four or five times,
where some team gets flipped to SRE.
And they're largely just an operations team
and the good ones are doing Kubernetes
and they're starting to iterate towards a platform.
But they don't have that empowerment.
And so that's the big gap, I think.
There's kind of two major things.
There's a skill gap,
but there's also the empowerment gap.
And then we end up in a livelock with the business.
Because the business is like,
well, I'm not gonna trust you with this responsibility
because you don't have the skills.
And then the people sitting in the seats are like,
well, why would I go learn that if I'm not trusted with it?
And so we just go around.
And so they get stuck doing operational work.
And there's kind of all kinds of classic goodies
in operational work, where it is so hard
to get your head above water
to where you can actually start swimming.
So you just end up kind of bobbing up and down
and keeping your face above water.
That's most operations, most places.
So you have to start with kind of incremental growth
in SRE and bringing your people along.
And that's the only thing I've seen that works.
You can do the flip,
but you have to start to peel off some of the toil
and do it incrementally
so that you kind of make enough room
to start implementing one SLO.
And then you make enough room
to build another SLO or an incident management
or do a retro program or something.
And just snowball that.
But I've never actually seen anybody flip it.
And then everybody goes, yes, I have an SRE.
Now I'm gonna go learn how to be a great software engineer.
They go, what the hell is a sprint?
And why would I sprint for two weeks?
Like that's what most people say.
And that's usually where we start,
is with those processes.
So yes, you can do it.
It takes investment and it takes empowerment.
And I know of a case where the team wasn't just renamed
into SRE, but they got a raise as well.
And nothing changed.
Right?
The leadership thought, okay,
you know, if you raise the salaries,
if you rename the team,
then you know, they will live up to the challenge.
Right?
But you know, on the ground, nothing changed.
So what I found useful is if you have got an operations team
that has done traditional operations all along.
And you inject a real software engineer into the team.
So just a software engineer that's got interest
in infrastructure stuff and automation and so on.
Then this can actually catalyse a change.
Where the software engineer starts building something
as a software engineer for a certain purpose,
for a certain user and so on.
This is where things can catch on.
Because then, you know,
they kind of go to lunch together and so on.
They have conversations on a daily basis
and they've got kind of a connection
then to a software engineer,
which is so close.
They have never had that before.
And this is where you can catalyse the change.
I love that.
So embedded SRE is wrong.
Embedded SRE is right.
I'm not going to say either one's right or wrong.
I just thought that'd be funny.
That was the missing chapter in the book.
We fixed it.
Good job, everyone.
One of the jobs,
one of the purposes of SRE in a group is scaling.
Right?
It's like, the system's already there.
Sure, it's going to be adapted over time.
But one of the challenges is like just growth,
like handling growth and like stuff,
stuff happens at certain inflection points along the way.
Do you have any stories or insights
around like pitfalls around or, you know,
the opposite of pitfalls, whatever those are,
around scaling of these types of systems,
whether they're operational or, you know,
complete rearchitecting.
How do you see them coming?
How do you approach them, etc.
I think something like this could be a great catalyst
to actually allow the proper process for operations.
Like SRE, for example,
because if you start to scale, right,
so the number of users is growing,
the session length is growing and so on,
then that will place higher operational demand
onto your services,
which means they will need to be more available
than before when the load was smaller.
But then if you don't have the processes
and the cultural substrate
in order to support that availability,
then this is where you'll come to a point
where you understand, okay,
we need to make some drastic change here,
otherwise, we are not going to be able to handle it.
Right, so we'll just not be able to provide
the availability that's needed
for that type of traffic that we have never seen.
So I think this could actually lead to a great crisis
that you can use in order to start introducing
proper processes.
That's how it often goes, right?
Like, you wait for an incident,
and then that creates the lightning in a bottle,
and then you can point that at a team,
empower them to go and fix the problem, right?
So I think Steve asked about,
like, how do you predict it?
But like, this is ultimately
how most businesses get it done.
Really, like, I think there's a rare privilege
a few of us have had,
like, rarely in our careers even,
where we're somewhere where things are like,
we can see the next 10x coming,
we can see the 100x coming.
Usually what happens though
is we join a business in a somewhat steady state
where it's at some single growth rate, right?
And then the actual problem we have
isn't scaling up.
So I think most SREs,
most folks carrying that title out there,
have the opposite problem,
which is how do we make things scale down?
Because almost nothing is designed to scale down.
Lots of things scale up just fine.
Kubernetes scales up like a dream, right?
Amazon scales up like a dream.
You can add more instances,
web servers scale like,
they're super easy, well understood,
I don't say super easy,
but like, straightforward,
we have good tools we can apply to the problem.
But when it comes to,
in a lot of enterprises,
when we talk about scale,
what we actually run into
is things that don't scale down.
And so we have either like,
programs that want to be resident in memory 24x7,
so that they have like,
their own little built-in schedulers and stuff,
and so they're not,
they're all tightly coupled code,
so like, yep, you're stuck spending,
you know, $10,000 a month to keep N servers up,
to keep those things running,
because they can't scale down, right?
And so that has a real business cost.
So that's, when I think of scale,
like, that's most of the problem
that I've worked on in the last few years.
It isn't the other direction,
because that's usually like,
add more Cassandra nodes,
add more Kubernetes nodes, you're good.
But instead,
how do I build something small
and economical,
and then iterate on it while it scales up?
And that's actually,
I think, far more interesting
for the majority of folks out there
doing enterprise work.
Yeah, it turns out cost matters.
Funny, huh?
Indeed.
Just depends on if you're like,
directly in a business
that has AI in its name or not.
All snark is welcome
at all times here, by the way.
Speaking of which,
do you have any examples
of a time where a tool,
method, product, thing
just changed the way
that you or your team looked at
or worked in production,
either for better or for worse?
Like, there was this thing,
that sort of like changed the way
we all approached, you know,
production as a whole,
whether it was a way of thinking
or a piece of technology
or someone's, you know,
way of thinking,
I don't know,
something like that.
So for me,
it was clearly getting the development teams
that have never done operations before
to actually look into production
by applying things that SRE suggests
by kind of defining the SLIs,
thinking about the user journeys
that would deserve SLOs,
then feeding the SLOs
into the SRE infrastructure,
starting getting feedback
from the SRE infrastructure
and then the crux:
reacting to the feedback
from production,
kind of, you know,
enacting that powerful feedback loop
of, you know,
defining something,
then putting it into production,
getting feedback from it
and then reacting again.
Kind of, you know,
instantiating the scientific method
in the operations arena,
so to speak.
I mean,
this just completely changes
the attitude of software engineers
to what they build
because they suddenly
don't just build it,
but they build it to observe it.
I 100% agree with Vlad on that.
I might add a bit of color
around the last thing that he said,
which was observability.
I think today we're in a different world
of observability
than we were even five years ago.
Just with OpenTelemetry out there,
instrumentation,
kind of like for a long time
out in the world,
we had monitoring vendors
who had SDKs
that you could hook up to your code
and it would auto instrument it
and give you this wealth of dashboards
kind of for free.
There's some of them very famous for that.
I'm not going to mention their names
because they sometimes get upset with me.
But that's shifted.
It's all open source now.
OpenTelemetry is re-implementing
most of that auto telemetry.
So there's not much excuse
not to monitor your systems anymore
because largely what you do
is you point OpenTelemetry at it
and it auto instruments
most of your network calls.
So you get an 80% solution
for fairly cheap.
And then there's this wealth of tools
that have evolved over the last few years
largely out of inspiration
from the hyperscalers
around Facebook's Scuba,
it evolved into Honeycomb
which is something that
I've been using at my work
in the last few years
with OpenTelemetry.
And that's enabled that cycle
that Vlad described.
It's a small part of it,
but part of that cycle is
our developers
we could put them in the production stream
and that starts to create empathy
with the operations.
But the observability cycle
is what starts to change their mental models
such that they start to design
for operations earlier in the process.
So that's where I think
the modern observability
where we have distributed tracing
just kind of is so easy now
compared to even just a few years back
it starts to change the game
because now I can roll it out
and see what is it actually doing
as opposed to like
do these numbers on this graph
look kind of like what I imagine
they should look like
from my mental model of the system.
It's a very different world now.
So I think tools like Lightstep
and Honeycomb
and some of the modern
Grafana
and stuff
is starting to really get there
and make this faster.
This might be leading
the witness a little bit
but that's okay.
I'm curious
when you introduce a system like this
and you introduce just like
conceptually like
hey we can take this x-ray to production
we can understand its bones
you know in a pretty deep way.
Does that land pretty quickly
with developer teams
who have never done it before?
This is the leading part.
It doesn't.
But how do you fix that?
Like how do you get it to work?
The way I did it at Equinix
was I embarked on like a six month project
where I just would stop by repositories
add the OpenTelemetry instrumentation
work with the team
get it deployed
get it turned on
and pushing out to the metric system
and I really didn't get a lot of uptake
until I got almost all the way to the end
when I could start to show developers
and be like hey when you do this with our API
these are all the things that happen in here
and then they start going
what the heck is that?
That thing polls?
Why is it polling?
And why is it polling every 10 milliseconds?
Like what the heck?
That's where you want to get them, right?
But until you can show them
a waterfall of like most of the system
it doesn't click is what I've found.
There's a few people who like get it, right?
And you say like oh it's a distributed
it's a you know directed acyclic graph
and they're like oh I love this
let me go do some tracing
and everybody else has to see it
and then they're like oh that's pretty cool
and then they start working with it.
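The "why is it polling every 10 milliseconds?" moment comes from seeing the same operation repeat at a steady interval in a trace waterfall. The sketch below shows that idea on made-up span data. It does not use the OpenTelemetry API or any real trace export format; the `Span` type, the operation names, and the thresholds are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    name: str        # operation name, e.g. "GET /status" (made up)
    start_ms: float  # start time relative to the start of the trace


def find_polling(spans, min_calls=5, tolerance_ms=1.0):
    """Return {name: interval_ms} for operations repeated at a near-constant interval."""
    starts = defaultdict(list)
    for span in spans:
        starts[span.name].append(span.start_ms)

    pollers = {}
    for name, times in starts.items():
        if len(times) < min_calls:
            continue  # too few repeats to call it polling
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        mean_gap = sum(gaps) / len(gaps)
        # Polling shows up as gaps that barely deviate from their mean.
        if all(abs(gap - mean_gap) <= tolerance_ms for gap in gaps):
            pollers[name] = mean_gap
    return pollers


# Made-up trace: one API call that triggers a status check every 10 ms.
trace = [Span("POST /devices", 0.0), Span("db.insert", 3.0)]
trace += [Span("GET /status", 5.0 + 10.0 * i) for i in range(8)]

print(find_polling(trace))
```

In a real tracing tool the same pattern is visible by eye in the waterfall view, which is the point Amy makes about people needing to see it.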
Yeah, my experience is the same
it takes a long time to get a development team
that has never done operations before
to a point where they really care
to a point where they kind of notice you
and so on
and that requires team-based coaching
I found, I did it
where you really get the people together
including the product owner
who are supposed to all deserve this
end to end
and then kind of
really kind of working with them
and getting them to a point where
there is evidence where
you kind of start having insights
like seeing is believing okay
I've seen this so now
I believe it and so on
definitely
definitely requires a lot of dedication
and coaching
and also
once you kind of
you think that you've got them to a point
where they do this
you know
come back to them half a year later
and check whether they still do it or not
right?
Yeah, that's a good one
Yeah, and another interesting bit is
we mentioned before
the cost consciousness in the cloud
that can be a good driver for this as well
because if suddenly
your logging costs skyrocket
and
only your CFO
finds it out
then this is definitely already too late
because then you are accruing the cost
at a much faster rate
than you should
and this
if that kind of
gets pushed onto the organization
to actually reduce cost
then you can only reduce cost
once you've got
certain visibility into cost
and it requires observability and so on
so that could
you know facilitate
the whole interaction
of looking into what's going on in production for you
Great
This is again
maybe a kind of leading question
but like
I'm curious
like one of the kind of
ideas that SRE teams have going into
adopting SRE
or maybe it came from the books
it's hard for me to really know
where this idea came from
is that
you know SRE teams
you know in some sort of platonic ideal state
will like
magically upgrade
you know the guts of some system
like the term that we've used
in the past has been like to
repair the
tires on a race car
during the drive
or whatever
repair the wing during flight
or something like that
like
how often does this actually happen
like do you see teams
specifically from an SRE
like either directed
you know hey we should do this
or like directly from the SRE team itself
actually swapping out the guts
of some running system
in such a way that it
continues to run and scales up
and is faster and better
and shinier and happier
Is this a myth?
Is this real?
What do you guys think?
I think it's a bit myth and a bit real
right like
I think SRE teams
and I jumped in ahead of Vlad
because I know he's gonna have a ton to say about this
because I think platform teams
like the SREs that have established
a developer experience that they own
especially when they get the abstractions right
can change out everything underneath it
and nobody ever knows right
like if the abstractions are really solid
that's exactly what they're for right
and so if you have Kubernetes in place
you can swap out the Kubernetes fleet a lot of times
even entire clusters
and nobody even knows
sometimes the software engineers
don't even know you did it
so I think like it is real
but it's like it has a scope
right, that pretty rarely,
in my experience, spans past
the infrastructure layer
and then once you get up
into application tier
obviously now we have so many product requirements
involved critical user journeys
and things like that
that you're back into a plain old
software engineering effort
to go and like
change out components of the software
and move it and move it around
and that's where the SRE comes in
more of an advisory role
and I do this a lot these days
where I come and help a team figure out
how they're going to do like a Strangler pattern
right which is where they basically like
grow a new API
move all of the traffic over
and then kill the old one off
when it basically has no more light
and so like that's usually what I do
is I go just help people reason about
like here's your two-year plan
to get off this you know
very old crusty platform
with a million edges
that are going to take forever to file off
you know and here's how you can do
load balancer tricks
here's how you can do various
you know whatever the tricks are
we bring to the table
usually we have those
especially for the infrastructure
to stitch things
so that we can do that seamless swap
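
The Strangler pattern described above (grow a new API, move the traffic over, then kill the old one off) can be sketched as a weighted router. This is only a minimal illustration, not anything the speakers showed: the names StranglerRouter, legacy_api, and new_api are hypothetical, and in practice the weights would usually live in load balancer configuration rather than in application code.

```python
import random


class StranglerRouter:
    """Sends each request to the legacy or the new backend by weight.

    Hypothetical sketch: `legacy` and `new` are plain callables standing
    in for the old platform and the API being grown to replace it.
    """

    def __init__(self, legacy, new, new_weight=0.0, rng=None):
        self.legacy = legacy
        self.new = new
        self.new_weight = new_weight  # fraction of traffic sent to the new API
        self.rng = rng or random.Random()

    def set_new_weight(self, weight):
        # Raised step by step over the course of the migration.
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0.0 and 1.0")
        self.new_weight = weight

    def handle(self, request):
        # random() is in [0.0, 1.0), so weight 0.0 never picks the new
        # backend and weight 1.0 always does.
        if self.rng.random() < self.new_weight:
            return self.new(request)
        return self.legacy(request)

    def legacy_is_retired(self):
        # Once all traffic goes to the new API, the old one can be removed.
        return self.new_weight >= 1.0


def legacy_api(request):
    return ("legacy", request)


def new_api(request):
    return ("new", request)


router = StranglerRouter(legacy_api, new_api)
router.set_new_weight(0.1)   # start with a small slice of traffic
router.handle({"path": "/orders"})
router.set_new_weight(1.0)   # migration finished
router.handle({"path": "/orders"})
```

Raising the weight in small steps mirrors the incremental approach discussed in the conversation: each step is a small, reversible change rather than one big swap.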
yeah and what I found is that
this kind of hot swapping the tires
on the Ferrari when it's you know
racing on the track
that kind of happens often on
I'd say more often on less mature teams
in terms of operations
on the teams where you've got more
maturity that happens
I'd say kind of rarely
would you say that it's rare
or is it just constantly in motion
but with smaller changes?
Yeah, yeah, that's right
I think that would be the maturity
right, it's like
an immature team
is going to make big changes
that are dangerous
and when I see a really high-performing,
mature team
it's always,
every sprint,
they're pushing
an incremental change
toward that change
Yeah
Great
so now
the questions that are the best part
of interviews
I hope
what do you want to see
happen
in the next two years
when it comes to
SREs around the world
like the people who identify
as SREs
like per the official books
what's the big change
you want to see
or even
better
like several small changes
right, it doesn't have to be big changes
I'd like to see
our salaries double, 10x
that would be good for me
anyway
when somebody offers you
more money
you say yes
[unintelligible]
Yeah
I think if we asked
all the SREs in the world
whether they would like this plan
I just laid out
I think they would like it
but we probably need
something a little more practical
to get the business
to do it
I think the march toward
platform engineering
is going to be really good for SREs
I think that
more often
the enterprise
the enterprise tends to be a little more
a little bit
of a follower
going where things have already gone
that's what I mean
when I say enterprise
it's not
typically
adopting the bleeding-edge
way of doing things
but they already have
this concept of a platform
a lot of times
it's
it comes out of IT
and so we feel weird
about that
but I think there's a real
opportunity here
to build
this basis
of influence
because if you come from the platform
SREs
typically have the expertise
to build these platforms
and to build them the right way
and that leads us
to influence
in the leadership community
to start pushing
further
and further
for reliability practices
so if I get one change
I just want to continue
down this road
we're on
and just keep hammering
to see where we can
lift up,
to lift up more SREs
so they are able
to control things
before they're already in production
Yeah
so what I'd like to see
in SRE
is
just
the application
of AI
to mundane tasks
so if you are an SRE today
you have to have
a lot of things in your head
where my servers are
what the service is
what it is
for the services
in this region
and all that
why don't we
just
have a good chatbot
that can give me
all the answers
like that
and also
imagine how
SRE onboarding
would be simplified
if the SRE
wouldn't need
to
bother the
the person
at the next desk
but would just
be able to
have a conversation
with an
absolutely
standard
chatbot of this type
that would know
this kind of thing
Great
cool
cool
and then
speaking of
new employees
what advice
would you give
to
not
new employees
but
even before that
like
SRE,
people
who
might be looking
in this direction
like
A
do you think
anyone knows
that this career exists
as of,
you know,
when they are
getting started
in the industry
but
B
like
how would people
be best off
spending their time
if they want
to head in the direction
of SRE
and reliability
I think
the main thing
I would tell
these folks
is
there is a kind of work
that is
safe
from AI
really
just
one that we have
a good notion
is going to be around
for a very long time
and that's
adaptive work
where we are
in the territory
of
knowns
and
unknowns
those are the places
where AI
just starts
to fall off
and where human creativity
and
pattern-finding
start
to be
superior
like
I think even
with the models
I think we'll continue
to be
superior
at that point
so
from that perspective
I think a lot of young
folks
are asking themselves
today
like
what can I do
that's safe
from the machines
and
they should
be asking
that
because
we are seeing
real innovation
in the space
I think
where it goes
I'm not going to
guess
but
what's going to remain
is that
the computers
will run
into situations
that they can't
resolve
anymore
and that will be
the wall
where
it's everything
that has already been
done
and so
that leaves us
that leaves us
[unintelligible]
that is
when it can't
do it
say
we have
an application
from a vendor
and
it's
[unintelligible]
things
like
what Vlad described
which are
kind of
the low-hanging
fruit
which is
toil
should be
done
by the computers
what's
left
are the
unknowns,
the unknown unknowns
and
that's
where I would
[unintelligible]
it's
like
be flexible
be adaptable
[unintelligible]
[unintelligible] that are really about something.
And while you were learning something at university for the dev part of the story,
you probably won't have had the chance or the time for the ops part of the story.
And you can learn the ops part of the story,
definitely, at a company that does operations.
So that's the advice I would give.
And then, just follow your instincts.
If you're interested, you can pursue that interest.
And that's how SREs got here, right?
Steve, how many of your colleagues were physics majors in college?
Weren't they?
I'm going to say more than 15%.
10%.
That's like a pretty amazing number, right?
It's philosophy.
Philosophy, philosophy of physics.
The thing that leads people here is curiosity,
and seeing the real world.
And saying, this is the way things work, this is where I want to work.
Or I need to get paid.
Yeah, my last boss majored in drama, I believe,
or in theater.
So, there's...
I'm a [unintelligible] major.
There's...
Cool.
OK, well, thanks for coming today.
This was a great discussion.
Before we sign off, I think it's customary
for you to say where you can be found on the internet
if people want to follow you.
And maybe, like, a final take, if you are so inclined.
A little parting shot, maybe.
You can find me on LinkedIn.
And...
Well, pay attention to observability.
Great.
I can be found on...
Mastodon at renice, so like the command, renice.
renice at hachyderm.io.
And I can be found on LinkedIn as well.
And my take is...
I think where all of our work is heading
is what, in the resilience community,
we call adaptive work.
And that's what remains when everything is automated,
which is the things we didn't predict
or the computers can't predict.
And so, really, look to the resilience community
for a bit of how we adapt,
the adaptive side of the resilience community.
That's great.
Thanks very much.
And as we say,
may the queries flow and the pager stay silent.
Thanks.
See you.
Thanks a lot.