The One With AI Agents, Ramón Llamas, and Swapnil Haria

Duration: 42m 7s

Release date: 23/07/2025

Google Staff SRE Ramón Llamas and Google Software Engineer Swapnil Haria join our hosts to explore how AI agents are revolutionizing production management, from summarizing alerts and finding hidden errors to proactively preventing outages. Learn about the challenges of evaluating non-deterministic systems and the fascinating interplay between human expertise and emerging AI capabilities in ensuring robust and reliable infrastructure.

Hi everyone, welcome to the Prodcast,
Google's podcast about site reliability engineering and production software.

I'm your host, Steve McGhee.
This season is all about our friends and our trends.
That's everyone who has come into the SRE space, the new technology,
the modernized processes.
And of course, the most important part is the friends we made along the way.
So happy listening, and remember: hope is not a strategy.
Hi everyone, and welcome to the Prodcast.
This is a podcast from Google about SRE and production.
I'm Steve; Matt is here with me.
This is Matt.
Hi, we have more people with us today, too.
I think we're going to talk about our friends and our trends, because that's the theme.
But for the moment, we have two friends of the Prodcast.
We have two new friends today.
Can you introduce yourselves?
Hello, I'm Ramon.
I'm a senior staff, site reliability engineer.
I've been at Google for like a long time, like 12 years.
And I'm in a team that is called Core SRE.
So we run infrastructure, you know, from authentication services to data, that kind of stuff.
And we have been working now for like a year or so in building agents for production, for production management.
Nice.
Hi, I'm Swapnil.
I am not an SRE.
And like most of us here, I am a software engineer.
So I have some experience with production systems.
I work in Core Labs, which is a part of Google.
That's building AI agents for different use cases.
And I started at the bottom of the computing stack.
So my undergraduate degree was electrical engineering.
So doing VLSI design.
Then I moved to computer processors.
Then I moved to operating systems for my graduate school.
Then databases at Google.
And now finally, like coming up full stack to agentic use cases.
Are you going to do UX next?
And then like psychology?
Is that how this works?
Yeah, if it doesn't do it before me.

Cool.
So you guys both work in Core, which makes it sound like you work at the center of the planet.
But I'm guessing that's not really the case.
But you guys, you guys work on systems that are used by other systems, right?
They're kind of like back end to the back end to the back end.
So that seemed right.
Yeah.
About that.
Like infrastructure that is used by all the robots.
There is as well another thing that is, you know, TI, we call it, which is the back end to the back end to the back end
of the back end.
Very low in the stack.
TI is technical infrastructure, right?
Like it's a, it's another one of these like set of words, which are pretty generic, but they, they make very specific sense inside of Google.
People running the metal basically.
That's right.
Cool.
So the other phrase you said or the other word you used was agents or agentic and stuff.
And this is, this is where the trends come in, right?
This is a pretty popular new term these days.
I think a lot of people have heard it, but it sort of comes with a question mark next to it a lot of the time.
They've just heard about it.
I'm feeling you guys are a little bit deeper than that when it comes to actually knowing what it is.
And then like using it, maybe, and working on building them and stuff.
So can someone start with like a high level?
Like, what are we even talking about here?
So I think the right way to talk about agents is to build incrementally.
So we have our static deterministic algorithms, right?
Where it's a set of sequences.
You have an input.
I can work through with pen and paper
how the operations are going to happen, how the input is going to be transformed into the final output.
So that's a static algorithm.
Now we have slowly we are seeing cases where some parts of this algorithm could be outsourced to an LLM.
So for example, you know, I could have a doorbell camera company and I want to have a specific sound when the mailman comes just so that, you know, folks are aware that there is a package.
So I could say if mailman in frame, right?
And this, if mailman in frame is being executed by an LLM and giving us the response.
So we have replaced some parts with some amount of non-determinism.
And then we go to the other end, which is a full agent, where there is no script anymore.
The agent gets an input.
It has some high level understanding of how to solve the problem.
But it comes up with its own set of sequence, like steps dynamically on the fly.
At each point, it thinks, OK, do I need more information?
Should I do, you know, a function call?
Should I invoke one of these good old fashioned static algorithms instead?
And over a series of steps, it comes to an answer.
But there is no script.
There is no input.
So there's an input and an output, but we can't manually see how it flows
from the input to the final output.
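A rough illustrative sketch of that spectrum, with a hypothetical call_llm() standing in for whatever model endpoint you use (none of these names come from the episode):

```python
# A hypothetical helper that stands in for any LLM endpoint.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to the model of your choice")

# 1. Static, deterministic algorithm: every step is scripted and traceable.
def ring_tone_static(frame_metadata: dict) -> str:
    return "package-chime" if frame_metadata.get("uniform") == "mail" else "default-chime"

# 2. Same script, but one predicate is outsourced to an LLM, adding non-determinism.
def ring_tone_llm_assisted(frame_description: str) -> str:
    answer = call_llm(f"Is there a mail carrier in this frame? {frame_description} Answer yes or no.")
    return "package-chime" if answer.strip().lower().startswith("yes") else "default-chime"

# 3. Full agent: no script, just a goal, some tools, and a step budget.
def run_agent(goal: str, tools: dict, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):
        decision = call_llm(
            f"Goal: {goal}\nObservations so far: {observations}\n"
            f"Available tools: {list(tools)}\n"
            "Reply with 'CALL <tool>' or 'ANSWER <text>'."
        )
        if decision.startswith("ANSWER"):
            return decision.removeprefix("ANSWER").strip()
        tool_name = decision.removeprefix("CALL").strip()
        result = tools[tool_name]() if tool_name in tools else f"unknown tool {tool_name}"
        observations.append((tool_name, result))
    return "no answer within the step budget"
```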
Oh, that sounds a little out of control.
No script.
I mean, how are we going to get to our goal here?
I'm a little a little skeptical.
I mean, you just establish a goal, right?
And then you can see the trajectory, as we call it, that the agent is following.
Because I mean, an agent could be just a call to a singular call to a model, right?
You just prompt an LLM with a goal.
That could be an agent, right?
If your goal is, like, calculate the inflation in the US for the last five years,
right, that's something that it might have memorized, or it might give you some code to run,
right?
And that's it.
That's a very simple kind of way to see it.
And then you can go to, you know, systems that are way, way more complex.
They are super untraceable.
Well, not untraceable, but it is difficult to debug and trace
what they are going through to achieve the goal.
OK, so I'm getting a read here.
The agents are goal-directed.
Perhaps they have an unexplainable set of behaviors that they use to get to that goal.
And you're going to give them a lot of capabilities.
And I'm hoping you're going to tell me what these capabilities can be.
How do we define these capabilities?
Are they totally open-ended?
It depends.
So there are tools. Swapnil mentioned some of them. Like, you can call
a tool that is built into your agent.
Like, for example, if you open gemini.google.com, there are tools
that are already transparent for you, like Google search is one, right?
It calls a search as a tool for continuing the trajectory to a goal
that is the prompt that you put in there.
But then you can have the agent generating code on the fly
and executing it in a sandbox.
So when you give a prompt to the LLM, right?
What information does it have?
It has the prompt and then it has been trained to be a good, you know, large language model.
So there was some amount of training data that was fed to it.
And this data has been getting bigger and bigger over time.
But let's say it was trained on, you know, March 25th.
So at that point, its knowledge of the world is limited to what was, you know,
in its training data at March 25th.
And if I ask it, things like, OK, should I go out today for a walk?
It has to figure out some dynamic information.
So that's where tool calls come in to interact with the external system,
trying to get more up to date information and be able to generate
additional observations to answer the question.
And if you just have a few different tools, maybe it knows
where my location is, maybe it can figure out the weather,
maybe it can see, do I like walking, right?
Is there a good park near me?
So slowly, you can see how we can add capabilities
over time to give a more refined version of the answer.
And in some cases, we might not even realize
what the capabilities that are going to emerge are, right?
So we'll give it a set of tools, but it is mostly allowed to do,
you know, whatever it wants, up to a set number of steps,
to give us the final answer.
And that's where some of these emergent capabilities
that people talk about come in, where things just emerge,
where it realizes that with this combination of tool calls,
it can actually give us a much, you know,
better version of the answer, to a question we didn't even ask.
That's where it gets surprising, but also more challenging to evaluate.
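A small sketch of that "should I go for a walk" pattern (all of these function names and return values are illustrative, and call_llm() is a placeholder): the tools exist only to supply facts the model's training cut-off cannot contain, and their outputs are folded into the prompt as fresh observations.

```python
import datetime

# Hypothetical placeholder for a model call; swap in your own client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

# Tools supply facts the model cannot know from its training data.
def get_location() -> str:
    return "Zurich"                          # e.g. from a device API

def get_weather(city: str) -> str:
    return "14C, light rain"                 # e.g. from a weather service

def get_preferences(user: str) -> str:
    return "enjoys walking, dislikes rain"   # e.g. from a profile store

def should_i_go_for_a_walk(user: str) -> str:
    # Gather fresh observations with tool calls, then let the model reason over them.
    context = {
        "today": datetime.date.today().isoformat(),
        "location": get_location(),
        "weather": get_weather(get_location()),
        "preferences": get_preferences(user),
    }
    return call_llm(f"Given these observations {context}, should the user go out for a walk? "
                    "Answer briefly and explain which observation mattered most.")
```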
So it sounds like the two categories of these tools
or our agent capabilities are getting at some context
that wasn't in my training data.
And the other one is like doing stuff,
besides just saying words, like going out and articulating something
or calling another agent, for example,
would be one of those or actually like executing some code
or something like that.
Does that sound right?
You can see how there are some capabilities
that are very evidently simple, like calling the weather service.
Or if we are in production and you have an agent
that is managing production, it's like reading some time series data
from your Prometheus or whatever you are using, right?
But there are others, like world-modification capabilities
or capabilities with side effects,
like, I don't know, if we are in production,
like triggering a binary rollout, right?
So the tools and the capabilities
that you give to the agent, right?
You need to know what you are giving it,
because knowing or predicting
how and when they are going to be called is not that trivial.
Yeah.
Yeah, with great power comes great responsibility, right?
Yeah.
Is there a notion of a read-only versus a write-capable agent?
Or how do you put boundaries or envelopes around your agents,
or how do you talk about the sort of layers
of responsibility and capability?
Because I kind of presume
I don't want to set it free in my organization
without giving it some ACLs of some sort.
Yeah.
So there is a few ways, right?
So typically, at least in the agents that we built today,
we don't allow them to make any kind of world modification
or any way to mutate the state of the world.
So write actions are typically very restricted.
You see a lot of examples of coding agents
and the way these work is it's happening inside a sandbox.
So you can edit any code you want inside the sandbox.
You can run any tests.
So you are modifying the state of the world,
but only inside a sandbox.
And anytime you have to break the sandbox illusion,
if I want to make, let's say, a write,
you know, request to a Spanner database,
or to any database,
you have to enforce some additional checks.
In our case, we try to get human permission
before it does anything.
So it will recommend that you do this
or it'll say, can I go ahead and make this call
and there'll be a yes or no option.
In some cases, you'll say, OK,
can I see some additional rationale
for why you want to make this action?
But you need to have additional guardrails
in place for some of the write actions,
as we call them.
This is a common pattern.
You see it in tools like,
for example, Claude Code,
which is a CLI to do coding on your computer
and your workstation, right?
It has some safety parameters as well.
So I think in the agents that we have seen,
both in large production deployments,
but also on your workstation,
or in things like the Gemini app
or in ChatGPT and these things, right?
You see this differentiation
between when we are only accessing data for reading
and when we are trying to execute an action,
like booking you a restaurant or whatever.
There is still a lot of human approval for all of them.
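One way to picture that kind of guardrail, as a hedged sketch rather than how any particular Google system is built: read-only tools run freely, and anything that mutates the world is gated on an explicit human approval.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[..., str]
    mutates_world: bool          # write actions need extra guardrails

def human_approves(action: str, rationale: str) -> bool:
    # In a real system this would page the on-caller or post to a review UI.
    answer = input(f"Agent wants to: {action}\nRationale: {rationale}\nApprove? [y/N] ")
    return answer.strip().lower() == "y"

def execute(tool: Tool, rationale: str, *args) -> str:
    if tool.mutates_world and not human_approves(tool.name, rationale):
        return f"{tool.name}: blocked pending human approval"
    return tool.run(*args)

# Read-only tools run freely; writes (e.g. a database update) stop for approval.
read_timeseries = Tool("read_timeseries", lambda metric: f"values for {metric}", mutates_world=False)
write_row = Tool("write_row", lambda row: f"wrote {row}", mutates_world=True)
```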
Well, let's move into production and SRE spaces.
So now that we have a basic idea of this model
and this way of working,
which I appreciate, I think that was really well done.
What do we do with it?
Like as SREs and people tending production,
how do we take these generic ideas and capabilities
and apply them to making a more robust and reliable system?
What have you seen work so far?
One thing that we have seen
that works really, really well across the board
is what we call one-shot summaries, right?
Or one-shot responses.
So in production,
you give the agent access to many data sources, right?
Which is non trivial
because in an enterprise,
integration of data access is not an easy feat, right?
Because of permissions, different schemas,
different things and so on.
Let's say that you give the agent
like a bunch of access to your Prometheus data,
it has access to your logs,
it has access to your things, right?
It can summarize the situation really well for you.
For example, we have seen that work really, really well.
So when you get an alert, for example,
it's gonna come with an attachment that says:
the situation of the world is like this, right?
You have this correlation and that, right?
So for example, in the case of troubleshooting,
it saves you like all the like 5, 10 minutes at the beginning
to get the situational awareness of like,
okay, what's this alert?
And what is happening kind of thing?
You get that done for you, right?
You still need to do the troubleshooting yourself, right?
But I think that works really well.
Other things that we have seen that are good,
and that are starting to show a lot of promise,
are, like, doing reasoning on top of that.
So it's helpful to think of different LLM capabilities
and see how we can apply them to the SRE use case, right?
So the easiest Ramon mentioned was summarization.
When the alert happens, there is so much information
that at least at Google, an engineer has access to.
So things like:
how many incoming requests were coming into the system?
How many outgoing requests were going out?
Are there any outages upstream above my service or downstream below my service?
or downstream below my service?
There is thousands of lines of server logs
that you have to go through,
thousands of lines per minute or per second in some cases.
So there's a whole lot of information surrounding an alert
and just processing them is quite difficult
as usually as a junior on-call or at least.
You have to quickly parse out what is useful,
what's not useful.
So this is one way in which an LLM can help
where it's summarizing all of it
and weeding out the weed from the shaft in some ways, right?
Keeping the important bits,
removing all of the extraneous information.
So that's one.
Then slowly you can get into more advanced things.
Like if you have a lot of server log lines,
there might be an error hidden inside somewhere
that looks a little bit like your outage error message.
How can it find that quickly?
So Gemini, or chat models now,
have access to large context windows.
So you can dump a lot of information in there
and say, okay, do you see anything
that resembles the error message I'm seeing?
So you can use it to find a needle in a haystack
if you wanted to.
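A minimal sketch of that needle-in-a-haystack pattern (call_llm() is a placeholder, and the ERROR/FATAL pre-filter is just one assumption about what your logs look like):

```python
# Hypothetical model call; the "dump everything" approach assumes your model's
# context window is big enough for the log slice you pass in.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def find_similar_errors(outage_message: str, server_logs: list[str], max_lines: int = 2000) -> str:
    # Cheap pre-filter first: keep only lines that look like errors, so the
    # prompt stays well inside the context window.
    candidates = [line for line in server_logs if "ERROR" in line or "FATAL" in line]
    prompt = (
        "An on-caller is debugging this outage error message:\n"
        f"{outage_message}\n\n"
        "Here are recent server log lines. List any lines that look related, "
        "and say in one sentence why each one might be the same failure:\n"
        + "\n".join(candidates[:max_lines])
    )
    return call_llm(prompt)
```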
And then lastly, Ramon said reasoning.
That's where we are trying to build
more advanced intelligence in some ways
or at least leverage that.
So if you look at the SRE,
they are doing two things.
They are trying to gather clues
from all of these diagnostic systems
and they are coming up with hypotheses.
The LLM can do that for you.
It can look at these systems.
It can figure out, here's a playbook
of how this SRE would have diagnosed this issue.
I can repeat a lot of the steps.
If you're looking at,
what are the upstream error messages looking like?
Is there a spike in QPS somewhere in the system?
Am I seeing that there was a binary released
in one of my dependencies?
It can take all of this information,
do a little bit of debugging,
surface them as higher level insights
and then the human can quickly make sense of them.
And in this case, we saved a bunch of time
because the human didn't have to click through,
15 screens trying to find the right log messages.
They didn't have to add 16 filters
to narrow down to the right cell in the right region
for the right service,
in the right environment and so on.
So it does a lot of the legwork
of debugging for you.
Saved you a bunch of clicks.
I get this.
Do you have a particular implementation or launch
that you'd like to talk about,
that you feel really great about sharing,
that uses an agent
that does save a lot of production time,
a lot of toil, that you think is specifically demonstrative
of how agents can save SRE effort,
or automate it out, or amplify it?
I think amplify is the right word here that you're saying.
Using the intelligence of humans to greater effect.
I think that would really illustrate
what you're talking about well,
as well as, like, in a risk-averse way.
So we have alerts firing all the time at Google, right?
An alert looks at an error condition.
So if your QPS is above a certain threshold,
maybe you're returning too many errors to your clients,
those are error conditions.
And if they stay for some amount of time,
an alert gets triggered automatically at Google.
So when the alert gets triggered,
we have some use cases where the agent steps in first, right?
By the time the human gets to their desk,
which is typically 3, 4 minutes,
or they are context switching away from working on something else.
The agent has already done a lot of the common steps
that the on-caller would have done.
So for example, was there a release recently?
Is the new release showing higher error rate
than the old release?
So things like that,
it can quickly weed out a lot of the common mitigations
and it can either say, hey, look,
I found the underlying root cause
or it can say, okay, I didn't find the root cause,
but I've ruled out these 16 things.
So now you can actually go and see
what is really the issue there.
And in doing this,
we've seen that it can save a lot of time,
of human time, right,
when they get to that incident, of trying to figure out,
okay, how do I find the right context for this alert?
How do I find the right mitigation or root cause?
And it's still giving them a lot of agency
because we are showing them what the agent has done, right?
And they have to make the call, okay, this looks right.
Here's the agent's supporting evidence for that.
And if it looks right,
here's how I can apply the changes directly.
So that's the process in which we are using agents right now.
The thing is, we are speaking about saving
like a few minutes of time and so on
for the responder and so on,
but what matters is that when you have an alert, right,
you have an SLO that is, like,
under a threshold, whatever that is, right?
That typically means that some service is bleeding out
some reliability, right?
Some users are affected, whatever that is; like,
in a company there is a revenue impact,
there is some, whatever that is, financial
or reputation impact or whatever, right?
It's shortening these minutes, right?
What matters is that this impact goes away, right?
In the sense of, like, yes,
you are gonna have to have humans
at the end of the day looking at what happened
and so on, probably.
The agent is gonna give you just a mitigation action
that you can take, right?
And then the root cause is something
that you need to analyze and fix later, right?
This reduces the amount of time
that your service is not available for your users, right?
Which is really what this is about,
especially when you're running a very large service,
like, you know, Search or Gmail,
whatever that is, right?
We do talk a lot about anomaly detection
and the improvements that AI
has been giving our responders; they're real,
and the interventions are still at the human level,
but the feedback that we're getting
is faster and faster,
and the dashboards that we're using
are more informative and less toily.
And I think what you're saying is the agents
can take more action to go retrieve this information,
consolidate it, make it more concise
and make it more actionable to me as the on-caller.
And like, that just sounds like,
that's a wonderful, wonderful thing
that I want to have every day.
And like, so, okay, fine, we have these agents.
How are you actually deploying, testing
and finding out that they're doing those things correctly?
Cause I think that sounds like a dream.
Do they work?
Does it work?
And how do we test that?
And what are we testing against?
Cause this is against something else.
This isn't just something you make.
And then we go, that's great, thanks, good day.
What's our A-B here?
Works on my machine, right?
That's the golden question, literally, right?
How to evaluate these things?
The evaluation of these new paradigms,
like services that are not deterministic, in a way, right?
It's very hard because, you know, there are two reasons.
So the first thing I want to say is something
that Swapnil said at the beginning,
that is when you are doing a development agent,
like for coding, like, you know, Cursor,
this kind of stuff, right?
You have a sandbox, you don't need,
I mean, you need to evaluate,
but you have the luxury of being on the,
I call it the left-hand side of the development cycle,
because you spin up the sandbox, right?
The worst-case scenario is, like, you know,
just compile your application with the CL
or the patch that it is producing,
and that's the test, right?
If it compiles, you know, passes the unit tests,
it's like, you know, you have an evaluation there.
When you are on the right-hand side, in production, right,
and you have agents that might take actions in production,
production is not a sandbox.
And, you know, if it decides,
like, hey, this agent is gonna, you know,
drain every single cluster or every single data center
or, you know, whatever you are deploying, right,
in one go, because that's what it is,
it's gonna just go for it, right?
So I think the evaluation there is something
that we are being very conservative
in the sense of, like, there is all the safety, right?
But there is as well how we measure
that the agent is producing recommendations
or insights that they are correct, right?
So there we have techniques,
like, there are techniques like autoraters,
there are techniques like having golden data, right?
And there are techniques like using
the human feedback for this, right?
Oh, yeah, Swapnil knows these much better than me.
There was a paper in 2021
with a very provocative title.
It was called, Everyone Wants to Build Models,
No One Wants to Clean Data.
And I think that's very much applicable today,
because like you said, writing some prompts
and having something that kind of works
for a demo is very easy,
but then showing it that it works reliably,
90% of the time is very hard.
And the problem with, at least internally,
at Google is for production use cases or outages,
we don't have a lot of good quality data
of what actually happened.
So we might have, you know, 1,000 incidents per day,
but in 99% of these cases,
we don't have any idea of what the on-caller
actually did to fix the issue.
We don't have good documentation,
we don't have good workflows.
And what we want is, we want to figure out
what the human did,
because that's the right benchmark for the agent.
But there is a lot of work that happens
at the beginning of these projects
where we figure out what is the right evaluation set.
So in our case, we had a bunch of alerts
and we had the right labels for each of them.
So label was something like, you know,
we rolled back the binary, we upsized the cell,
we took some emergency quotas,
we throttled the user, things like that, right?
These are golden labels that you can have.
And for each incident, we need to find out
what was the actual action that fixed the issue.
We get a set of that
and then we make sure that the agent
is evaluated against that.
So agent's output is checked with the actual labels.
Now this is easier said than done,
because the real world data that the agent
is looking at, dashboards and things like that,
it often gets lost over time, right?
It might be retained for let's say 20 days, 30 days,
but beyond a certain point,
you lose access to that old data.
I no longer know what the log said 30 days ago.
How do you keep evaluating in that sense?
So you need to keep generating a large quantity
of golden data to make sure
that the agent is continuously being tested.
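As a toy illustration of that evaluation loop, with made-up incident IDs and labels: each incident's golden mitigation label is compared against whatever the agent recommended.

```python
from collections import Counter

# Golden labels: for each past incident, the action that actually fixed it.
# These entries are invented for illustration.
GOLDEN = {
    "incident-001": "rollback_binary",
    "incident-002": "upsize_cell",
    "incident-003": "throttle_user",
}

def evaluate(agent_recommendations: dict[str, str]) -> float:
    """Fraction of incidents where the agent's recommendation matched the golden label."""
    outcomes = Counter()
    for incident_id, golden_action in GOLDEN.items():
        predicted = agent_recommendations.get(incident_id, "no_recommendation")
        outcomes["match" if predicted == golden_action else "miss"] += 1
    return outcomes["match"] / len(GOLDEN)

# Example: an agent that got two of the three right scores about 0.67.
print(evaluate({"incident-001": "rollback_binary",
                "incident-002": "upsize_cell",
                "incident-003": "drain_cell"}))
```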
And you have ways to close the loop here.
So we mentioned that the agent
gives you some preliminary recommendations
and the on-caller then makes the final call saying,
okay, this is what I need to do.
We could easily add a step
where they record that final step
so that the agent knows it too
and we can have a feedback loop
so that the agent can improve over time.
But that second part, right,
of getting the golden data,
what did the human actually do for an outage?
Or for whatever your use case was, right?
What's the perfect response for a given question?
Getting that is very difficult
and in many cases, it has other problems:
retention of data and other stuff associated with it.
Okay, so we went over how these agentic systems
can be useful for production in general.
And before that, we talked about
what are these capabilities that an agent might do.
We talked about tools and reaching out
and doing things and something like that.
But now I want to go a little bit deeper
as to how to combine these things.
So my guess is that you don't just write a prompt
that's like, you are an SRE.
What would you do in the situation context dump?
Like, here's a giant, here's 20 gigabytes of logs.
What is the process that you guys have actually gone through?
And the other half of this,
which I think is just a way to kind of tell the story,
is like, I'm guessing you're not, like, training new models.
You're not starting from scratch.
So like, what's that thing in the middle?
What is it that you're doing where it's not just like
a paragraph of like simple prompt
and it's not like 20 years of developing models?
Like, what's the thing in the middle that you're doing?
Like, what's the complexity
that you guys are actually building out?
Just walk us through one of the, like, you know, some of that.
Like, how would you actually go about doing it?
So it's important to start with the end in mind.
So for example, the first step in any of these agent building
workflows is to have something called a golden data set.
It has the input.
So it might be things what could users ask the agent
and it has a perfect output or a good enough output
for that use case.
So let's assume we are building an agent
that can answer health insurance questions.
So you would have some common things about, you know,
is my health insurance valid if I go outside the country?
So that would be, you know, in one of the questions
and then the answer would be whatever
the appropriate response was for that.
So you start with that even before you have any agents
of any kind.
So this is like labeled reinforcement learning type stuff,
right, where you have like...
Yeah, labels.
A data set and this is a duck, this is not a duck.
Exact.
But it's slightly more nuanced
because it's not a yes or no thing, right?
The output is actually a large paragraph in many cases.
So evaluating whether your large paragraph is equal
to my large paragraph of some other agent generated stuff
is not an easy problem,
but let's not solve that right now.
So then the first thing to do is as a human,
how would I go about answering this question?
In the production use case, if I get an alert,
what are the tools that I use,
what are the dashboards I look at to try to debug this?
Because the LLM is not magic, right?
If we don't give it the right context,
it's not going to give us anything useful.
So we start from there.
When we have the evaluation set, we figure out,
okay, these are the three or four tools I need
to get a rough understanding of the system.
Let me teach the LLM how to use that.
Let's see how the answers come back.
Now I compare this with my ideal response
and I slowly try to remove the variance over time.
And is there a name for that process
that you just described?
Is that HLRF or RLHF or one of those letters?
Is that the same thing or is that something else?
So what you said is one of the techniques.
The overall objective is called hill climbing
because you can think of it as trying to get
to a local maxima, right?
And so iterative hill climbing
is an umbrella term for this, and RLHF,
like reinforcement learning through human feedback
is a technique that you could use.
And so this is not training a new model.
This is taking an existing LLM
and then you're refining it or distilling it.
Is that another word that you guys use?
That's a different word.
You can cut that out.

So distilling is getting a, you know,
model that is very large,
like Gemini pro, blah, blah, blah,
that is, and then making something smaller.
So it's not gonna have the same quality
but might retain 90% of the quality
in half the size kind of stuff.
So we just use the context.
So what we do is in-context learning.
So what we are creating is the context
for the prompt, right?
It might be a loop of multiple prompts
to the same model with different information, right?
But the model that we use is vanilla Gemini, right?
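A sketch of what that in-context learning can look like in practice, with illustrative field names: the base model is untouched, and all of the specialization lives in the context assembled per alert.

```python
# In-context learning: no fine-tuning, just a carefully assembled prompt.
# Every field name here is an assumption for illustration.
def build_prompt(alert: dict, playbook_excerpt: str, recent_timeseries: str, recent_logs: str) -> str:
    return "\n\n".join([
        "You are assisting a site reliability engineer during an alert.",
        f"Alert: {alert['name']} firing since {alert['since']} on {alert['service']}.",
        f"Relevant playbook section:\n{playbook_excerpt}",
        f"Key time series (last 30 minutes):\n{recent_timeseries}",
        f"Filtered error logs:\n{recent_logs}",
        "Summarize the situation, list hypotheses in order of likelihood, "
        "and suggest the next diagnostic step. Do not propose any write action.",
    ])
```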
You could be laughing, but at the beginning,
the first prompt that we had,
it was like: you are an experienced Google
SRE that is doing blah, blah, blah.
Gotta start somewhere, yeah.
Wow, it's like an adventure game.
Yeah, go north, yeah.
Yeah, that's a tool call, right?
So the, and then another resource
that is very interesting
because the things that we learn
on the way is that so getting the golden data
for the starting is relatively,
it's not hard, it's just tedious
because you need people to help you, right?
It's like, hey, so we were asking colleagues
across the company, it's like, hey, so we need these,
for this seller type that we are working on,
we need the dataset to test, right?
So can you fill in this spreadsheet, right?
And people go like, I wanted to do AI, right?
People loves filling in spreadsheets, yeah.
Yes, no, no, no.
So that's for starting up.
Then I think that once you have some corpus, right,
and you have an agent that is running,
I think there are ways to get online feedback, right?
So you get feedback from the users
and then from the alerts themselves, right?
One comparison you can do,
it's like, you know,
the agent recommended you to do whatever,
draining your data center, whatever,
I don't know, for mitigation, right?
And then the actual response that was taken,
you know, after the fact, was it the same, right?
So we know, like, for example,
this is a good match there.
And when there isn't a match,
you can compare and say, like, oh,
the agent said to do something somewhere, right?
And then the person that was leading the response
did something different, right?
Okay, so let me see if I get this right.
You have the idea of sort of the training,
so not the training of the model,
but the reinforcement feedback learning,
some of those letters around,
like a historical knowledge base of,
people encounter these problems,
they took these steps,
and so we're building up like some context here.
And then I think what you're saying also
is that once the first version of this is running,
or once an early version of this is running,
it can then also see its own actions
and their impact as well.
So, like, I chose to drain the offending
or roll back the offending, you know, binary,
that was bad.
And lo and behold, the error rates
dropped back down to normal.
Is that also, does that go into another feedback loop
of some kind, or is that beyond what we're talking about?
So this is related to what Ramon was saying earlier about.
You could do this for sandbox agents,
so things like coding agents,
where you can actually test it out.
But in production, I'm not sure
if you want to let your, you know, rookie agent
drain a cell to see if that was the right hypothesis, right?
So that loop kind of breaks down
and that's why we have to do it artificially,
after the fact.
So what we do is we get the responses
at the time of the incident, right, from the agent.
And at Google, we have this culture
of writing post-mortems for large incidents.

Now, they are written in a specific format.
We have to do the hard work of converting it
from this human readable format,
which is very verbose,
to a format that we can easily compare
the agent's output, which would be, you know,
this was the exact mitigation.
So you have to take those two pieces of information
and see, did the agent get it right
or did it miss some step?
If it missed a step, did it call a tool wrong?
Did it interpret the output of a tool wrongly?
Or in fact, in some cases,
did it not have the right tool to call, right?
And there are a lot of different error scenarios here.
So in one example that we often, you know, quote,
is there were a bunch of tools
that our production agent had access to
and all of them were returning dates
in different time zones.
Or, so one of them would say the alert happened
in UTC time.
The other would say, you know, in mountain time.
And this was well known for humans who are used to these,
but it was not obvious to the agent.
So we have a specific line in there
where we say, OK, all the time zones
are in Mountain View time zone,
which is the headquarters for Google.
And we made sure we had wrappers
around each of our systems
so that they would always return things
in Mountain View time zone.
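A minimal sketch of that kind of wrapper, assuming Python and the standard zoneinfo module (Mountain View is on Pacific time, hence America/Los_Angeles; the field names are illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Normalize every timestamp a tool returns to one agreed zone before it reaches
# the agent, so it never has to reconcile UTC with local times on its own.
AGENT_TZ = ZoneInfo("America/Los_Angeles")   # Mountain View local time

def normalize(ts: datetime) -> str:
    if ts.tzinfo is None:                    # treat naive timestamps as UTC
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(AGENT_TZ).isoformat()

def wrapped_alert_tool(raw_alert: dict) -> dict:
    # The upstream tool reports in UTC; the wrapper rewrites it consistently.
    out = dict(raw_alert)
    out["fired_at"] = normalize(raw_alert["fired_at"])
    return out
```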
So there's a working theory that postmortems
are the most nutritional ingredient
to a healthy SRE culture.
And I think you've just proven that to be the case
through technology, and I really appreciate that.
I'm sure others have proven it before me.
But you did it through like an actual, like an experiment.
And I really think that's a wonderful outcome here.
It's super great training data,
or at least golden data, for two reasons.
It's not only that you have a response of, like, you know,
this is the mitigation that we took, right?
You have the trajectory of the person, right,
in the postmortems.
And you can see them in the SRE book,
for the people who are listening, right?
We have example postmortems,
and you have what is called the timeline, right?
So you have all the steps that the person took, right?
So when we are inspecting the agent output,
what we're seeing is exactly that timeline
but produced by one of the agents.
So you can see, like, so I'm calling this tool
and I got this data.
This is a red herring.
I'm calling this other tool and getting this other data.
So this is promising. Let me follow that.
Which is what you see when we are responding to an incident.
One of these larger incidents
where you have a chat for the response
and all these things, right?
You've seen a trajectory
that is like a human trajectory
for doing what the agent does, right?
That is reflected in a postmortem.
But this is fantastic because this is...
One piece of data we need is not only the action
that was taken at the end,
so the correct mitigation action is X,
like, you know, it's draining or whatever,
but the trajectory of steps, right?
That's something very important, right?
And I think that's something that's contained
in the postmortems already.
Have you ever proposed improvements to postmortems
from ad hoc...
Was it post hoc analysis of work
that you can derive from things you've seen?
Format, for example, the format.
The format of the postmortem
is something that we can open formats like, you know,
we have many postmortems in Google Docs and so on.
It's difficult to mine for reducing the ingredients, right?
And then there is all sorts of...
One thing that you discover working on this
is that the majority of the work that you are putting together
for working in AI is not in AI.
It's integrating different tools, different data access,
so accessing to different postmortems
in different formats for different teams
that they are mighty old, for example,
postmortems from last year takes a while to, you know,
like, curate and process all of these, right?
So, for example, just formatting the postmortem
in a human or a human and machine readable format.
And so they are all the same, right?
That gets you like a head.
It's fascinating. Thank you. Cool.
There's also...
We're trying to build a common language
of sorts for some of these mitigations.
So we have a taxonomy, right?
So, for example, if I ask you what are some common production
mitigations that you would take,
you would roll back a binary,
you would add resources and so on,
roll back experiments of some kinds.
Now, the thing is, we want to build a common taxonomy
of mitigations so that when the agent
says do this, all of us, regardless of what product
we are working on, know what that, you know, action is.
And it turns out, one of the problems that we've faced
is a lot of different teams have different words
for the same thing.
So in our initial version of our taxonomy,
we had escalate, which would mean that, you know,
you go to a different service
because some other service is returning you errors,
so you escalated to their on-caller.
But the incident reporting system we had at Google
used escalate for a very specific reason.
And so when we actually tried to, you know,
expand the usage of this taxonomy,
they said, okay, hey, no, this doesn't work at all
because, you know, some people thought that they need to
increase the severity of this incident.
And so we had to move to delegate.
Same with experiments.
Different teams consider experiments
to be different things.
So when we say rollback and experiment,
they might understand it differently.
But we need this common language to be there
in all of the postmortems so that the agent,
which is the same across these teams,
can improve, you know, consistently.
There is a nice publication that is called
Generic Mitigations that we made public,
like, I think it was like a few years ago, right?
It has already 20 of them, right?
So I recommend people to read that
because this is what we are referring to.
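A sketch of what a shared mitigation taxonomy can look like in code; the entries and aliases here are illustrative, not the published list:

```python
from enum import Enum

class Mitigation(Enum):
    """A shared vocabulary for mitigation actions, so postmortems and agent
    output can be compared across teams. Entries are illustrative."""
    ROLLBACK_BINARY = "rollback_binary"
    ROLLBACK_EXPERIMENT = "rollback_experiment"
    ADD_RESOURCES = "add_resources"
    DRAIN_LOCATION = "drain_location"
    THROTTLE_CLIENT = "throttle_client"
    DELEGATE = "delegate"   # renamed from "escalate" to avoid clashing with
                            # incident-severity escalation

def parse_mitigation(label: str) -> Mitigation:
    # Map a team's free-text label onto the shared taxonomy; unknown labels
    # raise ValueError, which is a useful signal that the taxonomy has a gap.
    aliases = {"rollback": Mitigation.ROLLBACK_BINARY, "escalate": Mitigation.DELEGATE}
    key = label.strip().lower()
    return aliases.get(key, None) or Mitigation(key)
```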
Yeah, actually one of the first episodes of the podcast
was with Jennifer Mace,
and she talked about exactly this.
So we can definitely cite that one
in the show notes as well.
I have a kind of a side question.
We've been talking about the bad day,
like the incident day and response
and all this kind of stuff.
Like, what about the other days?
Like, can we do something with agents
that helps my daily work when I'm not panicking
or when I'm not helping people who might be panicking?
Something like that.
What do you think?
So we are looking into two places
where the agents help because, you know,
luckily, the incident day is not every day.
So it's actually the minority of the days.
It's stressful and, you know,
it's things like, you need to, you know,
be prepared for that.
And it was as well the evident case,
for us, at least,
at the beginning.
It was like, let's solve, like, you know,
the response part.
But there are two more things
that are very interesting
that we are looking at today.
In general, one is your companion for doing things, right?
It's like in the 98% of the time
that you don't have an incident,
you still have to touch production
because you need to do your rollout.
You need to observe your performance regressions.
You need to do whatever, right?
You need to move one service from one cluster
that you have in some region to another.
Capacity planning, moving, yeah,
all kinds of stuff, yeah.
All these things, right?
The toil work.
Agents can help with that because at the end of the day,
the tools that we use when we are responding to an incident
and the tools that we would use in this case,
they are the same.
It's like, you know, you need to apply a drain.
You just call the capacity tool
or whatever kubectl you are using for the day
and it's essentially the same thing.
And the other is, you know,
the best time to mitigate an incident is zero.
So it's preventing the incidents from happening, right?
And that is also more complicated, right?
Because there are all sorts of risk prevention
that we can have, right?
So there are things that we know are bad.
Like, for example, one thing that is very common,
and we have spoken about it in the books and so on,
it's that, you know, many outages,
they are driven by a change, right?
So there are policies and there are best practices
around many different places, not only Google,
but like, you know, do changes progressively,
do changes like incrementally
by your different geographical locations
or these kinds of policies, right?
So many times these policies are not enforced,
because you need to remember to put it in your configuration,
or whatever, or your service configuration drifts over time, right?
So agents might help us with enforcing these, and with discovering these spots
where there are risks that we don't know about,
like, you know, dependencies where, you know,
a service that has five nines of SLO
is depending on, like, one service that has two,
because someone started sending some RPCs
or whatever that is, right?
So discovering these kinds of things
can help detect things before they manifest into an incident.
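A toy sketch of that kind of risk scan, with made-up services and availability targets: flag any service whose hard dependency promises a weaker SLO than the service itself.

```python
# Illustrative data: availability targets and dependency edges are invented.
SLOS = {"frontend": 0.99999, "auth": 0.9999, "legacy-quota": 0.99}
DEPENDENCIES = {"frontend": ["auth", "legacy-quota"], "auth": []}

def find_slo_risks() -> list[str]:
    risks = []
    for service, deps in DEPENDENCIES.items():
        for dep in deps:
            # A dependency with a weaker target caps what the service can promise.
            if SLOS.get(dep, 0.0) < SLOS[service]:
                risks.append(f"{service} ({SLOS[service]:.5f}) depends on "
                             f"{dep} ({SLOS[dep]:.5f}), which offers a weaker SLO")
    return risks

print(find_slo_risks())
```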
Let me give you an example of this, right?
So at Google, we build planet scale systems.
That means we have access to a lot of sophisticated techniques
for, you know, caching, admission control,
and we have all kinds of libraries and utilities
that we use internally.
They offer a lot of different control knobs
for different things because they have to handle
all kinds of cases.
And there might be cases, let's say you have a service
that you are trying to add a caching layer on top of
or admission control.
Now, with admission control, there are many ways to do it, right?
There is FIFO, first in, first out,
LIFO, last in, first out, and other variants thereof.
If you decide one of these strategies,
and if, at this point, the QPS that you are getting
from your customers is something that would break
your SLOs because of this service,
that is something that the agent could flag
because it has access to the dashboards, right?
It knows what QPS you are seeing.
It has access to other dashboards of, you know,
the other teams around you,
and it knows what their admission control looks like.
So even before you make that change,
so when you send that change out for review,
it can step in and say, hey, look,
you've configured the admission control in this way,
but I don't think that's the right idea based on what I see.
So that's one example of how it can step in
before you do something that will cause an outage.
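Purely as an illustration of that pre-review check, with invented config fields and thresholds: compare the proposed admission-control settings against what the dashboards say the service actually receives today.

```python
# Illustrative pre-review check: flag a risky admission-control change before
# it ships, based on observed traffic. Field names and rules are assumptions.
def review_admission_control_change(proposed_config: dict, observed_peak_qps: float) -> list[str]:
    findings = []
    max_inflight = proposed_config.get("max_inflight_requests", 0)
    strategy = proposed_config.get("strategy", "fifo")
    if max_inflight < observed_peak_qps:
        findings.append(
            f"max_inflight_requests={max_inflight} is below the observed peak of "
            f"{observed_peak_qps:.0f} QPS; this change would shed load you currently serve."
        )
    if strategy == "lifo" and proposed_config.get("client_retries", True):
        findings.append("LIFO admission with client retries can starve old requests; "
                        "confirm your SLO tolerates that.")
    return findings

# Example: flagged at review time, not after the outage.
print(review_admission_control_change({"max_inflight_requests": 800, "strategy": "lifo"}, 1200.0))
```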
Cool.
Well, I'm afraid we're running out of time,
so I have one, like, this is called the lightning round.
So each of you gets to answer two questions,
but you have to just answer, you know, just those two questions.
So we're gonna say, like, let's see,
number one is gonna be,
what do you think we should not be using
LLMs and AI for in production today?
Like what's a bad idea?
And then number two is, what do you wanna work on next?
Like, what's the next big thing that, like,
let's say you're just done, it's a year from now or something,
like, what's the next big thing you wanna take a bite out of?
OK, let's start.
One thing I would not use LLMs for
is when you can use a regular expression for things.
Totally.
Solved problem.
Yes, so, you know, I want to extract,
I have seen it, I want to extract this integer
from all these texts and so on, right?
Let me prompt Gemini for getting that integer, right?
It's like, no need for that, that's fair.
I mean, this is a joke, but there are
some cases where the classic statistical methods,
or, like, various small specialist models
for doing certain things, like anomaly detection
over time series, right?
That's a much better suited model there:
faster, cheaper, more reliable, right?
So, I think we need to balance using LLMs
for what they are good for, and algorithms
for what they are better suited for.
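For example, the regex version of "extract this integer" is a few lines of standard library code (the log line here is made up):

```python
import re

log_line = "Request finished with status=500 after 1342 ms"

# A solved problem: pull the integer out with a regular expression...
match = re.search(r"after (\d+) ms", log_line)
latency_ms = int(match.group(1)) if match else None   # -> 1342

# ...rather than prompting a model for it, which is slower, costlier, and
# not guaranteed to return just the number.
```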
So, you're saying math still has value, even today?
Yes, math is still not deprecated.
Like, actual operators, got it.
I'm really happy to hear that.
Thank you.
Yes, what I would like to work on next, I think, is the...
so, how I see the autonomy of these agents
is something that is incremental.
So, we're going to see like levels
of autonomy going on, right?
So, I think we are at the beginning of this,
because all the agents that we have
and the agents that I have seen around, right?
They always require a human verification
before going and executing some actions, right?
So, I think one thing that I want to explore
is, like, what the conditions could be,
or what we should be having in production,
for leaving these agents to do things automatically, right?
So, how we could, for example, recover
from an agent breaking some stuff,
or have the agent recover itself, and these kinds of things.
I think that's going to be like a very large challenge,
particularly in the production space.
Yeah.
So, what do you think?
So, I think one thing I would not like to see LLMs
used for is insulating humans from challenging situations.
And so, I'll give you the example of production outages, right?
We are building agents that help out,
but we don't want it to work in a way
where humans don't know what is actually going on,
because as engineers, that's how we learn over time, right?
Our designs get better
because we've gone through these outages.
So, we don't want it to be the case
that it's completely offloaded to an agent
and we have no idea that an outage even happened
because the agent took care of it for us.
Because in that way, we'll never learn, right?
So, the reinforcement loop is broken for the humans themselves.
So, that's one thing I wouldn't like to see LLMs being used for.
I'll throw out there,
there's a famous paper called The Ironies of Automation,
where the more you automate a thing, the worse you get at it...
Like, you have to depend on that automation,
so when it doesn't work, then you're in trouble.
Yeah, so, we don't want to break that loop.
I'm totally with you.
All right, what's next?
I think, so, we are still...
There are a lot of problems to solve
in building production-scale agents out there.
But one of the things I'm often interested in
is interaction models.
Right now, we are talking about embedding AI
in all kinds of things.
But if you look at the interaction models,
they are still very limited, right?
It's mostly a chatbot, which comes in many forms, right?
Or, you click somewhere and you get some output
and you can't answer back to the agent.
But I suspect, as we get more familiar with these systems,
there might be a lot of other potential use cases
that emerge just because we've changed the interaction models.
And so, it's interesting to see where we could take this,
because at this point, I'm not an LLM expert.
Right?
So, I have very little control
of how the models emerge over time.
But I think the models are good enough.
And the thing we are lacking
is improving the communication between humans and LLMs,
which is where the interaction model comes in.
Nice.
All right, well, that was a very great discussion.
Thank you so much.
We could definitely keep going,
I have a feeling, on and on about this kind of stuff.
Matt, any closing thoughts?
The agents are the new software.
I had no idea.
They're going to do a composite thing.
I had no idea this is what we're going to be doing.
But we're going to be doing them.
It's great, man.
I was saying today, one thing is that AI
is something that computers are not good at yet.
But then when they get good at it,
we call it software, right?
So, I think in a few years,
we'll call it lovely software and we are done, right?
Right.
Just another flavor of software.
All right, well, thanks very much, guys.
It's been a blast so long.
You've been listening to the Prodcast,
Google's podcast on site reliability engineering.
Visit us on the web at sre.google
where you can find papers, workshops,
videos and more about SRE.
The podcast is hosted by Steve McGhee
with contributions from Jordan Greenberg,
Florian Rathgeber, and Matt Siegler.
The podcast is produced by Paul Guglielmino,
Sunny Hsiao, and Salim Virji.
The podcast theme is Telebot by Javi Beltran.
Special thanks to MP English and Jen Petoff.
