Episode 39: SRE at Google [EN]


In this episode, Luca talks to Adrian Ratnapala, who is an SRE at Google. They explore how SRE views working on infrastructure, automation vs. manual work and "toil", and what it means to view infrastructure as a software product.

In this episode, host Luca Ingianni interviews Adrian Ratnapala, a Site Reliability Engineer at Google, discussing the intricacies of SRE, its overlap with DevOps, and the challenges of maintaining large-scale systems. They delve into the nature of "toil", the balance between automation and human intervention, the concept of error budgets and SLOs (Service Level Objectives), and the unique aspects of Google's approach to system reliability and efficiency. The conversation highlights the importance of effective communication within teams, the evolving nature of monitoring and alert systems, and the role of SREs in ensuring system reliability while minimizing direct human involvement.

Contents

  • Introduction of Host and Guest
  • What is Site Reliability Engineering (SRE) at Google?
  • Differences and similarities between SRE and DevOps
  • The concept of ‚toil‘ in SRE and methods to reduce it
  • The role of automation in SRE and its implications
  • Monitoring systems: Balancing white box and black box approaches
  • Service Level Objectives (SLOs) and Error Budgets: Strategies and Management
  • The impact of SRE on user experience and responsiveness of applications
  • Communication flow in addressing bugs and system issues
  • The human aspect in technology and system reliability

Show notes

Wikipedia page on SRE
The Google SRE Book

Transcript (automatically generated; anyone who finds errors may keep them)

Welcome to a new episode of DevOps auf die Ohren und ins Hirn, or in English, DevOps from
your ear straight to your brain.
Usually, this is a German-language podcast, but sometimes we make exceptions for international
guests.
My name is Luca Ingianni.
I’m a DevOps consultant, trainer, and coach, trying to get teams to work together better
and create better products for their customers.
And I host this podcast together with my colleague Dirk Söllner, who, however, is on holiday
at the moment.
Today, I have the pleasure to introduce my very old friend and exceptional engineer,
Adrian Ratnapala.
Adrian works as a site reliability engineer at Google in Sydney, Australia.
Since I have only a vague idea of what site reliability engineering is, I figured it would
be a good idea to invite him and ask.
Hi, Adrian.
Thanks a lot for agreeing to be on this show.
Hi, Luca.
Would you like to introduce the listeners to yourself?
Yeah.
So, as you said, I’m a site reliability engineer at Google.
I have been some sort of engineer, well, I guess I’ve been some sort of computer engineer
for perhaps six or seven years, whenever it was that I moved to Munich first, where
I did various things in the contracting industry there, but I am not going to inflict my German
on the audience.
So I'm afraid I'm not going to speak German; I'm afraid we're going to speak English.
However, as time moved on, I decided to move back to where my family and other people were.
And that led to an opportunity to join Google as an engineer.
And it turned out that I joined as a site reliability engineer, which I was also not very clear about, what that meant, until I joined.
And maybe that's a bit of a theme, because I think that we can describe what it is, but the realities depend on where you've been.
DevOps and SRE and all of these things, I think, are very context dependent.
Okay, yes, thank you very much.
Let’s start with the first question
which we ask all of our
guests, which is, what is your
definition of DevOps, given that there is
no common definition, what’s
yours?
I guess I don’t have one because
I
think
in terms of SRE
and I’m aware of people
doing DevOps and it
often sounds very
similar to what we do.
And I
am
willing to
go with the definition that I find in
or at least the description that I find in
Google’s book about SRE, which is
that SRE
is essentially a subset
of DevOps, that it’s a
more specific term
because
at Google
we invented this concept of SRE
many years ago and
DevOps wasn’t
as famous as it is now or maybe it wasn’t
a term that was in use
and
in time we found that SRE is
part of that.
But I believe that there’s probably
different people who mean different things.
So I think
sometimes people talk about
DevOps being about
unifying development
and operations.
And SRE is specifically about
having operations specialists.
So those are not really compatible.
But other people seem to
use DevOps in other ways and I won’t
try and second guess that.
Okay,
so you said
SRE is about having
administration, sysadmin
specialists.
What does that mean? How do you
work? Because the idea
in DevOps is to
have
cross-functional teams, you know,
where you have people of different specialties,
maybe software developers,
maybe sysadmins, etc., etc.,
work together towards a common
goal of, you know, delivering
functionality to the customer.
Is that
not what SRE is aiming
for? Or what is SRE aiming for?
SRE,
I guess one way of putting it
is that
we’d be very leery of that
term sysadmin.
Because a sysadmin
brings to mind the idea
of someone who
knows a lot about Unix
incanting commands into
their, you know,
big server in the back end,
which was, you know, the way things
might have happened in the 1990s or whatever.
The
concept of
SRE is that
that model doesn’t scale
to
a company like Google or increasingly
to other companies.
And so you
need to ease that
burden of having to do actual sysadmin
work by treating
operations as an engineering
discipline. So that you
can actually build out things that
do things at scale.
We can drill into the details of what that actually
means at the time. But the idea is
you do SRE so that you
engineer away the need to do sysadmin.
And
the kinds of engineers who specialize
in that are SREs.
And it doesn’t mean
that developers don’t also do that, but
if we’re talking specifically about SREs,
then they are specialists in operations.
Okay, so now you’ve explained
to us what SREs don’t
do, but what is it that they actually do?
Right.
I was afraid you were going to say that
I’d explain to you what SREs did, and I
didn’t think I had.
I did say that we
do
engineering to maintain
operations.
Which is
quite a big, vague thing, but
anything that will be
useful to keeping a service
running with
a minimum
of what we call toil
is valid engineering work.
So that might mean working
on the actual software, the actual
server that we’re standing up and you’re fixing
bugs or something like that.
But it might also mean working
on automation for rollouts or
working on the
monitoring and all
of these things.
Okay, so
you just touched on this concept of toil.
Can you explain that a bit more for the benefit
of the listeners?
Yeah, toil is
any
kind of
work that is
well, one way of thinking
of it is that it’s
generated by
an event that will recur.
So if
one kind of toil
that we would definitely want to eliminate would be
if a machine goes down, then
you would have to restart the server on
another machine. And we
don’t have to do that because we have
there are long-standing
automation systems of
automation that prevent that. But that would be a
very clear and simple version
of toil.
Anything where
someone has to do
some work in order to keep the system working
but that doesn’t overall
improve the system for the
future
would be toil.
So any kind of like running
in place, if you will?
Yeah, you could call it that.
Okay. And if I understand you correctly,
the way you deal with toil
is that you try
to eliminate it by
automating it away.
Yes, I’m hesitant to say
it’s necessarily automation.
Although I guess it depends
on how broadly you mean that term.
I mean, if you have some kind of toil that is
created by
a bug in your server, then you want to fix
the bug.
Okay, that’s interesting. So
toil is like any kind of manual
work and of course
removing the root cause for the manual work in the
first place would also be a valid
way to remove
toil and arguably actually a better
way, I suppose. Yeah, it’s
better if you can simplify a system
than to complicate it.
Automation is an interesting thing
because it both simplifies and complicates.
Okay, and how does Google
deal with that sort of thing?
How do you make these kinds of trade-offs,
those kinds of decisions?
How do you decide what to
automate or what to maybe
eliminate, which means
going further upstream and
probably changing the actual product
in some way?
I guess it mostly depends on opportunity.
It’s not
always possible to fix a bug. It’s not always
caused by a bug or something like that.
I think that that’s a very case-by-case
thing. If you’re facing
a problem,
which is usually a source of toil,
then you have to,
consider what the different
solutions are and what
solutions are available to you.
So, if it is that you
could fix a bug or
remove a component entirely, then
sure, you do that, but in the real world
it usually is rather that
you need to tweak some
configuration and it’s
more common that lines of code get added than removed.
That’s just the way of the world.
Yeah, fair enough.
Let’s stay with
those, with automation,
for a second, because I was
wondering, what are the trade-offs
of automation? You just
touched on one where you said,
you know, if I automate something
that means, you know, I’ve just made the system
more complex. Are there other things
that come to mind? Other things that you’ve observed
over the course of your work?
I mean, it makes things complex in one sense.
If I, say, write a script
in order to run
some other program repeatedly,
then my script is new code
and a new system of failure.
But it simplifies in another way
because now maybe that script is, you just
run it and you don’t have to worry about all of the different
parameters of the thing that it’s running
because that’s now automated.
I think that’s a very familiar thing to anyone
in the field. I don’t think
that’s different in Google versus anywhere else.
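
To make that trade-off concrete, here is a minimal sketch of the kind of wrapper script being described: it hides the parameters of an underlying tool behind one command, which removes toil but is itself new code that can fail. The tool name and flags are invented for illustration, not a real Google utility.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper around a rollout tool.

Everything here (the tool name, the flags) is invented for illustration:
one small script hides the parameters people used to type by hand, which
removes toil but is itself new code that can fail.
"""
import subprocess
import sys

# The flags operators previously had to remember and type correctly.
DEFAULT_ARGS = [
    "--region=europe-west1",  # assumed deployment region
    "--retries=3",
    "--timeout=30s",
]

def run_rollout(version: str) -> int:
    """Run the (hypothetical) rollout tool once with the standard flags."""
    cmd = ["rollout-tool", f"--version={version}", *DEFAULT_ARGS]
    print("running:", " ".join(cmd))
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: rollout.py <version>")
    sys.exit(run_rollout(sys.argv[1]))
```

The script is a convenience for whoever runs it day to day, but it is also exactly the kind of abstraction layer discussed next: people learn the wrapper, not the tool underneath it.
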
And I think another thing that’s
universal is that over
time you do get more automation.
So that trade-off always
at some rate moves in the direction of
having more
things that
live at higher levels and take care
of things that are at lower levels.
The trade-offs are that overall complexity
and I think another important trade-off is
the
human knowledge.
Because if you, especially
if a new engineer comes in,
then they will day-to-day deal
with things at the level
of whatever automation is used day-to-day.
And they’ll learn things
at that level, which
could be a problem
if, to fix things, to do their job, they need to understand things
down at the lower level that has been
partially automated away.
So I think the best automation
is one that provides a non-leaky
abstraction. And
that’s not easy to
achieve. So
one of the trade-offs is
how much leakiness
should you tolerate
in order to get the benefits of not
having to do the toil.
Yeah, I understand.
And I suppose that’s particularly
important at a place
with just that much infrastructure
as Google has, where you
must abstract things away
or you’ll just drown in mountains
of physical servers or whatever,
don’t you? So how
do you think, is
Google in any way special
compared to other
organizations, even other large
corporations? Or do you think
the same problems would exist elsewhere, just
at a smaller scale?
I can’t speculate about
all organizations, especially large organizations
which probably do have many similar
problems and solutions to us.
One thing that is different
between certainly Google and
some other organizations
at a smaller scale, at a much smaller scale,
would be that
doing things manually would
almost never mean SSH-ing
into some server and
incanting Unix commands, as I talked
about before, on that server. Because
the most obvious way in which
Google has scaled up and the most
massive way in which Google has scaled up is the number
of machines. So you simply
can’t do that.
There are ways you can if you
need to in order to fix something, but
I’ve never had to do that.
Whereas there might be
an acceptable way of doing things if you’re
small enough, that you
only have a few servers.
And it has the benefit that the people who go and do it
know how those machines are set up.
It just scales really badly,
which might be okay if you’re small enough.
Yeah. So, I mean,
that’s always this thing, that
the nature of your
scalability solutions, the automation
or whatever else you do in order to be scalable
needs to be some sort of match to your actual
scale. And I guess what I’m
saying is that one of the
reasons that a small organization
might not want to use a large-scale
solution is just that knowledge problem
that if the manual work, if the
manual toil isn’t so big, because you’re small,
you have a small number of servers or whatever,
then you have a chance of getting
your hands dirty and learning.
Your engineers have a chance of learning
exactly how things work.
We do that same thing, but at a
higher level, not down at the level of
what’s going on on the servers.
Okay. However,
we talked about maybe a better way
of eliminating toil would be to
make the actual
problem go away. For example,
fixing a bug. How would
that work
in the environment you work in?
If you observe something,
there’s toil because of
a bug that you encountered
in your production systems. How would
you go about getting that bug
out of the system? How does the communication
flow between you and
any colleagues that need to be involved?
Right.
Let’s consider a scenario
where you’re an
SRE and you
get paged because
of some high
error rate for some service, or whatever.
And you go and
you deal with this
alert by
I don’t know, maybe you turn
up some more capacity because
you found that something
was running out of memory or whatever. You just
do a quick fix to make it work. Whatever
you do, you’ve made the problem go
away, or maybe you
drain traffic from one place to another so that it goes away for now.
And further investigations
ensue. If for whatever,
if however we did it, we
found that there was, or we
suspected that there was a bug, then
by that time
we’ll be talking to
the original, to the developers.
But, you know, it could be
something where the SRE
themselves can see what the bug
is. If servers are crashing and you
have a core dump, then you can see
the line of code where it
crashed, and you might see there’s an obvious
failure to check a null pointer
or something like that. In which case
the best thing to do to fix that
is to just go to the line of code, fix the null pointer check, and suggest that this is a fix
to the bug, and you go through the process
of verifying that it really is a fix.
But that might be all there is.
But more usually
it’s something subtler.
Actually, often it’s something
subtler, and in which case
it’s generally a question of communicating
with the people who wrote the software in the first place,
which is the dev team. And
although SREs are not software developers,
we are
in close contact
with the developers of the software,
you know, it’s not like a, I suppose
if you’re in an open source world, you might have
an open source package that then you
use, and then you’re relatively
disconnected from the developers.
But we’re talking about things going on within the
company, so even if you are
separate from the developers, there’s
two-way communication, and you’re part of that
cycle of iteration.
I see, okay, so there’s, is there
even like a regular
meeting or something where you talk to your
developers, or how does that work?
How well do you know these people?
How easily can you talk
to them if you feel that you need to?
Very easy, I mean, for me, the
biggest hurdle is that
I am in Sydney and no developers
for any of the things that I work at are in Sydney.
There are regular meetings, so I'll give a little bit of structure.
What I'm talking about now is not how things are done at Google in general, because things will vary between teams and with the numbers of people.
But I have worked on things where
my team was responsible
for various different services, and then
there were particular SREs who had particular
responsibilities for one or two services.
So I would go to a meeting
with the developers of that service that
I was particularly responsible for every
week, and there we
would usually discuss all of the
outages for that service, as well
as our plans for, you know, the future.
In terms of operations, because that's a sync that the developers have specifically for the two SREs. We are very much on the same page with regard to where the service is going, at least in that area of interface where operations and development are close.
Developers have other concerns, long-term
feature work and stuff like that, that doesn’t
necessarily concern us,
although it might, but insofar
as anything does concern us, then we have
very tight loops of communication.
Okay, and
one other thing that I was wondering about
is that if you find
some source of toil,
who is actually responsible
for dealing
with that? Is that
you, is that the dev team, especially
if it’s something that has,
you know, that originates somewhere in the code,
you know, a bug, as we said before?
Yeah, I mean, it depends
on what the solution
is and what the next step of work is, but
if it originates in the code,
it’s generally something that
the devs will have to deal with.
I mean, even if it’s something
that, where I can go and fix the bug,
then I will, but the reviewer of that bug fix will be one of the devs of the server, almost always, because it's just logical, right? If you change someone's code, then the developers are the people who will review the change.
Yeah, I was wondering about that.
How accepting are they of fixes that come from further downstream, from the SRE team, for instance?
Oh, that's fine. Once a change is there, what the reviewer has is a change. It's not like you think, oh, it's a change from an SRE or it's a change from a dev. It's just a change that can be reviewed on its merits.
So far,
we’ve been talking about issues that
come out of the system, like bug
reports, toil, whatever.
What I’m wondering about is how close are
you to your
users, your customers? Because I imagine
that the way you do your work
greatly influences the experience
that users have, especially
in terms of responsiveness
of applications, that sort of
thing. How closely
are you
embedded into, or how
close to those conversations
with customers are you, and do you
feel it’s working well enough?
Again, I can't speak for all SREs, but all of the services that I have worked on have been backends of something. And that might mean that it's a backend of a backend of a backend, for all I know, because all the services I've worked on also have many internal clients, which will relate to the end users in whatever way.
So, our team, our SRE team, or my SRE teams, have actually very little direct visibility of the end user. Sometimes we can infer what an end user might be seeing if we look at a request, at the actual content of a request, which we rarely do, and it's better not to. But
we’re not
directly
facing a user, we’re
facing internal clients.
So, our focus
is on understanding their needs as well
as possible, and
transitively that gets us
to the user. But for
us, we usually infer,
you know, if
an RPC is taking a minute to complete, a request is taking a minute to complete, then
possibly that
means that a user is having a slow
end user experience. It doesn’t
necessarily mean that.
So, we were talking about
feedback, and I’m
wondering
what kind of things you
measure, how is
monitoring implemented
at Google? I don’t
mean particularly the
mundane, boring stuff, but on a
higher level, what’s interesting about
the way you view this data
that’s coming out of the system?
Right. So, I mean,
what gets monitored? Enormous
amount of things get monitored.
It partly depends on
just on a technical level,
what you really mean by monitoring, because
there are thousands of metrics that
every server will collect, not just
at Google, of
which a certain amount get collected
by the monitoring
services, and
different things alert at different
levels. But
in general, for
any service, you’ll have a dashboard
with all kinds of different measurements,
which
will vary
depending on the service. So, I’m not going to talk too much
about that, except that there’ll be obvious boring
stuff, as you said, such as the rate
of requests and the number of errors.
When you tie that together, I guess a more concerning question than what you monitor, which ideally is everything, or at least is a lot, is what you alert on: what will get a human's attention, and what you might look at at that high level. So,
there, there’s a kind of
tension between a white box
approach and a black box approach.
And the movement is
towards the black box approach. I'll describe them both a little bit. You might have a system
where
requests come from a client,
effectively an end user.
We usually don’t talk to the users directly.
And
that then gets put on a
PubSub,
a Publisher Subscriber
queue, just for example; there are many things it could be. It stays in the queue and goes off to some other system and gets processed there and finally lands in a database. So now,
you probably do
monitor every part of that. You monitor
the front-end server, you monitor the Publisher Subscriber queue,
you monitor the database that
holds that queue, you monitor the
system that reads from the queue,
you monitor the database that that system
writes to, all of those things.
And at any
point, something could be going wrong
and you might have metrics
for when to alert somebody about it. And that
would be white box monitoring.
And then the black box says,
what does the end user care about?
So, an error from the front-end server would be an obvious case of what a user might get, if there are enough errors that it's not just a blip.
Another thing might be, in a white box view, that the Publisher Subscriber queue was broken. But in a black box situation, it's that the end user wrote some data but then failed to read it back after some time in which they would have expected it to become consistent.
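
As a rough illustration of that black box view, a prober can write a record through the same path a user would and check that it can be read back within the consistency window, alerting only on that end-to-end behaviour. The endpoints, the window, and the use of the requests library below are assumptions for the sketch, not Google's actual monitoring.

```python
"""Sketch of a black-box prober: exercise the system the way a user would.

The endpoint, the consistency window, and the use of the requests library
are all assumptions made for this illustration.
"""
import time
import uuid

import requests  # third-party HTTP client, used here only for brevity

BASE_URL = "https://example.invalid/api"  # hypothetical service endpoint
CONSISTENCY_WINDOW_S = 30                 # how long a write may take to appear

def probe_once() -> bool:
    """Write a unique record, wait, and check that it can be read back."""
    key = f"probe-{uuid.uuid4()}"
    write = requests.post(f"{BASE_URL}/records", json={"key": key, "value": "ping"})
    if write.status_code != 200:
        return False                      # user-visible write failure
    time.sleep(CONSISTENCY_WINDOW_S)
    read = requests.get(f"{BASE_URL}/records/{key}")
    return read.status_code == 200        # did the data become readable?

if __name__ == "__main__":
    if not probe_once():
        # In a real setup this would feed an alerting pipeline, not a print.
        print("ALERT: black-box probe failed (write/read-back is broken)")
```
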
So,
in essence, you’re not concerned with
the technical details, you’re just concerned with
the effect that this
defect or whatever has
on the user,
the consumer of this product of yours.
Yeah, so black box monitoring, I guess, is what it sounds like, which is that
it measures things from
the outside. And yes,
it’s usually done as
close as possible with reference
to what we think clients care about.
So, if you define the service level objective, then you want to be measuring how well you're doing against that. So, by service level objective, I would mean not just measuring the error rate, but saying, you know, we are serving at 99% availability over a few minutes.
And then
you measure whether you’ve met that SLO.
And you can define other SLOs such as
freshness or whatever it is.
Something that you think that a
client will be able to see.
And you alert on these things. And the
idea is that that way you’re not
creating unnecessary toil,
because you don’t alert on things the customer
doesn’t mind, doesn’t care about.
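
A minimal sketch of what measuring such an availability SLO can boil down to: compare the fraction of good requests in a window against the target. The counter values and the 99% target here are illustrative assumptions, not real figures.

```python
"""Sketch: check an availability SLO from good/total request counts.

The counters would normally come from the monitoring system; the numbers
and the 99% target below are illustrative assumptions.
"""

SLO_TARGET = 0.99  # promised availability for the window

def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic: trivially within SLO
    return good_requests / total_requests

if __name__ == "__main__":
    measured = availability(good_requests=987_650, total_requests=1_000_000)
    print(f"measured availability: {measured:.4%}")
    if measured < SLO_TARGET:
        print("ALERT: availability SLO violated for this window")
```
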
The trade-off there is that if you get some sort of alert saying that the black box is broken, then, you know, it just tells you that it's broken. Whereas if you had a white box alert, it tells you what is broken, which is helpful.
But on the other
hand, I suppose it’s noisier and perhaps
less helpful as far as
the customer is concerned.
Yeah, I mean, this is why
the movement is to
reduce the amount of white box monitoring and increase the amount of black box monitoring.
Because as systems get larger and more
complicated, there’d be noise
if you do too much.
Yeah, so yet again you’re just sort of
moving up
the layers of abstraction, aren’t you?
Yeah, I think so.
I think, I mean,
the thing is, we do have a lot of white box monitoring and there's nothing wrong with that. It's just that
as time goes on, the trend
will be the other way.
Interesting. Yeah, and I think
it makes sense to view it from
that perspective. Speaking
of SLOs and
things like that, I read
about something that I found interesting, which is
error budgets.
Can you speak to that for a little bit?
Yes.
There are multiple different kinds of budgets, such as
an error budget or a latency budget.
I guess that’s what it sounds like. If you’ve
promised to have 99.9%
availability over
a month, then that means there's a certain number of errors that you can afford to observe. And you use that budget to find out if you have a problem.
So, if you suddenly
have a serious outage, then you might burn
throughout your entire month’s error budget
in a few minutes, if that error budget was small enough.
And that’s one way
you can make the argument, that was
a serious problem, whatever it was, is something
that needs to be prioritized over some other
thing, which might also have costs or merits, but wasn't as far out of expectation.
I’m sure there’s more
detail that people could go
into, but I think that it would
be, that these things become clearer when
you’re actually doing it. Yeah, it’s interesting
if you read the, if you read Google’s
SRE book, they say that
there is an error budget and you
ought to spend it.
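
To put rough numbers on such a budget: a 99.9% availability target over a 30-day month allows about 43 minutes of full downtime, so a single serious outage really can burn most of a month's budget. A small worked sketch, with illustrative numbers only:

```python
"""Worked example: the error budget implied by a 99.9% monthly availability SLO."""

SLO = 0.999                       # promised availability
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

downtime_budget_min = (1 - SLO) * MINUTES_PER_MONTH
print(f"allowed downtime per month: {downtime_budget_min:.1f} minutes")  # about 43.2

# A full outage burns one budget-minute per elapsed minute, so a 45-minute
# total outage would consume more than the whole month's budget.
outage_minutes = 45
print(f"budget left after a {outage_minutes}-minute outage: "
      f"{downtime_budget_min - outage_minutes:.1f} minutes")
```
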
Yes, okay.
If you’re required to have a
99.9% availability,
it’s actually not a good
idea to get your availability up to
99.99%, because that
just means you’ve been overly cautious.
It means you’ve been overly
cautious and it means that you,
um, make people
dependent. People will not be
dependent on your published SLO, they will be dependent
on what you actually deliver,
which is frequently above what
you claim you deliver. So,
this is a common problem that you have at Google,
that I think is particularly
maddening with latency, because
you will have a system
where your database
normally responds in a millisecond.
And your system is basically limited by
the latency of the database.
But sometimes it will respond
much more slowly. And
so you know this, and you
publish an
SLO saying that we will
respond in 500
milliseconds at 99% of the time.
But actually you’re responding
in 5 milliseconds 99%
of the time. And
people will get used to that.
And when you start
responding in 100 milliseconds,
you will get people coming to you
saying, why is the system broken?
And you can’t blame them for that.
Well,
it’s just the reality of life.
So…
You’ve created an implicit contract
and you violated it.
That’s right. And so it’s
good to
align your SLOs
with reality. So
one way of putting it is, if you have an error budget, you can spend it.
But probably
the more common way
or the
better way is to
make sure that
you’ve got the correct
SLO. So in fact,
this is a project that I did last quarter,
if I remember correctly, which
was starting with actually measuring what our
performance was
over various clients and
over various variables.
And using that to write
a new definition of our
SLOs. And it’s
surprisingly non-trivial work, because first you have to
measure it, then you have to use that
to consolidate all of that information
to define what you
think your SLOs should be.
And then you actually have to measure those things, which
in this case required measuring
new things that we weren’t measuring in the past
in order to make that
alignment happen.
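
A sketch of the first step of that kind of SLO-alignment project: take observed latencies (simulated here) and look at the high percentiles to see what the service can honestly promise. The sample data and the headroom factor are invented for illustration.

```python
"""Sketch: derive a candidate latency SLO from measured samples.

The sample data is simulated; in practice it would come from monitoring
across many clients and request types, and the 1.5x headroom factor is
just an assumption for the example.
"""
import random
import statistics

random.seed(0)
# Mostly fast responses with an occasional slow tail, in milliseconds.
samples_ms = ([random.gauss(5, 1) for _ in range(990)]
              + [random.uniform(100, 800) for _ in range(10)])

cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]
print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")

# A candidate SLO might sit a little above the observed p99 rather than at a
# number (like 500 ms) that has nothing to do with what clients actually see.
candidate_slo_ms = round(p99 * 1.5, -1)
print(f"candidate 99th-percentile latency SLO: {candidate_slo_ms:.0f} ms")
```
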
Interesting.
Yeah, I was
going to ask how
else do you
observe that
SLOs and error
budgets are sort of actively managed?
Like, if I’m taking
your example from before to an extreme and
you say that, yes, your database
will deliver, will
respond to queries within
500 milliseconds, should you just always
respond within 500 milliseconds
just to give nobody false ideas?
I don’t know.
I’ve been
tempted to do that.
There’s a little bit of that.
The SRE book, in fact, I think,
contains a famous case of
our lock server, which
has
deliberate outages in order to keep
people honest.
But
more often
you don’t do that.
You don’t want to introduce unreliability
to your system just because
you can. But,
for example,
with a
system like this database, one
thing that might be possible is to just engineer the problem away. So you have
a database that normally responds in a millisecond but might occasionally respond much more slowly.
So maybe it might be a solution to be talking to two instances of the database, independent ones,
and then you can guarantee that you have a very good chance that at least one of them will respond within 20 milliseconds.
So now you can offer a more reasonable SLO than this enormous 500 millisecond one.
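
A rough sketch of that "talk to two independent instances" idea, sometimes called hedged or redundant requests. The `query_replica` function and the replica names are placeholders, not a real client API; the 20 ms deadline mirrors the example above.

```python
"""Sketch of hedged reads: ask two independent replicas, take the first answer.

`query_replica` and the replica names are placeholders for whatever real
database client is in use.
"""
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def query_replica(replica: str, key: str) -> str:
    """Placeholder for a real read against one database replica."""
    raise NotImplementedError("wire this up to the actual client library")

def hedged_read(key: str, replicas=("db-a", "db-b"), timeout_s: float = 0.020) -> str:
    """Issue the same read to both replicas and return whichever finishes first."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(query_replica, r, key) for r in replicas]
        done, _ = wait(futures, timeout=timeout_s, return_when=FIRST_COMPLETED)
        if not done:
            raise TimeoutError(f"no replica answered within {timeout_s * 1000:.0f} ms")
        # Note: leaving the 'with' block still waits for the slower call to
        # finish; a production client would cancel or ignore it instead.
        return next(iter(done)).result()
```

The design trade-off is the one discussed in the episode: you spend extra capacity (two reads instead of one) to buy a tighter, more honest latency promise.
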
So there's, as always, there's different solutions depending on what the actual problem is and what opportunities arise.
But yes, sometimes the solution actually is to deliberately not offer the performance that could be offered, but only unreliably.
Okay. As sort of the last topic I wanted to explore with you today, I would like to talk a bit more about the people who are part of those systems at Google.
Because any sufficiently non-trivial system will always have humans as one component of it.
And especially at Google scale, where you have a lot of technology, how do you balance that against the humans who must essentially, necessarily, be part of that system?
Are you talking about how humans learn about the system?
Maybe. You know, sometimes humans are the source of errors, sometimes humans are the corrective element.
Mhm.
And I'm just wondering how something that is that technology dependent, just because of scale, such as Google, deals with the fact that humans will still be a part of that. And you as SREs are obviously the sort of most crucial part of that system that delivers, I don't know, whatever service.
One thing I'd like to emphasize is that SREs and developers both run services. It's a front line situation where you're serving things for the user. It's not just SREs. It's SREs and developers both.
I guess the relationships between humans and the machine, or the machines, are different. There are many different dimensions we could talk about.
So we've talked about automation and how that creates a stack of different things to learn, which people will mostly learn the top layers of.
And we've also talked about monitoring and people being alerted, so there's almost a feedback loop with people as part of the system.
Do you know which angles you are most interested in talking about?
No, not really. I was just exploring this just because it's such a crucial aspect that, you know, every non-trivial system contains humans as one or more of its elements, from aircraft to cars to web servers.
Okay. I guess this is the Munich coming back, because that's a lot easier to understand in terms of aircraft and cars, I think.
But it's true that an aircraft or a car has an operator who is a driver or whatever, a pilot. And I suppose in that sense we are operators, but we are not drivers or pilots of the system, right?
So the person who's driving your Google experience is really the user. But we are there as part of the system that keeps it reliable.
And ideally, I guess, we, I haven't thought about it in this way, so I'm making things up a little bit, but I think that aspect of our role is huge, but it's also the very thing we want to minimize, because it is by definition toil.
Because it's what happens as part of the system being reliable. If the human being is part of the system being reliable, then what you would have preferred would be that the human being, the engineer, had previously engineered something so that the system was just reliable on its own, without human intervention.
So we are part of that system because computers ultimately are only computers and they can't take care of themselves, and they do need people to polish their gears and stuff.
And another aspect of that, though, is that we actually do want some toil. We want to be getting paged every so often. Partly so that we get practice at dealing with debugging our systems. But also so that that knowledge that we gain can go back to our engineering efforts.
Wow, that made perfect sense. I found that fascinating to hear. And I think this is an excellent place to leave it.
Adrian, thanks so much for being on this podcast. I think our listeners will find it very, very interesting, and I know that I enjoyed it greatly. So, thanks again.
You're welcome.