Episode 58: Observability with Alden Peterson 1/2 [EN]

Observability (o11y for short) is an important component of DevOps. We talk about it with QA expert and observability fanatic Alden Peterson of Netflix.

This conversation was so fruitful and interesting that we once again decided to split it into two parts.

In this episode, the hosts, Luca Ingianni and Falko Werner, along with guest Alden Peterson, delve into the nuances of observability in software engineering, discussing its intersection with QA, DevOps, and SRE. They emphasize the importance of understanding and managing risks in software deployment, highlighting the shift from traditional QA towards a more integrated approach within the software development lifecycle (SDLC). The conversation revolves around the challenges in creating confidence and trust in software systems, the role of observability in different stages of the SDLC, and the impact of team structure and organizational priorities on software quality.

Contents

  • Introduction to observability and its relation to QA and DevOps
  • Alden Peterson’s career journey and perspective on software development
  • The concept of observability in software engineering
  • The role of QA in the context of modern software development and deployment
  • Integration of SRE practices and philosophies in QA and DevOps
  • Challenges in maintaining software quality and reliability
  • Importance of confidence and risk management in the software release process
  • Tactical and philosophical approaches to improving software observability
  • Organizational structure and its impact on software development and observability

Transcript (automatically generated; if you find any mistakes, you may keep them)

Welcome to a new episode of DevOps auf die Ohren und ins Hirn.
Or in English, DevOps from your ears straight to your brain.
Usually, this is a German-language podcast, but we sometimes make exceptions for international guests.
Today is one such exception.
My name is Luca Ingianni, and I host this podcast together with my colleagues Dirk Söllner and Falko Werner.
We’re DevOps consultants, trainers and coaches, trying to get teams to work together better and create better products for their customers.
Today, it will be Falko and me running the episode.
And we’re joined by Alden Peterson, who has lived in the DevOps and SRE and Observability space.
Alden has worked as an SDET, Software Development Engineer in Test, Systems Development Engineer,
and now a Senior Software Engineer, but has always been passionate about working in the QA and Observability space.
Whether that’s building tools for production monitoring, DevOps tooling, or helping teams understand observability,
he has spent most of his software career working to help teams have better insights and confidence into operating and building software.
Alden, thanks for being here.
Yeah, thank you.
Thank you so much for having me.
So that was quite a mouthful of an introduction.
That sounds like an awesome career.
To be honest, you know, I started my work as a test engineer, and to me, it’s still the most fun of all the different aspects of software development.
Is that the same for you?
Yeah, I like, I like the general space.
It’s interesting to me to see some people really like writing feature code, and like they just live for that.
For me, I get bored because it feels like you’re doing the same thing over and over again.
And I understand why people would love that, because there’s something to create.
You know, I think one of the challenges, whether it’s DevOps or testing or observability,
is you have a lot less of a tangible thing that you’re making.
You know, you might make a dashboard, or you might make some metrics.
And for a lot of people, that’s just not satisfying.
You know, it’s like you spend two days or a week, and then what do you have?
You know, if you write feature code, it’s like, look at this beautiful UI I wrote.
Or look at this, like, great service I wrote.
And then, you know, kind of in this…
This whole space of, you know, supporting engineering work.
I can understand why people don’t like it.
I personally think it’s fun, because I feel like you get much more interesting challenges,
because you’re always doing something different.
You know, you’re doing kind of the same ultimate goals,
but all of the tools are always different.
So for me, I’ve loved it.
You know, my career has been very strange.
Even before I was in software, I worked as a manufacturing engineer.
So I worked…
You know, my background is actually mechanical engineering.
So I’ve kind of had, like, a very…
Yeah, I’ve had a very…
You know, roundabout way to get into tech.
And I just…
I think it’s interesting.
It is interesting, I suppose, that it seems to me…
Maybe that’s some kind of observation bias,
but there’s a lot of mechanical engineers in, broadly speaking, the QA space.
Yeah, I think that of all of the engineering disciplines,
mechanical engineering is the closest to software,
because, you know, part of the reason I ended up in software
is because programming solves a lot of problems,
in the mechanical engineering space, too.
You know, in my undergraduate, I had some internships
where I basically programmed stuff.
This was in VBA.
So if you’ve ever worked in Excel and VBA,
that was my introduction to learning how to program.
And it was something.
But, you know, there’s a lot of business problems
that you can solve in that space.
And I think, like, mechanical engineering,
of all of the engineering disciplines, has more of that.
And I think, too, a lot of people have this grand idea
in that I’m going to design, like, you know, cars or tractors
or, you know, rocket ships.
And then, you know, you realize that, well,
you might design, like, tiny components
that go into, you know, a larger component of that machine.
And I think a lot of people get a little disillusioned by that,
just because, you know, it’s kind of like what I was talking about before.
You want to own this thing.
And it’s like, I don’t really want to own, like,
you know, a tiny subcomponent of a subcomponent type of thing.
Yeah, what I like about software and mechanical engineering,
since I’m a computer science for engineering master,
so…
I see that it’s helping, like, developing software
for creating and designing mechanical systems
and managing it.
And that’s also a part where you see the difference
and the ways of integrating both ways of work.
Absolutely.
Right.
Anyway, this was a very roundabout way, I think,
of getting to today’s topic.
Which is supposed to be observability.
Alden, what the hell is observability?
Yeah, so I think it’s an interesting question.
So I’ll give a couple different answers,
because I think the answers, the most common answer,
I think, is more of, like, maybe I’ll call it, like,
the classical SRE approach.
So, you know, kind of like Google’s SRE book,
which is, I think, a little bit higher level
than what I would personally define,
but kind of what is going on, right, in the system.
And whether that’s trying to get to some sort of,
you know, kind of, you know, kind of, you know, kind of, you know,
you know, error budget or, you know, uptime percentages
or, you know, monitoring, alerting.
I think that’s where a lot of people gravitate.
I think, from my perspective, it’s more like,
what is giving you confidence that what you’re building
and operating works?
You know, is that, and this is where I think, you know,
when we first met, I think some of our conversations are,
you know, so I’ve kind of lived in the QA space.
I don’t, I have very strong opinions about QA.
And I think longer term QA and SRE and even DevOps
all kind of merge together, because they’re all kind of answering
that same question, which is, how do we know
that we can have confidence that what we’re building works?
And then also, how do we actually do it, right?
So I think, like, there’s an aspect to observability,
which is kind of philosophical, but then tactically,
I think it breaks down, and I can talk more about this, too.
I can go on for a long time, because this is interesting to me.
But, you know, there’s, I think QA is a big part of observability.
I think DevOps, which is very much an overloaded term,
but the DevOps, like, methodology, I think,
is a lot of observability.
You know, if you’re going to,
if you’re going to fully do CICD,
you have to know what’s going on, right?
You can’t just, like, ship code to prod
and just, like, cross your fingers that it works.
You know, if you’re doing really continuous deployment,
you need to have some idea that what I just deployed works.
Maybe that’s metrics coming off the service.
Maybe that’s some smoke testing.
Maybe that’s, you know, some sort of synthetic monitoring,
so you’re running customer, you have to have something.
Otherwise, you have no idea what you’re deploying to prod.
And I think, like, that’s an aspect to observability.
And I guess I’d be curious what you both
would define observability as, too.
I found that it’s one of those terms
that a lot of people have very different interpretations on,
kind of based on their background
and where specifically they look at.
Yeah, that’s an interesting question.
So, because similarly to DevOps,
I’m kind of struggling how to define observability.
I suppose some people define it
from a very technical point of view, right?
It’s, you know, it’s the thing with the dashboards
or something like that.
Similarly, to quite a few people,
DevOps is about CI pipelines,
which in a sense it is,
but on the other hand,
those people, I suppose,
are missing the point, aren’t they?
And maybe that’s the same way
it works for observability, isn’t it?
Yeah, I think from when I think about stuff,
whether it’s observability, QA,
and even really even DevOps,
like, I think it’s easier to define it
with kind of, like, the high-level vision.
So, like, I’ve talked with a lot of folks on my team
that my job is to bring confidence to the SDLC.
So, whether that’s building,
whether that’s releasing,
whether that’s deploying,
whether that’s testing,
all of that sort of parts of the SDLC.
And I think observability is how you actually trust
that your confidence works, right?
You can’t, you know,
you can build this really cool system,
but if you don’t actually understand
what the system is doing,
like, are you really going to get that confidence?
And I think for most people,
that answer is no,
especially when you start considering,
okay, well, maybe the engineers trust this system.
And, like, candidly,
I think engineers are much more willing
to trust really opaque,
like, black box systems that, you know,
they understand it worked at one point
and it probably works now.
But I think once you start expanding more
towards product management
or engineering leadership,
there’s not quite the same confidence that that gives
because, you know,
some sort of bash script
that’s in the build process
that does a bunch of magic
isn’t really going to give your PM
a lot of confidence generally.
And so I think, like,
some of the outward artifacts of,
you know, what are we testing?
What are we monitoring?
What are we seeing
about our production system?
How do we debug errors?
I think you do need something more dashboardy.
I don’t like calling it just dashboards
because there’s so much more to this
than just dashboards,
but you need something more tangible
so that you can actually have conversations with people
other than the people
who are very intimately aware of the system.
That’s so interesting.
This whole conversation about trust,
like, can you trust your system?
Can you know that it works,
like, you know, on a visceral level?
Yeah, that’s the word I often use is confidence
because I think, I mean,
if you work in the tech world very long,
you’ll know, like,
you’re never going to have 100% confidence, right?
Because, you know, like, last, was it December?
Log4j, right?
I don’t think anybody on the planet would have thought,
oh, you know, I’m going to expect that a core library
in the Java ecosystem is going to cause a giant headache.
It’s like, there’s reasonable assumptions you take.
And I think when we start thinking about confidence,
one of the reasons I like thinking about it in this way
is that it’s just a risk analysis.
Right?
You’re wanting to lower the risk of whatever,
you know, whether that’s buggy features,
whether that’s, I don’t know,
one of the things we’ve been dealing with is
someone’s basically doing a denial of service attack
on one of our services.
You know, you’re wanting confidence
that you can handle those sorts of things,
but also that it just works.
And I think one of the interesting parts of observability,
and this goes across whether it’s kind of
the DevOps aspects of observability,
the SRE aspects or the QA aspects,
is you could spend years on a relatively basic system
to get 100% confidence.
And you won’t get 100%,
but maybe you get five nines of confidence
or something silly.
You know, but like all of us work in companies
where we can’t just dedicate that much effort into things.
And so you have to be very pragmatic about,
okay, what is actually important to do?
And for me, I think that’s an interesting thing
because it’s just, you have to make analysis
around what gives you that confidence.
And whether you call that trust,
I think trust is a very similar thing.
I think I would probably call that the same.
I think Luca used the word trust, confidence.
The word,
I like the word confidence
because I think for non-engineers,
I find that word resonates more
because I think people outside of the engineering space
are like, I don’t know.
Like, I don’t know if this is going to work.
Like, you guys are releasing this large chunk of software
and how do we know it’s going to work?
You know, I think there’s a confidence factor there
that it just, it resonates more.
But I think it’s basically trust, right?
It’s both.
It’s both feelings from what I would say.
So how do you get those types of,
feelings into people?
So what do they need to be trusty
or to have confidence in the product?
So that’s a really interesting question
because I think that can depend entirely
on what the existing failure modes have been,
as well as kind of the more holistic,
what should you be doing with, you know,
I’m saying should, should is maybe not the right word.
What is the ideal instrumentation, telemetry,
whatever observability aspects?
You know, I think one of the interesting things,
I joined my current team the end of last year
and it was interesting because some of the major pain points
that my team was having were very specific
to certain failure modes.
Like we were just, we had random production blips,
like stuff just, and it was like two, three minutes.
And so it was very hard to track down.
And so it was from a confidence perspective,
it’s a lot easier if you say, okay, well,
I’m going to like prove that those are happening
because now you can prove that they’re not happening.
And so on my team, that’s a very,
that was a very key thing for me when I first joined was,
okay, let’s set up some just like really basic checks
to see what our site’s doing on a regular basis.
And so in that case, that’s a major confidence booster.
Well, it’s a major insight into what is causing
the lack of confidence, I should say.
Knowing that your site goes down and knowing the exact times
is actually confidence inspiring
versus it just randomly goes down
and we don’t understand it,
but we know we get random customer reports
that we can’t really reproduce.
And I think like one of the challenges,
and so one of the reasons I think like the QA,
SRE, DevOps space all kind of eventually merge together,
is because I think one of the pitfalls
of having those as separate orgs or separate silos
or separate disciplines or whatever you want to call that,
is that each of those has a very different answer
to the question that you just asked.
Because QA, for example, is going to say,
well, any bugs are problems.
You know, DevOps, if I broadly generalize
and say anybody with a DevOps title,
which we could probably have a whole conversation
about, that nobody should have a DevOps title,
but that’s a common enough thing in orgs still,
that that’s like, well, the infrastructure,
underlying infrastructure is stable.
And, you know, if you’re an SRE,
like how do those kind of interact together?
But like you’re more looking at much more
of the customer impacting aspect of that.
And I think like all of those are parts
of the answer to the question that you just asked.
And I think one of the challenges is,
especially in orgs,
if you’re like a dedicated QA org
and a dedicated like some sort of infrastructure organization,
you end up with each of those groups
caring about very specific windows
into that observability question.
And then I don’t know that I’ve seen,
at least in my experience,
I would actually be curious if you both have seen this,
but I don’t see a lot of reconciliation
of how all of those pieces fit together
because they all do fit together, right?
You know, if you have an,
let’s say your infrastructure is bad,
you know, and one of my previous companies,
we dealt with this all the time.
We ran into all sorts of failures on like test deployments
because the infrastructure underlying
the whole thing was just flaky.
And so is that like a QA problem?
Is that an infrastructure problem?
It’s like, no, it’s like,
it’s this confidence observability issue, right?
Yeah, but isn’t this exactly
this model of the wall of confusion
that, you know, is this sort of founding myth of DevOps, right?
That you don’t make this distinction of,
you know, production being up is Ops' problem
and bugs not being present is Dev’s problem.
And you replace that by the common perspective,
are we providing value to the customer?
By, you know, shipping a feature
and making sure that it’s stable.
If we can’t achieve both of those,
then we’re all, you know, missing the point, aren’t we?
Yeah, I completely agree
with that.
That’s one of the things I struggle with
is what do I describe myself as?
You know, like, I don’t want to be a,
I’m not a DevOps engineer.
I’m not a QA person.
And so I’ve started calling myself an applied SRE.
This isn’t really the right terminology
because there’s a whole bunch of pitfalls
with that title too.
But like, I like the idea.
And, you know, Luca, to your point,
I think you want someone on the team
who can think about those sorts of things.
And like, that’s the team question, you know,
like the full stack engineer approach
has become really common.
And that’s in an ideal world.
Then your team owns, you know, your infrastructure,
you own your deployments, you own everything.
And then the team can be responsible for that.
But you need that team, someone on the team
to kind of understand and care about this issue.
And that’s, in my experience,
one of the challenges in this space
is that there’s so much going on
from a prioritization perspective
that is very challenging for folks to really,
not so much care, but have the ability to care.
You know, if you’re required to ship features
and kind of you’re accountable for features,
observability can very easily slip away.
And whether that’s observability through like testing,
whether that’s observability through monitoring,
whether that’s pipeline, you know,
like the more quote unquote DevOps-y side
of the release process,
it’s very easy in my experience
to see that kind of slip away as a secondary concern,
which, you know, I’m pretty pragmatic.
Sometimes that is the case.
You know, if you’re a startup,
like you might not have the budget
to like survive long enough
if, you know, you don’t ship feature code.
You know, you might,
your company might die.
You know, you might go bankrupt
if you don’t have feature code.
In which case, yet again,
you’re not providing value to the customer.
So you can still very easily loop it back to,
am I offering the best value to my customer,
either through stability or through my existence?
Yeah, and that’s like,
I think one of the things that I like doing,
I haven’t really thought of the phrasing,
the value to the customer.
I think that’s very similar
to kind of what I’ve thought about this
because you want to determine like what actually matters.
You know, not being like,
oh, we need monitoring
because we need monitoring
for the sake of monitoring.
Like, I think all of us would probably agree,
monitoring is most always valuable,
but that doesn’t mean that every single project
needs somebody who’s, you know,
creating tons of dashboards and tons of metrics.
Like, because at some points it just doesn’t matter.
You know, if you build an internal app,
you know, one of the apps I wrote,
years ago now, was for a team I sat on.
I was on the team I was writing this for
and I had some like really basic error handling.
So any error got emailed to me.
You know, I wouldn’t really say
this is like the best observability,
but it was good.
It was good enough for that team
because the most, you know,
they sat next to me.
You know, if something really went wrong,
they would just walk over 10,
you know, a meter, five meters, whatever,
and go say, hey, Alden, what’s up with this?
It’s like, you know,
there’s a different level of observability there.
And I think it’s just,
it is a challenge because I have found,
especially in the QA space,
that there’s a very like idealistic approach
when it comes to a lot of the things.
It’s like, you need to do this
because this is important.
And it misses the practical realities
that everybody’s busy
and there’s,
there’s almost always more
than you can do than actually do.
And I think really to what you,
the way you’re describing it is kind of,
I actually like,
I’ll probably steal that
and use that terminology
because I think that is like a better way
to talk about that trade-off.
It’s also interesting
what you talked about in terms of,
do you treat this as a QA problem
or how do you look at it?
It reminds me of myself
when I decided
I was going to be a freelancer.
I thought, you know what?
I’m going to be a professional QA person
because testing,
it’s fun.
This is, you know,
this is going to be awesome.
And I came to the shocking realization
that I could not actually
offer quality to my customers
unless I took control
of the entire SDLC.
And so I accidentally found myself
doing DevOps by necessity.
That was,
that’s so funny
because that is how I got into DevOps,
sort of falling backwards into it.
Yeah.
And that’s very similar to my approach.
My first,
I guess,
you could call it a QA job
was when I was hired as an SDET
or Software Development Engineer in Test,
which speaking of titles
that are kind of meaningless.
But, you know,
one of the first things
that I did on that team
was I realized like the PR process
just was bad
because the build took forever.
And so people would, you know,
just not really pay attention to it.
And it was just a headache.
And really,
I think one of the first things
I did on that team
that really improved
the overall quality of the products,
I just reduced the entire CI process time.
But, you know,
and in that case,
nobody had looked at it.
It was really,
it was years later,
so I can say,
it was pretty easy
if you just look at it
and spend the time to do it.
I think it was something like
an 80% reduction
in the total CI time.
And it’s like,
nobody just put the time
in to look at that.
You know,
I think part of it is,
that is like a QA concern,
I think.
I’ve talked to a lot of people,
like,
I think QA,
you end up going to the DevOps side
for the exact same reasons
you were talking about, Luca,
that you end up with,
okay,
so I can write a bunch of tests,
but like a bunch of tests don’t matter
because when do I write,
when do I execute these tests?
How do they get executed?
Well,
you have to think about
where the deployment process is.
Okay,
now you’re talking about
the deployment process,
so now you’re thinking about
the pipeline,
if you’re using some sort
of pipeline thing.
And now suddenly,
you’re thinking about
like the whole DevOps philosophy.
And I think that
that’s almost a requirement
if you want to have
like a QA process
that’s not just write some tests
that get executed
against like a,
you know,
I don’t know,
a build or a test environment.
Yeah, okay.
So what is the, I don’t know,
the system that observability is looking into?
Is it
all of those things?
The software itself?
The pipeline?
The people?
I think,
yeah.
I think a lot of this
will again depend on
what your role is
on the team.
Because I think
like leadership will say
it’s all of those.
I don’t know that,
I mean,
it’s a challenge, right?
Because where does that,
where does that bubble stop?
You know,
one of the things that I’ve,
I’ve often told people
is that
most of the
hard technical problems
aren’t really technical problems,
but they’re like a mix
of how people and technology
are interacting.
And I think observability
is almost
perfectly in this, right?
Like one of the things
that I’ve talked to
a lot of people about
is I tell people,
if you’re not
going to action an alert,
either turn it off entirely
or put it into a channel
that’s basically
like ignored alerts
or something.
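(A minimal sketch of that routing idea: alerts nobody will action go to a muted channel instead of paging. The Alert shape and channel names are illustrative assumptions.)

```python
# A hypothetical routing rule: alerts nobody will action go to a muted channel
# (or, better, get deleted) instead of paging anyone.
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    actionable: bool  # will a human actually do something when this fires?


ONCALL_CHANNEL = "#oncall-pages"      # placeholder
IGNORED_CHANNEL = "#alerts-ignored"   # placeholder, effectively muted


def route(alert: Alert) -> str:
    if alert.actionable:
        return ONCALL_CHANNEL
    # Deleting the alert entirely would be cleaner; a muted channel is the
    # compromise for people who cannot quite bring themselves to turn it off.
    return IGNORED_CHANNEL


print(route(Alert("checkout latency p99 > 2s", actionable=True)))   # -> #oncall-pages
print(route(Alert("disk 70% full on a dev box", actionable=False))) # -> #alerts-ignored
```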
You know,
some people just have
like an obsessive,
I really want to see
all these alerts
because I don’t,
I don’t understand it.
I can’t empathize with it
because for me,
any sort of alert noise
just drives me nuts.
But some people really
have a hard time
turning off an alert
that they know is valuable
but don’t do anything with.
And I think like
that’s a very interesting
combination of like
a people
and a technology thing,
right?
Because you can have
a bunch of valid alerts
that should be
done something with.
But if nobody’s
doing something with them,
is that a technology problem
or is that a people problem?
And I don’t think
you can answer that
either way.
I think it’s a mix
because, you know,
maybe there’s an aspect
to people want to,
but they’re just
so overloaded
because they have so many,
maybe product management
is pushing a ton of like
we need to do feature,
feature, feature.
And, you know,
the poor engineers
are just feeling guilty
about these alerts
they’re ignoring.
You know,
that’s probably
not a common situation.
Sometimes people
just don’t care.
You know,
some automated template
creates a bunch of alerts
and then there’s,
you know,
either they don’t understand
why the alerts
could be useful
or they don’t correlate
to any sort of,
you know,
customer impacts.
So it’s kind of just like,
well, whatever.
And I think it’s,
I think it’s definitely a mix.
I think it’s easier
to focus on the technical side.
And I think this is
one of the things
where I think,
you know,
going back to the beginning,
I think a lot of observability
stuff that I’ve read
focuses more
on the technical side
because it’s easier
to quantify.
You know,
it’s easier to look at,
okay,
you need to,
okay,
like let’s look at
availability and uptime,
right?
Like this is a good example.
It’s,
you can measure it,
you know,
and people like metrics,
you know,
it’s a good,
and the metrics are valuable
and it’s easier to say,
okay,
we’re making an impact
because we went from,
I don’t know,
98% uptime
to 99.5%
or,
you know,
something like that.
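(A small worked example of what such a jump means in practice: the allowed downtime per 30-day month at a few availability targets.)

```python
# Allowed downtime per 30-day month at a few availability targets.
def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    return (1.0 - availability) * days * 24 * 60


for target in (0.98, 0.995, 0.999):
    print(f"{target:.3%} -> {downtime_minutes_per_month(target):.1f} minutes/month")

# 98.000% -> 864.0 minutes/month (about 14.4 hours)
# 99.500% -> 216.0 minutes/month (about 3.6 hours)
# 99.900% -> 43.2 minutes/month
```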
And I think like a lot of times
the people aspect
can get missed
because it’s harder.
You know,
you’re both consultants,
so you probably deal
with this all the time.
You’re basically doing,
I would guess,
and this is speculation
since I’m not on your roles,
but I would guess
that the technical parts
are really easy.
It’s the parts,
like getting people
to adopt them
and changing people’s
either preconceived notions
or some of the,
you know,
organizational challenges
that make doing
more of DevOps harder.
Like that’s,
that’s been the experience
I’ve had.
I’m actually really curious
if that’s how you would both see it,
from kind of the consulting side,
working on that DevOps-y side.
Yeah,
it’s both,
I think.
You have to have
people with the skills
to get the things
up and running
and to find
a good starting point
to measure.
I think sometimes
you have no measurements
at all
except how many
bugs or service requests
or,
or issues are coming in.
So that’s kind of
the first level
of transparency
or measurement
or looking into
service availability
or uptime,
whatever.
And then when you start,
you get into it
step by step
and need to develop skills
in terms of tools,
in terms of
how to work
with the data
that you can get.
And measure the,
the important things
or at least alert
on the things
that are helpful
for the customers,
for the team as well.
I would even want
to step one step
further back
and try to get
a better handle
on observability.
I think it’s
a broad term
kind of
getting lots of things
together.
You said in a
quite abstract way
create confidence,
or Luca said, in a similar way,
to create trust
or build trust.
I think when I,
when I hear the term
without the,
that deep view
on the technology side,
I feel it’s making things visible
that are somewhere hidden
in the black box system
that you,
that you mentioned
in the processes,
in the tools,
in the tool chains.
Is there a way to promote a solution, a way to make it better for the people in your place? I mean, yeah, it’s still a different view on observability now.
I think that’s good. I don’t know that I would say that that’s less abstract. Um, I think, like, to make it more tangible, I think it’s really, like, can you easily understand what’s going on? And maybe that’s just a shorter way to say what you were saying. It might just be a shorter way.
But I think, like, so one of the things I really like doing is setting up synthetic transactions. And so, like, basically a synthetic transaction is, hey, what’s a common customer action or behavior, whether that’s, like, a UI, you know, clicking through an actual web interface, or whether that’s an API call, you know, whatever your customer is, and just setting up a scheduled version of that.
And one of the reasons I like that is because it bypasses a lot of the pitfalls of, like, tuning alerts, tuning metrics, and that sort of thing. And you basically say, this is what we’re defining our thing to do. You know, maybe, I don’t know, I’m trying to think of an example. You know, maybe you have some sort of, like, basic, I don’t know, payment system. We’ll figure out a way to run a fake payment through that system. Run it every minute or every five minutes or whatever. And this comes down to, like, what’s the customer expectations on this. But, you know, if you run that every minute, it’s kind of a non-negotiable: this is what our system is supposed to do, and if this fails, it’s a big deal type of thing.
And one of the reasons I like doing that, and I think that’s a very insightful thing, is because at first it forces you to define what is that critical path for your service or your website or whatever, you know, whatever the application here is. This applies across whether you run, like, an amazon.com-level website or whether you just have a simple web service that’s kind of, you know, a microservice. Your app does something fundamental that your customers or users have an expectation of, and you just verify that that works.
And I think one of the reasons I like doing that is because it simplifies a lot of the complexity around, well, what is, like, what’s confidence, what, like, all of those things. It’s like, no, this is what our thing is supposed to do, and it stopped doing it. That’s a problem, you know.
Maybe latency matters, maybe latency doesn’t matter.
You know, you can like build into that sort of really simple check a lot of useful things.
And then you also have a repeatable environment to run it from, too.
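(A minimal sketch of a synthetic transaction along these lines: drive one core customer flow, here a hypothetical fake payment, on a schedule and treat any failure as a big deal. Endpoint, payload and the alerting hook are illustrative assumptions, not anyone’s actual tooling.)

```python
# A hypothetical synthetic transaction: run one fake payment through the system
# every minute and alert on any failure of this core flow.
import json
import time
import urllib.request

PAYMENT_URL = "https://api.example.com/v1/payments"  # placeholder
SYNTHETIC_PAYLOAD = {"amount": 100, "currency": "EUR", "synthetic": True}


def run_fake_payment(timeout: float = 10.0) -> bool:
    req = urllib.request.Request(
        PAYMENT_URL,
        data=json.dumps(SYNTHETIC_PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 201
    except Exception:
        return False


def alert(message: str) -> None:
    # Placeholder: page on-call, post to chat, open an incident, etc.
    print("ALERT:", message)


while True:
    if not run_fake_payment():
        alert("synthetic payment failed; the core flow is broken")
    time.sleep(60)  # once a minute, per the "non-negotiable" framing above
```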
One of the challenges in using like real user monitoring, which I’m sure you’re both familiar with as RUM,
is that maybe you have the one person who’s off on, you know, rural Internet in the middle of nowhere
and their site just doesn’t work because, and it’s not because of your site,
it’s because, you know, they’ve got, you know, 20 bits per second of, you know, Internet speed.
And you don’t necessarily know that from metrics.
And in my experience, tuning metrics can be a lot of work, especially for lower volume sites.
You know, if you run a site like Amazon.com or, you know, Netflix.com or something,
where you’re getting millions upon millions of like customer interactions potentially per second,
it becomes a lot easier to define like actual metrics based on site interactions because you have so much data available.
But a lot of us work in spaces where we don’t have that level of clear impacting like data.
And in those cases…
I really think like just doing like some sort of basic synthetic transaction can sort of answer the question that you’re asking, Falko,
because it’s basically what is our, what does this do?
Like, what is, what is the core value proposition?
You know, like we were on this like podcasting app now, like maybe they should just, that like a good example there was like,
just start a podcast and verify like stuff works, you know, that’s, you know, the Zencastr podcast.
That’s, you know, it’s kind of like they’ve got one thing, right?
Like it’s to record and create podcasts.
And so.
So if they wanted to know that, they could certainly try to look at metrics.
I don’t know how many people use this site, you know, maybe there’s enough people using this that they could just look at how many podcasts get started and saved every X minutes and figure it out.
I don’t know, but you know, my guess is it’s one of the sites where, you know, if you looked at, you know, midnight on Friday night, there’s probably not a lot of people that recording podcasts, you know, or, you know, like that’s, you know, one of my previous apps was heavily business cycle dependent.
So we basically made an app that was used as a business app.
Well, yeah, I’m like midnight on Saturday that almost nobody was using it.
But if somebody was using it.
At that time, it was really, really important because that meant some like person was doing some work that was like really time sensitive, generally.
And so we had to be very mindful of that because you can look at your metrics and say, OK, great, our app is running.
But then, OK, Friday night comes and no user traffic and somebody takes down the cluster by accident and now you don’t even know if that app is failing.
That’s I could talk a lot about this because I think synthetic transactions are a really good way to bypass a lot of the pitfalls with observability because you’re you don’t have to tune anything.
You basically decide what’s the key thing here.
OK, but now I need to ask you a question because I’m kind of confused.
Is there a difference between observability and using synthetic test cases and just fairly regular acceptance tests?
Well, so that I think is a good question, because I think acceptance tests in some ways can be useful as synthetic transactions.
I don’t know that I I’m not a huge fan of to the point where you have to.
Run like a large test suite against product.
I don’t I think you can do that.
But I think if you’re feeling like you need to do that, it probably means you have gaps elsewhere in your process.
So, like, for example, if you I guess let me take a step back.
I think acceptance tests should be a lot more about verifying like functionality works and the integration pieces are correct rather than necessarily is my app like core business logic working.
And so in previous companies, what I’ve what I’ve done and actually talked to people about is maybe you have 100 either acceptance integration tests.
Whatever tests that you run against a deployed environment, you don’t need to run all 100 of those as synthetics, but you might just be able to say, okay, these two are actually the core.
So we’ll run those two.
And I think this is part of why I think like observability, QA, DevOps, SRE, all of that morphs together, because when you start thinking about it from this perspective, what you just said is a very natural thing, right?
Like, let’s say you have a QA org that writes tests against the test environment and engineering org owns production monitoring, which I think is a real, really common situation, actually.
Is.
Yeah.
So maybe at this point, we really need to pull back a little bit and kind of figure out what is the relationship between QA and SRE?
And DevOps and observability, and if you want also SRE, how do they fit together?
How are they, how are they traditionally viewed as separate and should they be viewed that same way still?
I’m guessing your answer will be no.
Yeah.
So I think from my perspective, the QA industry is undergoing a similar transition to the one that kind of the systems administrator, DevOps side did.
And I mean, it’s still it’s still undergoing.
But and that is.
I think people are recognizing that you can’t have the siloed component to your engineering org as a broader whole. Because with SaaS, and I guess we should probably clarify that a lot of this mostly applies to SaaS, it’s very easy, for example, to roll back software in the SaaS space.
Generally, you know, at one of my previous companies, I worked for John Deere, and, you know, one of the teams that I was close to, the team I was on, they did the firmware update types of stuff.
So they would do over the wire updates.
Well, that’s a very different failure mode if you screw that up.
You could potentially, like, brick or prevent every single piece of equipment that’s in the field from working.
That’s a very different failure mode.
And so I think with SaaS, it’s a little easier to trust in, like, a process that will auto roll back, for example.
And so that disclaimer aside, I think QA historically has tried to answer like, do we have bugs in the product?
And often that happens as: let’s make sure we don’t have bugs in the product before we deploy to production.
I think that’s I’m generalizing horribly.
But I think that’s a pretty decent explanation of what QA often is.
And so you will often see orgs who have quality problems say, let’s hire a bunch of QA people so we can make sure we don’t have bugs before we ship it to the customer.
Well, that’s so interesting.
I’d never thought about it in those terms.
But it sounds like what you’re saying is: the harder it is for you to deploy software, you know, either for reasons within your product, like, you know, it’s a tractor, or because your processes are just terrible,
so the harder it is to deploy into prod, the more you need a separate QA stage and a separate QA team, to make sure that nothing bad goes into prod.
Yeah, I think that’s I think that’s a fair statement.
I think there is an aspect to this that if you’ve made the process hard and it doesn’t have to be hard, you create a higher chance that you have issues, you know, like if let’s say you have, like going back to this podcast, let’s say they push updates once a year.
Well, it’s almost guaranteed.
Like I think most people in tech would say that’s probably more likely to come back with
major issues than if they ship once a week or every commit or some variant on that. So I think
you do have to be a little careful. Now, that being said, I think like oftentimes people’s
first tendency is, okay, we have quality problems. Let’s batch up all of our releases so we can
make sure that they work right. And you kind of get yourself into this cycle.
But I think one of the things I try to be careful of is in these sorts of conversation is
recognizing there are valid cases where a lot of this is, there’s just different constraints.
I’m not sure, the spaces I’ve lived in and I actually much prefer because of this are like
the SaaS spaces where we’re deploying some sort of web application or some sort of internal service
that it’s really nice to not have to worry about a lot of like, you know, I’ve not worked for
companies where there’s major compliancy things, you know, like this whole conversation changes a
little bit if you start talking about, well, maybe you make medical software, you know, or maybe you
make software for governmental approval processes. But I’m so I’m just kind of ignoring that space.
There’s a, I’m probably not the right person for that because there’s a lot of regulatory issues
there.
Yeah, but let me speak to that a little bit, because I have worked on safety critical
software before, like up to ASIL D, which means that if your component fails, somebody’s going
to die. And that does influence the way you build your product, doesn’t it? But it only really
changes sort of the last step, which is actually moving it into production, like bolting it into
a vehicle in our case. All of the rest is still very much the same. And all of these
considerations that we talked about in terms of production software would maybe
just be, you know, would be worked on in a pre-production setting, you know, if it’s a
physical thing, you know, in a, in a prototype environment, in a test drive environment,
stuff like that. So we’re still back to the same confidence conversation. We just need to be,
we just need to be sure that our confidence level is way, way higher before we actually
unleash this on the unsuspecting public, right? Right. You know, and,
and this conversation has been pretty philosophical. And I think one of the
things that from my perspective is the case is that a lot of the way you help
influence people on this topic is just thinking about this differently. You know, I think we,
we’ve had a lot of like higher level conversation here, like without as much tactical stuff. And I,
I think that’s, from my perspective, that’s the hard part. The tactical part often is pretty
straightforward, but that’s also very specific to whether it’s the constraints that you’re dealing
with. You know, we were just talking about, you know, whether it’s safety critical software,
there’s going to be different constraints than,
I don’t know, making like the next social media app. But I think like one of the things is like
this whole philosophy. I think if you’re really thinking about observability from a more holistic
perspective and you’re really understanding what is it, what is this idea of observability trying
to do? I think it really translates really nicely to any of the domains that you’re in. You know,
a lot of the times it’s like, and this is something I have definitely experienced is if you’ve worked
in a place where you have decent observability around your software process, I love writing
code in that situation because I, I just don’t worry. You know, I trust that if I merge code,
it’s going to, I’ll either, it’ll either work. It’ll either be some weird edge case I didn’t
know about, or I’ll get an alert or something will fail. And I love that. And I think like
that observability, like is like kind of the key to get you there. And, and again, I think like
this, cause we were, I think talking about this right at the beginning, you know, what is the,
where is the boundary of observability? And I think you can make it wherever you want. You can
really tightly scope it to say, like, observability
is production monitoring and production alerting. And I think, I mean, I guess you could do that,
but I think really you’re trying to like that confidence answer is the whole process. You can
even say like build time observability too. You know, a lot of people have, like, issues dealing
with flaky tests. Like that’s a common thing and we can call it flaky unit tests, flaky integration
tests, whatever. Well, like if you don’t know, like, let’s say that one of them is super flaky
all the time. It’s going to, it’s going to like break your confidence in the bill and it’s going
to break your confidence that like those unit tests are actually useful.
And maybe you just don’t understand like, like that test fails 50% of the time. And like,
I think some of the newer like APM like tools are doing a lot of work in this space. Like
Datadog, I know, is building some sort of flaky test detector. And part of the reason is
because if you can start getting rid of those false positives, you know, it’s the same as
alerting. You know, if you’re dealing with production alerts, false positives are the worst,
you know, especially if you’re talking about paging. You know, if you page someone with a
false positive, you’re basically ruining the ability for people to care about production
pages very quickly. Because, you know, it’s the natural human instinct. If
you keep telling me, hey, something’s on fire and I look and it’s like nothing on fire. And then you
do it five times in a row. Like it’s totally, it’s a normal human response. I think I don’t
fault people at all for not paying attention or not caring about things. Like, why would you care
about, you know, if you get an, if you have an alert channel, that is a stream of alerts that
nobody cares about because they’re pointless. And then there’s one that matters. Why in
the world would somebody actually look at that? But I think like, there’s an aspect to
this where all of this kind of this thinking approach in this philosophy approach is very
useful for creating more of the tactical side of things on like, what, what does that look like
tactically? Well, it’s going to look different tactically. One of, one of the challenges I have
when I talk with people is people, I’ve had a lot of conversations now where it’s something along
the lines of, hey, we’re thinking about QA. We don’t really know what we want and we don’t really
want QA, but we like know you’ve got opinions on this topic. What’s, what’s, what, how to think
about this?
And a lot of times people are basically saying like, we lack confidence in our release process.
We lack confidence in production monitoring, but we don’t know how to talk about it. I don’t know,
I’m sure if that makes sense, but I think like it’s, it’s harder for me to tactically like
describe, you know, a lot of times I think people are looking for like, here, what are the five
things I can do? You know, like, we have observability problems, we have quality problems,
we have whatever. And people are like, give me the five tactical
things. And I think people like jump to that. And I think it’s more, you need to think about it,
like kind of the way we’ve been talking about it.
And then the tactical things become really obvious, I think, because, you know, like you
know, it might look like something like I’m trying to think here, maybe, maybe you have production
incidents that take two hours to debug because you can’t figure out what’s going on. Like that’s a
really clear tactical win pretty quickly. You know, there’s, there’s a lot of the philosophical
side of things that I think translates at an organization level pretty easily. And I think
that applies really like this whole space, you know, whether that’s, I’m kind of going on like a
rabbit trail, as they say.
But like, you know, QA is the same thing, observability, whether that’s like the SDLC
observability, whether that’s a DevOps side of things and how the infrastructure works, how that
all works, like any, it’s, it’s all kind of the same thing.
Yeah. And I think what the, what the thing is, is the system, which is, you know, the product
itself, plus the people building it and operating it, this whole looping structure, you know,
feedback structure.
And this is, I think, where it closes the loop and where you say, you know, if I have flaky tests, well, that breaks that system, doesn’t it? It doesn’t break it in the, in the, let’s call it forward direction of, you know, features flowing towards the customer, but it does break the feedback direction. And as such, it will break the system as a whole.
Yeah.
I think Falko, you were asking this question earlier, like, is it a people or is it a technology thing? And I think this is why it’s hard for me to answer that. You know, when we think about more of, like, the production monitoring or production kind of incident responses? Yes, it’s both. You know, look at what you’re saying. Do you have runbooks?
I think it’s a hard to answer at all, because the system is both the technology and the people. If you don’t have the people as the sort of reactive, creative part of the system, then you’re not going to be able to do that.
If you don’t have the people as the sort of reactive, creative part of the system, then, you know, who’s going to react to problems? Of course, it’s going to be the people. Yes, of course, they need technological support, be it dashboards or be it something else that, that enables them to observe what’s going on, hence observability, and react to it in a meaningful way. But I think here we are.
Okay. But are you measuring or evaluating the responses of the human or social part of the system as well?
Is this part of observability? Do you look into it?
So that is a harder question for me to answer. And I think the reason that it’s harder is, I would answer yes. But I think I’m also aware that that becomes very challenging based on where people’s ability to influence an organization is, to say yes. I think for the average engineer, I think the answer to that’s more of a no, because so much of that is organizational and prioritization. Like, for example, at the very beginning, we were talking a little bit about whether you’re being pushed to create features.
You know, if you’re getting, like, let’s say 100% of your bandwidth is allocated towards feature development, and you want to put time into, say, observability or monitoring or testing or any of that sort of stuff, it’s hard for me to say that that’s the engineer’s problem to deal with. That’s much more of an organizational issue as far as how the organization prioritizes the SDLC engineering work. And so I think it depends on your role.
Let’s say, if you’re an engineering manager listening to this
or a PM listening to this, I think the answer
is a lot more of, yeah, it’s your
responsibility to balance.
We had
a relatively minor incident
recently, but I was very
glad we had it, because it was a very clear
okay, these are the areas
we have problems addressing
this incident. And if we have a real issue,
like when this product is more actively used,
we’re going to have some serious issues because of these
things. And that was very easy for our PM
to go, wow, yeah, okay, let’s prioritize.
And so I think, I don’t know if I’d call
that a metric.
It’s not a metric. You can’t quantify it.
What are the gaps? But being able to quantify the risks
I think is really helpful. And you can call
that a metric. I mean, you can’t put a specific
risk to it. But I think that’s
one of the things from my perspective that I think
is really important in this whole thing, too. Because
let’s be honest, the reality is
we’re never, most
of us work in non-life-or-death
situations. Luca, you were talking about that
people could potentially die in your thing.
Most of us don’t work on that.
That type of system. And so the reality
is, when we talk about the risk trade-off
of how do we get confidence,
the desired confidence doesn’t
have to be 100.0%.
Most of us are, I don’t know what percentage,
I don’t want to put a number here because I’ll get
a bunch of people saying, oh, it needs to be slightly
higher. But most of us are below 100%
for what we need from a business perspective
confidence. But you need to be able
to more tangibly say,
okay, we don’t have good logging,
we don’t have good monitoring, we don’t have
good understanding.
So if we have another issue in prod, we might have to
wait several hours
to make a new build that has more debugging
or something like that. I think being
able to quantify that from an engineering perspective
can be still useful. But ultimately
I think if you live in an organization
where the value incentive structure is
just not going to support that, I don’t think you can really
you’re kind of stuck. And so that’s where
I think it’s hard to say it’s a people problem
as an engineer. But I think, yeah, if you’re in the leadership
role or you have influence on this, it’s absolutely a mix
of both.
I’m also sort of thinking,
back to
when I was working in safety-critical
products, having this awareness
that people’s lives depended
on you doing your job well
gave a lot of clarity to those conversations
because, you know, it was much
easier to say, no, I don’t feel
confident shipping this yet. And everybody
would be like, oh, okay.
It’s like, okay, fine, we have deadlines, but
also we potentially
have dead people. So
I know where my trade-off lies.
One of the reasons I don’t like
having a separate QA org is because I think
you want the QA org to be responsible for
answering that same type of, more of a risk
analysis rather than a we don’t want bugs.
Because in companies where
there’s a separate QA org, I think what happens
is QA feels responsible for stopping
all bugs, period. Not
stopping enough bugs to justify
still shipping. And so I think one of
the reasons I’m pretty passionate that I think
QA should be ultimately
reporting to some, the
leader who’s responsible for the product, whether
that’s a first-line engineering manager,
whether that’s a director, it doesn’t really matter.
But I think QA is providing that insight.
And this goes back to your question, Falko, really
about people or process or people
or technology. Yeah, QA
needs to bring the risks up. But if
ultimately product or engineering
or whoever’s making that decision says that’s an
acceptable risk, I don’t have a problem
with that. As long as it’s an
intentional decision. I do have a problem if it’s a
non-intentional decision and it just happens.
And I think that’s unfortunately often the case
where people don’t consciously
think about it and just say, well, we need to ship this away.
Yeah. And yet
again, this reminds me of like
one of the projects I was working on was
a steering column lock. So it was
an amazingly stupid
piece of hardware. It could go click or clack,
right?
And there was a very
because it was safety
critical, there was a very deliberate,
very careful risk assessment. And of
course, the assessment said, actually,
it opening at the wrong point in time
is completely
irrelevant. Like, whatever, let it open.
The only people
who are going to complain about it are the
car insurance, whatever.
But if you’re going
200 kilometers an hour down the autobahn
and this thing closes,
you’re going to have a bad day. And having
this clarity about your risk assessment.
And so for the same sort of bolt
moving, we had very different
risk assessments for it moving forwards and for it
moving backwards. One was
whatever and the other one was highly, highly
safety critical. And by the way,
even in this case, we couldn’t
possibly go for 100% safety
because that would have required
infinite effort.
Yeah, that’s a good example, too. We’ve been talking
about this. One of the interesting
things to me in a lot of this space, and I think
having a mechanical engineering background,
we’re both coming from a very different
background, but it’s
observability. I think
in tech, there’s a common thing like, oh, this is
like, we are like unique, special
industry. We’re the only industry that ever deals with
problem XYZ.
And I don’t know,
maybe it’s because I came from a different
industry. I just, it’s not that
way. And so it’s
that issue you’re talking
about, I think it’s like the same thing that
at a core observability
and monitoring, it’s all trying to answer
there, right?
Other industries have the same
challenges, too.
And maybe that’s why it’s easier for me to think
about this in more of a philosophical sense.
When I was an intern, I was working on a project with
anhydrous ammonia, which is not
something you want to be around if it goes
wrong. It’s a bad day because
it’s a very dangerous thing.
It’s using farming and fertilizing.
And it’s kind of the same thing you were just
talking about, right? There’s a very different risk
factor there. We were doing something because
farmers are
less, they’re very prone
to just doing what they want because they want
to do it and like customizing stuff. So you have to be really
careful because it’s not just like somebody
randomly shifting wrong on the interstate
or on the autobahn. It’s like, oh, I want
this just to work differently. So I’m going to go in and like,
change how this works, types of things. And it
is, it’s just, it’s different failure mode, but it’s like the
same problems. You know, you have, you have the same thing
like, is this working the way it’s supposed to? You have all of
these problems you have. Then you want to build
systems to prevent it from, from being,
you know, like in that case, like a safety concern.
But, you know, it’s fundamentally like you
want to build processes to prevent
and be aware of when things aren’t working right.
Well, it’s like, that’s like the same thing
we’ve just been talking about in the software industry.
You know, like, yes,
it’s the exact same thing. It’s now
certainly that the tooling is different.
You know, you’re not going to put like synthetic
transactions that, you know, going back to that
on these sorts of things, but like
it’s the same fundamental philosophy.
And that’s, I think earlier I was
talking about that. And I think back to that point
of it’s the way you like help people
influence people is you just think about it like this.
You know, I think
all of us probably just kind of naturally
think about this as kind of a risk
assessment or prioritization assessment
when you’re talking to people.
Like, I think we all naturally think of it that way
just because of the nature of our roles. But
I don’t know that a lot of people
see it as like a, it’s just more, well,
I’m not directly responsible for this. So
I have a whole bunch of other work. So I guess
I’ll do that. You know, I don’t think people are consciously
doing this. But that’s one of the things I’ve talked to a lot of people
internally here about is, if this
is good and you want to make it, if you’ve believed in
this as a priority, then you have to shift priorities.
Period. You can’t just like tack it on.
Okay. Dear listeners,
as we were recording this podcast,
we discovered that we
couldn’t really do the topic justice
in just one episode. And so
while we were doing the recording, we
decided we were going to keep going
and record a second half
of this conversation and publish that
as a separate episode.
So this is the end of this
first episode, which was maybe on
a somewhat higher level,
on more of a level of
philosophy of culture. And in
the next episode, we will
try to go down to a more
practical level and talk about
how to actually approach, how to actually
implement observability
in your own organizations, in your own products.
So please stay tuned for
the second episode of this conversation,
for the second half of this conversation with Alden,
which I think you’ll enjoy
greatly. I know I did.
Thank you.