Observability (O11y for short) is a key component of DevOps. We talk about it with QA expert and observability enthusiast Alden Peterson of Netflix.
This conversation was so fruitful and interesting that we once again decided to split it into two parts.
In this episode, the hosts, Luca Ingianni and Falko Werner, along with guest Alden Peterson, delve into the nuances of observability in software engineering, discussing its intersection with QA, DevOps, and SRE. They emphasize the importance of understanding and managing risks in software deployment, highlighting the shift from traditional QA towards a more integrated approach within the software development lifecycle (SDLC). The conversation revolves around the challenges in creating confidence and trust in software systems, the role of observability in different stages of the SDLC, and the impact of team structure and organizational priorities on software quality.
Contents
Introduction to observability and its relation to QA and DevOps
Alden Peterson’s career journey and perspective on software development
The concept of observability in software engineering
The role of QA in the context of modern software development and deployment
Integration of SRE practices and philosophies in QA and DevOps
Challenges in maintaining software quality and reliability
Importance of confidence and risk management in the software release process
Tactical and philosophical approaches to improving software observability
Organizational structure and its impact on software development and observability
Transcript (automatically generated; if you find any errors, you may keep them)
Welcome to a new episode of DevOps auf die Ohren und ins Hirn. Or in English, DevOps from your ears straight to your brain. Usually, this is a German-language podcast, but we sometimes make exceptions for international guests. Today is one such exception. My name is Luca Ingianni, and I host this podcast together with my colleagues Dirk Söllner and Falko Werner. We're DevOps consultants, trainers and coaches, trying to get teams to work together better and create better products for their customers. Today, it will be Falko and me running the episode. And we're joined by Alden Peterson, who has lived in the DevOps, SRE and observability space. Alden has worked as an SDET (Software Development Engineer in Test), a Systems Development Engineer, and now a Senior Software Engineer, but has always been passionate about working in the QA and observability space. Whether that's building tools for production monitoring, DevOps tooling, or helping teams understand observability, he has spent most of his software career working to help teams have better insight into, and confidence in, operating and building software. Alden, thanks for being here. Yeah, thank you. Thank you so much for having me. So that was quite a mouthful of an introduction. That sounds like an awesome career. To be honest, you know, I started my work as a test engineer, and to me, it's still the most fun of all the different aspects of software development. Is that the same for you? Yeah, I like the general space. It's interesting to me to see some people really like writing feature code, like they just live for that. For me, I get bored because it feels like you're doing the same thing over and over again. And I understand why people would love that, because there's something to create.
You know, I think one of the challenges, whether it’s DevOps or testing or observability, is you have a lot less of a tangible thing that you’re making. You know, you might make a dashboard, or you might make some metrics. And for a lot of people, that’s just not satisfying. You know, it’s like you spend two days or a week, and then what do you have? You know, if you write feature code, it’s like, look at this beautiful UI I wrote. Or look at this, like, great service I wrote. And then, you know, kind of in this… This whole space of, you know, supporting engineering work. I can understand why people don’t like it. I personally think it’s fun, because I feel like you get much more interesting challenges, because you’re always doing something different. You know, you’re doing kind of the same ultimate goals, but all of the tools are always different. So for me, I’ve loved it. You know, my career has been very strange. Even before I was in software, I worked as a manufacturing engineer. So I worked… You know, my background is actually mechanical engineering. So I’ve kind of had, like, a very… Yeah, I’ve had a very… You know, roundabout way to get into tech. And I just… I think it’s interesting. It is interesting, I suppose, that it seems to me… Maybe that’s some kind of observation bias, but there’s a lot of mechanical engineers in, broadly speaking, the QA space. Yeah, I think that of all of the engineering disciplines, mechanical engineering is the closest to software, because, you know, part of the reason I ended up in software is because programming solves a lot of problems, in the mechanical engineering space, too. You know, in my undergraduate, I had some internships where I basically programmed stuff. This was in VBA. So if you’ve ever worked in Excel and VBA, that was my introduction to learning how to program. And it was something. But, you know, there’s a lot of business problems that you can solve in that space. 
And I think, like, mechanical engineering, of all of the engineering disciplines, has more of that. And I think, too, a lot of people have this grand idea that I'm going to design, like, you know, cars or tractors or rocket ships. And then you realize that, well, you might design tiny components that go into a larger component of that machine. And I think a lot of people get a little disillusioned by that, just because, you know, it's kind of like what I was talking about before. You want to own this thing. And it's like, I don't really want to own, you know, a tiny subcomponent of a subcomponent type of thing. Yeah, what I like about software and mechanical engineering, since I have a master's in computer science for engineering, so… I see that it's helping, like, developing software for creating and designing mechanical systems and managing it. And that's also a part where you see the difference and the ways of integrating both ways of work. Absolutely. Right. Anyway, this was a very roundabout way, I think, of getting to today's topic. Which is supposed to be observability. Alden, what the hell is observability? Yeah, so I think it's an interesting question. So I'll give a couple different answers, because the most common answer, I think, is more of, like, maybe I'll call it the classical SRE approach. So, you know, kind of like Google's SRE book, which is, I think, a little bit higher level than what I would personally define, but kind of: what is going on, right, in the system. And whether that's trying to get to some sort of, you know, error budget or uptime percentages or monitoring, alerting. I think that's where a lot of people gravitate. I think, from my perspective, it's more like: what is giving you confidence that what you're building and operating works?
You know, and this is where I think, when we first met, some of our conversations went: so I've kind of lived in the QA space. I have very strong opinions about QA. And I think longer term, QA and SRE and even DevOps all kind of merge together, because they're all answering that same question, which is: how do we know that we can have confidence that what we're building works? And then also, how do we actually do it, right? So I think there's an aspect to observability which is kind of philosophical, but then tactically, I think it breaks down, and I can talk more about this, too. I can go on for a long time, because this is interesting to me. But, you know, I think QA is a big part of observability. I think DevOps, which is very much an overloaded term, but the DevOps methodology, I think, is a lot of observability. You know, if you're going to fully do CI/CD, you have to know what's going on, right? You can't just ship code to prod and cross your fingers that it works. If you're doing really continuous deployment, you need to have some idea that what I just deployed works. Maybe that's metrics coming off the service. Maybe that's some smoke testing. Maybe that's some sort of synthetic monitoring, so you're running customer flows. You have to have something. Otherwise, you have no idea what you're deploying to prod. And I think that's an aspect of observability. And I guess I'd be curious what you both would define observability as, too. I've found that it's one of those terms that a lot of people have very different interpretations of, kind of based on their background and where specifically they look at it from. Yeah, that's an interesting question. So, because similarly to DevOps, I'm kind of struggling how to define observability. I suppose some people define it from a very technical point of view, right?
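The post-deployment signals mentioned here (metrics, smoke testing, synthetic monitoring) can start as small as a script that probes a couple of endpoints right after a release. A minimal sketch in Python; the base URL and the `/health` and `/api/version` paths are invented examples, not anything from the episode:

```python
"""Minimal post-deploy smoke check: one of the signals a CI/CD pipeline
can use to gain confidence that what was just deployed actually works."""
import urllib.request


def smoke_check(base_url, paths=("/health", "/api/version"), timeout=5, fetch=None):
    """Hit a few read-only endpoints right after a deployment.

    Returns a dict mapping each path to True (HTTP 200) or False.
    `fetch` can be injected for testing; by default it does a real HTTP GET.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status

    results = {}
    for path in paths:
        try:
            results[path] = fetch(base_url + path) == 200
        except Exception:
            results[path] = False  # a network error counts as a failed check
    return results


if __name__ == "__main__":
    # In a pipeline, the deploy step could fail (or roll back) if any value is False.
    print(smoke_check("http://localhost:8080"))
```

This is deliberately crude: it only answers "does the service respond at all?", which is exactly the kind of minimum signal the conversation says you need before trusting continuous deployment.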
It's, you know, it's the thing with the dashboards or something like that. Similarly, to quite a few people, DevOps is about CI pipelines, which in a sense it is, but on the other hand, those people, I suppose, are missing the point, aren't they? And maybe that's the same way it works for observability, isn't it? Yeah, I think when I think about stuff, whether it's observability, QA, or even really DevOps, I think it's easier to define it with kind of the high-level vision. So, I've told a lot of folks on my team that my job is to bring confidence to the SDLC. So, whether that's building, whether that's releasing, whether that's deploying, whether that's testing, all those parts of the SDLC. And I think observability is how you actually trust that your confidence works, right? You can build this really cool system, but if you don't actually understand what the system is doing, are you really going to get that confidence? And I think for most people, that answer is no, especially when you start considering, okay, well, maybe the engineers trust this system. And, candidly, I think engineers are much more willing to trust really opaque, black-box systems, where they understand it worked at one point and it probably works now. But I think once you start expanding more towards product management or engineering leadership, there's not quite the same confidence that that gives, because, you know, some sort of bash script in the build process that does a bunch of magic isn't really going to give your PM a lot of confidence, generally. And so I think some of the outward artifacts of, you know, what are we testing? What are we monitoring? What are we seeing about our production system? How do we debug errors? I think you do need something more dashboardy.
I don't like calling it just dashboards, because there's so much more to this than just dashboards, but you need something more tangible so that you can actually have conversations with people other than the people who are very intimately aware of the system. That's so interesting. This whole conversation about trust, like, can you trust your system? Can you know that it works, like, you know, on a visceral level? Yeah, the word I often use is confidence, because, I mean, if you work in the tech world very long, you'll know you're never going to have 100% confidence, right? Because, you know, last December, was it? Log4j, right? I don't think anybody on the planet would have thought, oh, you know, I'm going to expect that a core library in the Java ecosystem is going to cause a giant headache. There are reasonable assumptions you take. And when we start thinking about confidence, one of the reasons I like thinking about it in this way is that it's just a risk analysis. Right? You're wanting to lower the risk of whatever, whether that's buggy features, whether that's, I don't know, one of the things we've been dealing with is someone basically doing a denial-of-service attack on one of our services. You're wanting confidence that you can handle those sorts of things, but also that it just works. And I think one of the interesting parts of observability, and this goes across whether it's the DevOps aspects of observability, the SRE aspects or the QA aspects, is you could spend years on a relatively basic system trying to get 100% confidence. And you won't get 100%, but maybe you get five nines of confidence or something silly. You know, but all of us work in companies where we can't just dedicate that much effort to things. And so you have to be very pragmatic about, okay, what is actually important to do?
And for me, I think that's an interesting thing, because you have to make an analysis around what gives you that confidence. And whether you call that trust, I think trust is a very similar thing. I think I would probably call it the same. I think Luca used the words trust and confidence. I like the word confidence because I find that, for non-engineers, that word resonates more, because people outside of the engineering space are like, I don't know. I don't know if this is going to work. You guys are releasing this large chunk of software, and how do we know it's going to work? You know, I think there's a confidence factor there that just resonates more. But I think it's basically trust, right? It's both. They're both feelings, I would say. So how do you get those types of feelings into people? What do they need in order to trust, or to have confidence in, the product? So that's a really interesting question, because I think that can depend entirely on what the existing failure modes have been, as well as the more holistic: what should you be doing with, you know, I'm saying should, should is maybe not the right word. What is the ideal instrumentation, telemetry, whatever observability aspects? You know, I think one of the interesting things: I joined my current team at the end of last year, and it was interesting because some of the major pain points that my team was having were very specific to certain failure modes. We had random production blips, stuff just went down, and it was like two, three minutes. And so it was very hard to track down. And so, from a confidence perspective, it's a lot easier if you say, okay, well, I'm going to prove that those are happening, because now you can prove that they're not happening.
And so on my team, that was a very key thing for me when I first joined: okay, let's set up some really basic checks to see what our site's doing on a regular basis. And so in that case, that's a major confidence booster. Well, it's a major insight into what is causing the lack of confidence, I should say. Knowing that your site goes down, and knowing the exact times, is actually confidence-inspiring, versus it just randomly goes down and we don't understand it, but we know we get random customer reports that we can't really reproduce. And I think one of the challenges, and one of the reasons I think the QA, SRE and DevOps spaces all eventually merge together, is that one of the pitfalls of having those as separate orgs or separate silos or separate disciplines or whatever you want to call that, is that each of those has a very different answer to the question that you just asked. Because QA, for example, is going to say, well, any bugs are problems. You know, DevOps, if I broadly generalize and say anybody with a DevOps title, which we could probably have a whole conversation about, that nobody should have a DevOps title, but that's a common enough thing in orgs still, is going to say, well, the underlying infrastructure is stable. And, you know, if you're an SRE, how do those kind of interact together? You're looking much more at the customer-impacting aspect of that. And I think all of those are parts of the answer to the question that you just asked. And I think one of the challenges is, especially in orgs where you have a dedicated QA org and some sort of dedicated infrastructure organization, you end up with each of those groups caring about very specific windows into that observability question.
And then, I don't know that I've seen, at least in my experience, and I would actually be curious if you both have seen this, but I don't see a lot of reconciliation of how all of those pieces fit together, because they all do fit together, right? You know, let's say your infrastructure is bad. In one of my previous companies, we dealt with this all the time. We ran into all sorts of failures on test deployments, because the infrastructure underlying the whole thing was just flaky. And so, is that a QA problem? Is that an infrastructure problem? It's like, no, it's this confidence, observability issue, right? Yeah, but isn't this exactly the model of the wall of confusion that is this sort of founding myth of DevOps, right? That you don't make this distinction of production being up is Ops' problem and bugs not being present is Dev's problem, and you replace that by the common perspective: are we providing value to the customer? By, you know, shipping a feature and making sure that it's stable. If we can't achieve both of those, then we're all, you know, missing the point, aren't we? Yeah, I completely agree with that. That's one of the things I struggle with: what do I describe myself as? You know, I'm not a DevOps engineer. I'm not a QA person. And so I've started calling myself an applied SRE. This isn't really the right terminology, because there's a whole bunch of pitfalls with that title, too. But I like the idea. And, you know, Luca, to your point, I think you want someone on the team who can think about those sorts of things. And that's the team question, you know. The full-stack engineer approach has become really common, and in an ideal world, then, your team owns your infrastructure, you own your deployments, you own everything. And then the team can be responsible for that.
But you need someone on the team to kind of understand and care about this issue. And that's, in my experience, one of the challenges in this space: there's so much going on from a prioritization perspective that it is very challenging for folks to really, not so much care, but have the ability to care. You know, if you're required to ship features and you're accountable for features, observability can very easily slip away. And whether that's observability through testing, whether that's observability through monitoring, whether that's the pipeline, you know, the more quote-unquote DevOps-y side of the release process, it's very easy in my experience to see that kind of slip away as a secondary concern. Which, you know, I'm pretty pragmatic. Sometimes that is the case. If you're a startup, you might not have the budget to survive long enough if you don't ship feature code. Your company might die. You might go bankrupt if you don't ship feature code. In which case, yet again, you're not providing value to the customer. So you can still very easily loop it back to: am I offering the best value to my customer, either through stability or through my existence? Yeah, and I think that's one of the things that I like doing; I hadn't really thought of the phrasing, the value to the customer. I think that's very similar to how I've thought about this, because you want to determine what actually matters. You know, not being like, oh, we need monitoring because we need monitoring for the sake of monitoring. I think all of us would probably agree monitoring is almost always valuable, but that doesn't mean that every single project needs somebody who's creating tons of dashboards and tons of metrics. Because at some point it just doesn't matter.
You know, if you build an internal app… one of the apps I wrote, years ago now, I was on the team I was writing it for, and I had some really basic error handling. So any error got emailed to me. I wouldn't really say this is the best observability, but it was good. It was good enough for that team, because, you know, they sat next to me. If something really went wrong, they would just walk over, you know, a meter, five meters, whatever, and say, hey, Alden, what's up with this? There's a different level of observability there. And I think it is a challenge, because I have found, especially in the QA space, that there's a very idealistic approach when it comes to a lot of these things. It's like, you need to do this because this is important. And it misses the practical realities that everybody's busy, and there's almost always more you could do than you can actually do. And the way you're describing it, I actually like; I'll probably steal that and use that terminology, because I think that is a better way to talk about that trade-off. It's also interesting what you talked about in terms of, do you treat this as a QA problem, or how do you look at it? It reminds me of myself when I decided I was going to be a freelancer. I thought, you know what? I'm going to be a professional QA person, because testing, it's fun. This is going to be awesome. And I came to the shocking realization that I could not actually offer quality to my customers unless I took control of the entire SDLC. And so I accidentally found myself doing DevOps by necessity. That's so funny, because that is how I got into DevOps, sort of falling backwards into it. Yeah. And that's very similar to my approach.
My first, I guess you could call it QA job, was when I was hired as an SDET, or Software Development Engineer in Test, which, speaking of titles, is kind of meaningless. But, you know, one of the first things that I did on that team was I realized the PR process just was bad, because the build took forever. And so people would, you know, just not really pay attention to it. And it was just a headache. And really, I think one of the first things I did on that team that really improved the overall quality of the product was that I just reduced the entire CI process time. And in that case, nobody had looked at it. It was years later, so I can say: it was pretty easy if you just look at it and spend the time to do it. I think it was something like an 80% reduction in the total CI time. And it's like, nobody had just put the time in to look at that. You know, I think part of it is, that is a QA concern, I think. I've talked to a lot of people, and I think in QA, you end up going to the DevOps side for the exact same reasons you were talking about, Luca. You end up with, okay, so I can write a bunch of tests, but a bunch of tests don't matter, because when do I execute these tests? How do they get executed? Well, you have to think about where the deployment process is. Okay, now you're talking about the deployment process, so now you're thinking about the pipeline, if you're using some sort of pipeline thing. And now suddenly, you're thinking about the whole DevOps philosophy. And I think that's almost a requirement if you want to have a QA process that's not just: write some tests that get executed against, I don't know, a build or a test environment. Yeah, okay. So, what is the, I don't know, the system that observability is looking into? Is it all of those things? The software itself? The pipeline? The people? I think, yeah.
I think a lot of this will again depend on what your role is on the team. Because I think like leadership will say it’s all of those. I don’t know that, I mean, it’s a challenge, right? Because where does that, where does that bubble stop? You know, one of the things that I’ve, I’ve often told people is that most of the hard technical problems aren’t really technical problems, but they’re like a mix of how people and technology are interacting. And I think observability is almost perfectly in this, right? Like one of the things that I’ve talked to a lot of people about is I tell people, if you’re not going to action an alert, either turn it off entirely or put it into a channel that’s basically like ignored alerts or something. You know, some people just have like an obsessive, I really want to see all these alerts because I don’t, I don’t understand it. I can’t empathize with it because for me, any sort of alert noise just drives me nuts. But some people really have a hard time turning off an alert that they know is valuable but don’t do anything with. And I think like that’s a very interesting combination of like a people and a technology thing, right? Because you can have a bunch of valid alerts that should be done something with. But if nobody’s doing something with them, is that a technology problem or is that a people problem? And I don’t think you can answer that either way. I think it’s a mix because, you know, maybe there’s an aspect to people want to, but they’re just so overloaded because they have so many, maybe product management is pushing a ton of like we need to do feature, feature, feature. And, you know, the poor engineers are just feeling guilty about these alerts they’re ignoring. You know, that’s probably not a common situation. Sometimes people just don’t care. 
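The rule of thumb just described — mute or sideline any alert nobody will action — can be made concrete with a tiny routing sketch. Everything here (the channel names, the `runbooks` mapping) is invented for illustration; a real system would express this in its alerting tool's routing configuration instead:

```python
"""Sketch: an alert nobody will action gets muted or sidelined, not delivered."""

def route_alert(alert_name, runbooks, muted=frozenset()):
    """Decide where an alert goes.

    runbooks: alert name -> documented action someone will actually take.
    Alerts with no runbook go to a low-priority channel that is effectively
    ignored; explicitly muted ones are dropped entirely.
    """
    if alert_name in muted:
        return None                 # turned off entirely
    if alert_name in runbooks:
        return "#oncall"            # actionable: page someone
    return "#ignored-alerts"        # valid but unactioned: out of the way


# Example: only disk-full has a runbook, so only it pages.
runbooks = {"disk-full": "expand the volume; see the team wiki"}
print(route_alert("disk-full", runbooks))   # -> #oncall
print(route_alert("cpu-spike", runbooks))   # -> #ignored-alerts
```

The point of writing it down like this is that the "people problem" (no one owns an action for the alert) becomes visible as a missing entry in `runbooks`, rather than as background noise in a channel.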
You know, some automated template creates a bunch of alerts, and then, you know, either they don't understand why the alerts could be useful, or the alerts don't correlate to any sort of customer impact. So it's kind of just like, well, whatever. And I think it's definitely a mix. I think it's easier to focus on the technical side. And this is one of the things where, going back to the beginning, I think a lot of observability material that I've read focuses more on the technical side, because it's easier to quantify. It's easier to look at, okay, let's look at availability and uptime, right? This is a good example. You can measure it, and people like metrics. The metrics are valuable, and it's easier to say, okay, we're making an impact, because we went from, I don't know, 98% uptime to 99.5%, or something like that. And I think a lot of times the people aspect can get missed, because it's harder. You know, you're both consultants, so you probably deal with this all the time. I would guess, and this is speculation since I'm not in your roles, that the technical parts are really easy. It's the other parts, like getting people to adopt them, and changing people's preconceived notions, or some of the organizational challenges, that make doing more of DevOps harder. That's been the experience I've had. I'm actually really curious if that's how you would both see it from the consulting side, doing that DevOps-y work. Yeah, it's both, I think. You have to have people with the skills to get the things up and running and to find a good starting point to measure. I think sometimes you have no measurements at all, except how many bugs or service requests or issues are coming in.
So that's kind of the first level of transparency or measurement, looking into service availability or uptime, whatever. And then when you start, you get into it step by step and need to develop skills in terms of tools, in terms of how to work with the data that you can get, and measure the important things, or at least alert on the things that are helpful for the customers and for the team as well. I would even want to step one step further back and try to get a better handle on observability. I think it's a broad term pulling lots of things together. You said, in a quite abstract way, create confidence, or Luca said, in a similar way, create trust or build trust. When I hear the term without that deep view on the technology side, I feel it's making things visible that are somewhere hidden in the black-box system that you mentioned, in the processes, in the tools, in the tool chains. [unintelligible] I mean, yeah, it's still a different view on observability. I think that's good. I don't know that I would say that that's less abstract. I think, to make it more tangible, it's really: can you easily understand what's going on? And maybe that's just a shorter way to say what you were saying. It might just be a shorter way, but one of the things I really like doing is setting up synthetic transactions. Basically, a synthetic transaction is: hey, what's a common customer action or behavior, whether that's a UI, you know, clicking through an actual web interface, or whether that's an API call, whatever your customer is, and just setting up a scheduled version of that. And one of the reasons I like that is because it bypasses a lot of the pitfalls of, like, tuning alerts, tuning metrics and
that sort of thing, and you basically say: this is what we're defining as the thing our system does. I'm trying to think of an example. Maybe you have some sort of basic payment system. We'll figure out a way to run a fake payment through that system, run it every minute or every five minutes or whatever, and this comes down to what the customer expectations are. But, you know, if you run that every minute, it's kind of non-negotiable: this is what our system is supposed to do, and if this fails, it's a big deal type of thing. And one of the reasons I like doing that, and I think it's a very insightful thing, is because at first it forces you to define what the critical path is for your service or your website, whatever the application here is. This applies across whether you run an Amazon.com-level website or whether you just have a simple web service, kind of a microservice. Your app does something fundamental that your customers or users have an expectation of, and you just verify that that works. And I think one of the reasons I like doing that is because it simplifies a lot of the complexity around, well, what's confidence, all of those things. It's like, no, this is what our thing is supposed to do, and it stopped doing it, and that's a problem. Maybe latency matters, maybe latency doesn't matter. You know, you can build into that sort of really simple check a lot of useful things. And then you also have a repeatable environment to run it from, too.
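The scheduled fake payment described above boils down to a loop that exercises the critical path on a timer and records failures instead of crashing. A hedged sketch; `fake_payment`, the interval, and the return shape are all invented for illustration:

```python
"""Sketch of a scheduled synthetic transaction: repeatedly drive one
critical customer flow and collect failures so an alert can fire."""
import time


def run_synthetic(transaction, iterations, interval_s=60.0, sleep=time.sleep):
    """Run `transaction` repeatedly; collect failures instead of crashing.

    Returns a list of (iteration, error message) for failed runs, which is
    what you would alert on when the critical path stops working.
    """
    failures = []
    for i in range(iterations):
        try:
            transaction()
        except Exception as exc:
            failures.append((i, str(exc)))
        if i < iterations - 1:
            sleep(interval_s)        # wait before the next probe
    return failures


if __name__ == "__main__":
    def fake_payment():
        # Placeholder: would submit a test payment and verify the response.
        pass

    # Three runs, no waiting, just to show the shape of the loop.
    print(run_synthetic(fake_payment, iterations=3, interval_s=0))
```

In practice this would run from a scheduler or a monitoring service rather than a plain loop, and page someone as soon as `failures` is non-empty; the value, as the conversation notes, is that there is nothing to tune.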
One of the challenges in using real user monitoring, which I’m sure you’re both familiar with as RUM, is that maybe you have the one person who’s off on rural Internet in the middle of nowhere, and the site just doesn’t work for them. It’s not because of your site; it’s because they’ve got 20 bits per second of Internet speed. And you don’t necessarily know that from metrics. In my experience, tuning metrics can be a lot of work, especially for lower-volume sites. If you run a site like Amazon.com or Netflix.com, where you’re getting millions upon millions of customer interactions potentially per second, it becomes a lot easier to define actual metrics based on site interactions, because you have so much data available. But a lot of us work in spaces where we don’t have that level of clearly impactful data. In those cases, I really think just doing some sort of basic synthetic transaction can answer the question you’re asking, Falko, because it’s basically: what does this do? What is the core value proposition? We’re on this podcasting app now, the Zencastr podcast, and a good example there would be: just start a podcast and verify it works. They’ve got one thing, right? It’s to record and create podcasts. So if they wanted to know that, they could certainly try to look at metrics. I don’t know how many people use the site; maybe there are enough that they could just look at how many podcasts get started and saved every X minutes and figure it out.
I don’t know, but my guess is it’s one of those sites where, if you looked at midnight on Friday night, there are probably not a lot of people recording podcasts. One of my previous apps was heavily business-cycle dependent; we basically made an app that was used as a business app. At midnight on Saturday, almost nobody was using it. But if somebody was using it at that time, it was really, really important, because it meant some person was doing work that was really time-sensitive, generally. And so we had to be very mindful of that, because you can look at your metrics and say, okay, great, our app is running. But then Friday night comes, there’s no user traffic, somebody takes down the cluster by accident, and now you don’t even know the app is failing. I could talk a lot about this, because I think synthetic transactions are a really good way to bypass a lot of the pitfalls of observability: you don’t have to tune anything, you basically decide what the key thing is.

Okay, but now I need to ask you a question, because I’m kind of confused. Is there a difference between observability with synthetic test cases and just fairly regular acceptance tests?

Well, that I think is a good question, because acceptance tests in some ways can be useful as synthetic transactions. I’m not a huge fan of going to the point where you run a large test suite against production. You can do that, but if you feel like you need to, it probably means you have gaps elsewhere in your process. So, I guess let me take a step back. I think acceptance tests should be much more about verifying that functionality works and that the integration pieces are correct, rather than necessarily "is my app’s core business logic working".
In previous companies, what I’ve done, and actually talked to people about, is: maybe you have 100 acceptance or integration tests, whatever tests you run against a deployed environment. You don’t need to run all 100 of those as synthetics, but you might be able to say, okay, these two are actually the core, so we’ll run those two. And I think this is part of why observability, QA, DevOps, SRE, all of that morphs together, because when you start thinking about it from this perspective, what you just said is a very natural thing. Like, say you have a QA org that writes tests against the test environment, and an engineering org that owns production monitoring, which I think is a really common situation, actually.

Yeah. So maybe at this point we really need to pull back a little bit and figure out: what is the relationship between QA and DevOps and observability, and if you want, also SRE? How do they fit together? How are they traditionally viewed as separate, and should they still be viewed that way? I’m guessing your answer will be no.

Yeah. From my perspective, the QA industry is undergoing a transition similar to the one the systems administrator, DevOps side went through. And it’s still undergoing it. People are recognizing that you can’t have that siloed component within your engineering org as a broader whole. With SaaS, and I guess we should clarify that a lot of this mostly applies to SaaS, it’s very easy, for example, to roll back software. One of my previous companies, I worked for John Deere, and one of the teams I was close to did the firmware update types of stuff. They would do over-the-wire updates. Well, that’s a very different failure mode if you screw that up.
You could potentially brick every single piece of equipment out in the field, or prevent it from working. That’s a very different failure mode. And so with SaaS, it’s a little easier to trust in a process that will auto-roll-back, for example. So, that disclaimer aside: I think QA historically has tried to answer, do we have bugs in the product? And often that becomes: let’s make sure we don’t have bugs in the product before we deploy to production. I’m generalizing horribly, but I think that’s a pretty decent explanation of what QA often is. So you will often see orgs who have quality problems say: let’s hire a bunch of QA people so we can make sure we don’t have bugs before we ship to the customer.

Well, that’s so interesting. I’d never thought about it in those terms. But it sounds like what you’re saying is: the harder it is for you to deploy software, either for reasons within your product, you know, it’s a tractor, or because your processes are just terrible, so the harder it is to deploy into prod, the more you need a separate QA stage and a separate QA team to make sure that nothing bad goes into prod.

Yeah, I think that’s a fair statement. There’s an aspect to this where, if you’ve made the process hard, and it doesn’t have to be hard, you create a higher chance that you have issues. Like, going back to this podcast: let’s say they push updates once a year. I think most people in tech would say that’s more likely to come with major issues than if they ship once a week, or every commit, or some variant of that. So you do have to be a little careful. Now, that being said, oftentimes people’s first tendency is: okay, we have quality problems, let’s batch up all of our releases so we can make sure they work right.
And you kind of get yourself into this cycle. But one of the things I try to be careful of in these sorts of conversations is recognizing there are valid cases where there are just different constraints. The spaces I’ve lived in, and actually much prefer because of this, are the SaaS spaces, where we’re deploying some sort of web application or some sort of internal service, and it’s really nice not to have to worry about major compliance things. This whole conversation changes a little bit if you start talking about, well, maybe you make medical software, or maybe you make software for governmental approval processes. I’m just kind of ignoring that space; I’m probably not the right person for it, because there are a lot of regulatory issues there.

Yeah, but let me speak to that a little bit, because I have worked on safety-critical software before, up to ASIL D, which means that if your component fails, somebody’s going to die. And that does influence the way you build your product, doesn’t it? But it only really changes the last step, which is actually moving it into production, like bolting it into a vehicle in our case. All the rest is still very much the same. All of these considerations that we talked about in terms of production software would just be worked on in a pre-production setting: if it’s a physical thing, in a prototype environment, in a test-drive environment, stuff like that. So we’re still back to the same confidence conversation. We just need to be sure that our confidence level is way, way higher before we actually unleash this on the unsuspecting public, right?

Right. You know, this conversation has been pretty philosophical.
And one of the things, from my perspective, is that a lot of the way you influence people on this topic is just getting them to think about it differently. We’ve had a lot of higher-level conversation here, without as much tactical stuff, and from my perspective, that’s the hard part. The tactical part often is pretty straightforward, but it’s also very specific to the constraints you’re dealing with. We were just talking about safety-critical software; there are going to be different constraints there than in, I don’t know, making the next social media app. But this whole philosophy: if you’re really thinking about observability from a holistic perspective, and you really understand what this idea of observability is trying to do, it translates really nicely to any domain you’re in. And this is something I have definitely experienced: if you’ve worked in a place with decent observability around your software process, I love writing code in that situation, because I just don’t worry. I trust that if I merge code, it’ll either work, or it’ll be some weird edge case I didn’t know about, or I’ll get an alert, or something will fail. And I love that. That observability is kind of the key to get you there. And again, we were talking about this right at the beginning: where is the boundary of observability? I think you can make it wherever you want. You can scope it really tightly and say observability is production monitoring and production alerting.
I guess you could do that, but really, that confidence answer covers the whole process. You can even talk about build-time observability, too. A lot of people deal with flaky tests; that’s a common thing, whether it’s flaky unit tests, flaky integration tests, whatever. Let’s say one of them is super flaky all the time. It’s going to break your confidence in the build, and it’s going to break your confidence that those unit tests are actually useful. And maybe you just don’t understand that that test fails 50% of the time. I think some of the newer APM tools are doing a lot of work in this space; Datadog, I know, is building some sort of flaky test detector. And part of the reason is that if you can start getting rid of those false positives, it’s the same as alerting. If you’re dealing with production alerts, false positives are the worst, especially if you’re talking about paging. If you page someone with a false positive, you’re basically ruining people’s ability to care about production pages very quickly. It’s the natural human instinct: if you keep telling me, hey, something’s on fire, and I look and nothing’s on fire, and then you do it five times in a row, tuning out is a totally normal human response. I don’t fault people at all for not paying attention or not caring. If you have an alert channel that is a stream of alerts nobody cares about because they’re pointless, and then there’s one that matters, why in the world would somebody actually look at that?
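As a back-of-the-envelope illustration (not how Datadog’s detector actually works, which the episode only mentions in passing), a first-cut flaky-test detector just looks for tests with mixed pass/fail outcomes across recent CI runs of the same code:

```python
from collections import defaultdict


def find_flaky_tests(runs, min_runs=5):
    """Flag tests that both pass and fail across recent runs.

    `runs` is an iterable of (test_name, passed) tuples collected from
    CI history. A test that sometimes passes and sometimes fails on
    unchanged code is flaky; one that always fails is simply broken.
    Returns a dict mapping flaky test names to their failure rate.
    """
    history = defaultdict(list)
    for name, passed in runs:
        history[name].append(passed)

    flaky = {}
    for name, results in history.items():
        if len(results) < min_runs:
            continue  # not enough signal yet to judge
        fail_rate = results.count(False) / len(results)
        if 0 < fail_rate < 1:  # mixed outcomes imply flakiness
            flaky[name] = fail_rate
    return flaky
```

This is exactly the alerting analogy made above: once a test is known to fail, say, 50% of the time regardless of the code, every one of its failures is a false positive, and surfacing that failure rate is what lets you quarantine or fix it before it erodes confidence in the whole build.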
But there’s an aspect to this where all of this thinking, this philosophical approach, is very useful for creating the more tactical side of things. What does that look like tactically? Well, it’s going to look different tactically. One of the challenges I have when I talk with people: I’ve had a lot of conversations now that go something like, hey, we’re thinking about QA, we don’t really know what we want, and we don’t really want QA, but we know you’ve got opinions on this topic; how should we think about this? And a lot of times people are basically saying: we lack confidence in our release process, we lack confidence in production monitoring, but we don’t know how to talk about it. I’m not sure if that makes sense, but it’s harder for me to describe it tactically. A lot of times people are looking for: here are the five things I can do. We have observability problems, we have quality problems, we have whatever; give me the five tactical things. People jump to that. And I think it’s more that you need to think about it the way we’ve been talking about it, and then the tactical things become really obvious. For example, maybe you have production incidents that take two hours to debug because you can’t figure out what’s going on. That’s a really clear tactical win pretty quickly. A lot of the philosophical side of things translates at the organization level pretty easily. And I think that applies across this whole space, although I’m kind of going off on a rabbit trail, as they say.
But QA is the same thing, observability, whether that’s SDLC observability or the DevOps side of things and how the infrastructure works; it’s all kind of the same thing.

Yeah. And I think the thing is the system, which is the product itself plus the people building it and operating it, this whole feedback-loop structure. And this is, I think, where it closes the loop. If I have flaky tests, well, that breaks the system, doesn’t it? It doesn’t break it in the, let’s call it, forward direction of features flowing towards the customer, but it does break the feedback direction. And as such, it will break the system as a whole.

Yeah. I think, Falko, you were asking earlier: is it a people thing or a technology thing? And this is why it’s hard for me to answer. When we think about production monitoring or production incident response, it’s both. Look at what you’re saying: do you have runbooks?

I think it’s hard to answer at all, because the system is both the technology and the people. If you don’t have the people as the sort of reactive, creative part of the system, then who’s going to react to problems? Of course, it’s going to be the people. And yes, of course, they need technological support, be it dashboards or something else that enables them to observe what’s going on, hence observability, and react to it in a meaningful way.

Okay. But are you measuring or evaluating the responses of the human or social part of the system as well? Is that part of observability? Do you look into it?
So that is a harder question for me to answer. And the reason it’s harder is: I would answer yes, but I’m also aware that it becomes very challenging depending on people’s ability to influence their organizations. For the average engineer, I think the answer is more of a no, because so much of that is organizational and about prioritization. For example, at the very beginning we were talking a little bit about being pushed to create features. If, let’s say, 100% of your bandwidth is allocated towards feature development, and you want to put time into observability or monitoring or testing or any of that sort of stuff, it’s hard for me to say that that’s the engineer’s problem to deal with. That’s much more of an organizational issue, as far as how the organization prioritizes the SDLC engineering work. So I think it depends on your role. If you’re an engineering manager listening to this, or a PM listening to this, I think the answer is much more: yeah, it’s your responsibility to balance. We had a relatively minor incident recently, but I was very glad we had it, because it made very clear: okay, these are the areas where we have problems addressing an incident, and if we have a real issue once this product is more actively used, we’re going to have serious issues because of these things. And it was very easy for our PM to go: wow, yeah, okay, let’s prioritize. I don’t know if I’d call that a metric; you can’t strictly quantify what the gaps are. But being able to articulate the risks, I think, is really helpful. You can call that a metric, even if you can’t put a specific number on the risk.
But I think that’s one of the things that, from my perspective, is really important in this whole thing, too. Because let’s be honest: most of us work in non-life-or-death situations. Luca, you were talking about how people could potentially die in your case; most of us don’t work on that type of system. And so when we talk about the risk trade-off of how we get confidence, the desired confidence doesn’t have to be 100.0%. I don’t want to put a number here, because I’ll get a bunch of people saying, oh, it needs to be slightly higher, but most of us are below 100% for what we need from a business perspective. But you need to be able to say more tangibly: okay, we don’t have good logging, we don’t have good monitoring, we don’t have good understanding, so if we have another issue in prod, we might have to wait several hours to make a new build that has more debugging, or something like that. Being able to quantify that from an engineering perspective can still be useful. But ultimately, if you live in an organization where the value incentive structure is just not going to support that, you’re kind of stuck. And so that’s where it’s hard to say it’s a people problem as an engineer. But if you’re in a leadership role, or you have influence on this, it’s absolutely a mix of both.

I’m also thinking back to when I was working on safety-critical products: having the awareness that people’s lives depended on you doing your job well gave a lot of clarity to those conversations, because it was much easier to say, no, I don’t feel confident shipping this yet. And everybody would be like, oh, okay. Fine, we have deadlines, but we also potentially have dead people, so I know where my trade-off lies.
One of the reasons I don’t like having a separate QA org is that I think you want QA to be responsible for answering that same type of question, more of a risk analysis than a "we don’t want bugs". Because in companies where there’s a separate QA org, what happens is QA feels responsible for stopping all bugs, period, not for stopping enough bugs to justify still shipping. That’s one of the reasons I’m pretty passionate that QA should ultimately report to the leader who’s responsible for the product, whether that’s a first-line engineering manager or a director; it doesn’t really matter. QA is providing that insight. And this goes back to your question, Falko, about people versus process versus technology. Yes, QA needs to bring the risks up. But if ultimately product or engineering or whoever’s making the decision says that’s an acceptable risk, I don’t have a problem with that, as long as it’s an intentional decision. I do have a problem if it’s an unintentional decision and it just happens. And unfortunately that’s often the case: people don’t consciously think about it and just say, well, we need to ship this anyway.

Yeah. And yet again, this reminds me of one of the projects I was working on: a steering column lock. It was an amazingly stupid piece of hardware; it could go click or clack, right? And because it was safety-critical, there was a very deliberate, very careful risk assessment. And the assessment said: actually, it opening at the wrong point in time is completely irrelevant. Whatever, let it open; the only people who are going to complain about it are the car insurers. But if you’re going 200 kilometers an hour down the autobahn and this thing closes, you’re going to have a bad day. So we had that clarity about our risk assessment.
And so for the same bolt moving, we had very different risk assessments for it moving forwards and for it moving backwards. One was "whatever", and the other one was highly, highly safety-critical. And by the way, even in this case, we couldn’t possibly go for 100% safety, because that would have required infinite effort.

Yeah, that’s a good example, too. One of the interesting things to me in a lot of this space, and I have a mechanical engineering background, so we’re both coming at observability from a very different background, is this: in tech, there’s a common attitude of, oh, we are a unique, special industry, we’re the only industry that ever deals with problem XYZ. And maybe it’s because I came from a different industry, but it’s just not that way. The issue you’re talking about is the same thing that observability and monitoring are trying to answer at their core, right? Other industries have the same challenges, too. And maybe that’s why it’s easier for me to think about this in a more philosophical sense. When I was an intern, I was working on a project involving anhydrous ammonia, which is not something you want to be around if it goes wrong. It’s a very dangerous substance, used in farming as fertilizer. And it’s kind of the same thing you were just talking about, right? There’s a very different risk factor there. We had to be really careful because farmers are very prone to just doing what they want and customizing stuff. So it’s not just somebody randomly doing the wrong thing on the interstate or on the autobahn; it’s "oh, I want this to work differently, so I’m going to go in and change how this works" types of things. It’s a different failure mode, but they’re the same problems.
You have the same questions: is this working the way it’s supposed to? And then you want to build systems to prevent it from becoming, in that case, a safety concern. Fundamentally, you want to build processes to prevent, and be aware of, things not working right. That’s the same thing we’ve just been talking about in the software industry. Yes, certainly the tooling is different; going back to synthetic transactions, you’re not going to put those on these sorts of things, but it’s the same fundamental philosophy. And I think earlier I was making that point: the way you influence people is to get them to think about it like this. All of us probably naturally think about this as a risk assessment or prioritization assessment when we’re talking to people, just because of the nature of our roles. But I don’t know that a lot of people see it that way; it’s more, well, I’m not directly responsible for this, and I have a whole bunch of other work, so I guess I’ll do that instead. I don’t think people are doing this consciously. But that’s one of the things I’ve talked to a lot of people internally here about: if this is good, and you believe in it as a priority, then you have to shift priorities. Period. You can’t just tack it on.

Okay. Dear listeners, as we were recording this podcast, we discovered that we couldn’t really do the topic justice in just one episode. And so, while we were recording, we decided to keep going and record a second half of this conversation and publish it as a separate episode.
So this is the end of this first episode, which was maybe on a somewhat higher level, more on the level of philosophy and culture. In the next episode, we will try to go down to a more practical level and talk about how to actually approach and implement observability in your own organizations and your own products. So please stay tuned for the second half of this conversation with Alden, which I think you’ll enjoy greatly. I know I did. Thank you.