DevOps Benchmarking: You Can’t Manage What You Can’t Measure

Most apps start small.

Small is nice. Small is simple. Remember when your applications were small and nice and simple and… manageable?

Me neither. It happens so fast.

At some point along the way, somewhere between tiny-little-projecty-idea-thing and commercially-viable-business-with-real-customers, things get complex. While your business teams are celebrating “scaling”, your development team is acutely aware of the challenges that arise from growth -- exponential increases in infrastructure and system complexity, code volume, and technical debt.

So, we go into “Management Mode”. And we think, “Hey, all of these services we’re using, they provide event streams -- this will be fine, I can get the data! I can measure it.” Then you realize that capturing and interpreting all of those data streams is really time consuming. What may start as a single script using the GitHub API can quickly grow into event streams from hundreds of developers. Suddenly, you’re building a full-blown internal tool for development team measurement.

At Allstacks, we’ve spent years understanding the software development lifecycle and how it often resembles a winding trail through the woods rather than an assembly line. As we baseline a customer’s dev lifecycle, we typically find an immediate opportunity for improvement in CI/CD. So many teams are hamstrung by slow builds and manual deploys!

First, we start monitoring build times. If you decrease build time, cycle time is reduced overall -- meaning your developers can get back to coding faster. But build times don’t tell the whole story. What if build times are fairly short, but builds don’t actually begin for quite a while? It turns out a better metric to monitor may be what we call Build Delay.

Shorter build times lead to faster iterations in a vacuum, but Build Delay is a nuanced metric providing an answer to what engineers actually care about: “When is this actually going to be done?” If someone kicks off a 2-minute build but it takes 30 minutes to start, that’s 32 minutes before your developer can iterate on their work. Shortening that delay positively impacts the developer and everyone else downstream -- including your customers.
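To make the arithmetic concrete, here is a minimal Python sketch of how Build Delay might be computed from build timestamps. It assumes Build Delay means the gap between when a build is kicked off and when it actually starts running. The `BuildRecord` class, its field names, and the `time_to_feedback` helper are hypothetical, made up for illustration; they are not taken from any particular CI system's API or from how Allstacks computes the metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BuildRecord:
    """Hypothetical record of one CI build; field names are illustrative."""
    queued_at: datetime    # when the developer kicked off the build
    started_at: datetime   # when a runner actually began the build
    finished_at: datetime  # when the build completed


def build_delay(build: BuildRecord) -> timedelta:
    """Time spent waiting before the build started (assumed definition)."""
    return build.started_at - build.queued_at


def build_time(build: BuildRecord) -> timedelta:
    """Time the build itself took once it was running."""
    return build.finished_at - build.started_at


def time_to_feedback(build: BuildRecord) -> timedelta:
    """Total wait from kick-off to result: Build Delay plus build time."""
    return build.finished_at - build.queued_at


# The example from the text: a 2-minute build that takes 30 minutes to start.
kicked_off = datetime(2024, 1, 1, 9, 0)
example = BuildRecord(
    queued_at=kicked_off,
    started_at=kicked_off + timedelta(minutes=30),
    finished_at=kicked_off + timedelta(minutes=32),
)

print(build_delay(example))       # 0:30:00
print(build_time(example))        # 0:02:00
print(time_to_feedback(example))  # 0:32:00
```

Tracking the two durations separately shows where the 32 minutes goes: here, shaving the 2-minute build would barely help, while shortening the 30-minute wait would.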

Evaluating the context of the metric and understanding the business impact of the output leads to healthier, more informative measurements. At Allstacks, we’ve seen customers recapture dozens of hours per month of productive time by focusing on driving improvements to Build Delay. It’s amazing how valuable a little clarity can be when things have gotten complicated.

So, yeah -- you can’t manage what you can’t measure... but you also can’t improve what you don’t understand.

This post was originally published on the Clear Measure Community Blog.

Jeremy Freeman, Engineering