Monitoring is not an add-on

Chris Coveyduck

10 Mar 2023

I recently gave a talk to one of the UK's largest building societies about developing highly scalable, resilient applications and thought it would be an interesting topic for a post. As someone deeply involved in Site Reliability Engineering and implementing data-driven operations, I've seen how ineffective many approaches to observability can be, so I hope this will provoke some discussion on the subject.

What is 'add-on' monitoring?

Time and time again I see monitoring, or observability, designed and implemented after an application or service has been written and handed to some poor unsuspecting team to manage.

In almost every case their approach to implementing observability looks something like this: monitor the infrastructure so we know when it's up; send some alerts when it's saturated, or there's some other event that has little to do with whether users are having a bad day or not.

They create dashboards, sometimes displayed on big screen TVs in the office, with graphs and statistics that make management feel warm and fuzzy about life as they walk past.

Yet in all my years' experience doing this type of work, I've yet to see that approach beat a user to calling the service desk and complaining that the system is down, which does sort of raise the question: what's the point?

Monitored by design

In my opinion, unless observability is built into the application when it's designed, with a focus on surfacing the metrics and measures that can actually tell you when it isn't performing well (or, just as importantly, when it is), we never really benefit from the investment.

Whether you use Log Analytics, Splunk, Elastic Stack or whatever, these are expensive observability tools to run and maintain and there needs to be real value delivered or it's just for show.

To showcase an example of how this can be done better, I recently delivered a project where we built an application to process about 15m transactions a day with a throughput of well over 4TB.

During development I took the time to work with the developers to ensure they decorated their application with the right metrics to inform how each component is performing in real-time.

I had them follow some simple principles for observability, like always emitting at least one metric for Latency (how long is my transaction taking), Throughput (how many transactions am I doing) and Error Rates (what percentage are failing), to allow the SRE team to build genuinely meaningful monitoring and observability tools and automations.
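To make those three signals concrete, here is a minimal sketch of what "decorating" a transaction might look like. It is not the code from the project; the class name `TransactionMetrics`, its methods and the simulated payloads are all hypothetical, and it uses only the Python standard library.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class TransactionMetrics:
    """Hypothetical in-process recorder for latency, throughput and error rate."""

    started_at: float = field(default_factory=time.monotonic)
    total: int = 0              # transactions attempted
    failed: int = 0             # transactions that raised an exception
    latency_sum_s: float = 0.0  # total time spent inside transactions

    @contextmanager
    def track(self):
        """Wrap one transaction: time it and count it, whether it succeeds or fails."""
        start = time.monotonic()
        try:
            yield
        except Exception:
            self.failed += 1
            raise
        finally:
            self.total += 1
            self.latency_sum_s += time.monotonic() - start

    def mean_latency_s(self) -> float:
        """Latency: how long is my transaction taking, on average?"""
        return self.latency_sum_s / self.total if self.total else 0.0

    def throughput_per_s(self) -> float:
        """Throughput: how many transactions per second since start-up?"""
        elapsed = time.monotonic() - self.started_at
        return self.total / elapsed if elapsed > 0 else 0.0

    def error_rate(self) -> float:
        """Error rate: what fraction of transactions are failing?"""
        return self.failed / self.total if self.total else 0.0


# Usage: four simulated transactions, one of which fails.
metrics = TransactionMetrics()
for payload in ["ok", "ok", "bad", "ok"]:
    try:
        with metrics.track():
            if payload == "bad":
                raise ValueError("simulated failure")
    except ValueError:
        pass  # the failure has already been counted by track()
```

The point of the context manager is that the developer only has to wrap the transaction once; all three signals then fall out of the same measurement.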

We delivered the application, built to our defined service level objectives for resilience and scale, faster because the right feedback was available in real time to the relevant teams.

Azure Kubernetes and Azure Monitor

I thought I would finish this post by saying that my 'preferred' platform for delivering highly scalable, resilient applications such as the one I mention above is Azure, and in particular Microsoft's AKS and Azure Monitor.

Imho containerising application services, or micro-services if you are going there, has been a no-brainer for a long time. To me it's just easier to package an app as a container image and schedule it than to faff about with any of the alternatives such as Functions and Webapps. (Cue trolling from disparaged and disgruntled Function and Webapp fans.)

If you're going to containerise applications for Azure you're most likely going to be using AKS, which is great because the Prometheus scraper for Azure Monitor works like a charm and makes it incredibly simple to consume metrics emitted on an HTTP endpoint by each container. And with the data stored in Log Analytics you can build the necessary automation to drive your highly scalable and resilient application.
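As a rough illustration of what "metrics emitted on an HTTP endpoint" means, the sketch below renders counters in the Prometheus text exposition format and serves them on `/metrics`. The metric names, the port and the `MetricsHandler` class are assumptions made for this example, not anything AKS or Azure Monitor requires. It sticks to the standard library to stay self-contained; in practice you would more likely use a Prometheus client library for your language.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(total: int, failed: int, latency_sum_s: float) -> str:
    """Render counters in the Prometheus text exposition format.

    The metric names below are hypothetical; pick names that suit your service.
    """
    lines = [
        "# HELP app_transactions_total Transactions processed.",
        "# TYPE app_transactions_total counter",
        f"app_transactions_total {total}",
        "# HELP app_transaction_errors_total Transactions that failed.",
        "# TYPE app_transaction_errors_total counter",
        f"app_transaction_errors_total {failed}",
        "# HELP app_transaction_duration_seconds Time spent processing transactions.",
        "# TYPE app_transaction_duration_seconds summary",
        f"app_transaction_duration_seconds_sum {latency_sum_s}",
        f"app_transaction_duration_seconds_count {total}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters on GET /metrics for a scraper to collect."""

    # In a real service the application would update these values as it works.
    counters = {"total": 0, "failed": 0, "latency_sum_s": 0.0}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(**self.counters).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever serving /metrics; the port number is an assumption."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` inside the container is all the application side needs to do; throughput and error rate can then be derived from the counters by whatever scrapes the endpoint, rather than being calculated in the app.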

And the best bit? You still get to have dashboards, installed right outside the manager's office with fancy graphs, whilst the real action happens automatically based on data pumped into Log Analytics.

Summary

So in summary, please think about whether you want to spend time retrofitting add-on monitoring into modern, cloud-based services, or whether it would just be easier to have the app developers do it right the first time when they build them.

Please comment and be respectful. These are my opinions on what works for me, and we can all benefit from an open and inclusive discussion.

