Building Green Software by Anne Currie, Sarah Hsu, and Sara Bergman, published by O'Reilly, is available here under a CC BY-NC-ND Creative Commons license, i.e., you can read it and quote it for non-commercial purposes as long as you attribute the source (O'Reilly's book) and don't use it to produce derivative works.
You can buy the book from good bookstores, including Amazon in all regions (currently on offer in the UK), or read it on the O'Reilly site if you are an O'Reilly subscriber.
Monitoring
There was once a product called Chuck that was so well-built it naturally had 99.9999% availability (Yup, he was practically always ready to handle any requests!).
Chuck lived a peaceful life free of downtime and outages in Production Land. One ordinary day, much like any other, he was minding his own business and strolling down Production Avenue when he suddenly felt a sharp loss of connectivity and had to sit slowly on the sidewalk. Chuck thought to himself, “Is this it? Am I finally falling over?”
Was Chuck experiencing a network outage, something that had long since faded into a very distant memory?!
Chuck in Production Land is no fairy tale; it's the real-life story of a Google product named Chubby that was very well-architected and proved so reliable that it led to a false sense of security amongst its users. They conned themselves into believing that it would never go down and so increased their dependence on it well beyond its advertised, observed, and monitored availability.
We all know that unicorns are mythical creatures that don’t exist, and so is 100% uptime for a software product. Even though Chubby rarely faced any incidents, they still occasionally happened, leading to unexpected (and, more importantly, unplanned for) disruptions in its downstream services.
For Google, the solution to this unicorn-ish scenario was to deliberately bring its own system down often enough to match its advertised uptime, ensuring Chubby didn't overdeliver on availability and never again lulled its users into a false sense of security.
The moral of this story is that while striving for the highest possible availability of a product may be an exciting and daring engineering problem to conquer, there are negative consequences that shouldn’t be overlooked. These include not only over-reliance by users but also software-induced carbon emissions.
Before we continue Chuck’s story, we should spend some time defining availability and why the world of tech fawns over it. Afterward, we'll return to basics, examining the why and how of monitoring before taking a detour to discuss why SREs don’t think traditional monitoring is enough for this world of microservices we have created for ourselves. Finally, the big gun. We will briefly investigate Observability and how it fits in with sustainability in tech. This way, as green software practitioners, we can keep up with the ever-increasingly complex world of distributed systems.
We firmly believe that monitoring software systems’ carbon emissions deserves its own shebang (i.e., a chapter)! Not because there is an overwhelming amount of material to examine or tooling to compare and contrast yet, but because it is of the utmost importance for tech professionals to start thinking early on about how being green will integrate with the well-established realms of DevOps and SREs.
It's crucial for us not to reinvent the wheel but to follow the industry standards when figuring out carbon emission monitoring.
A system's availability is about whether it delivers its intended purpose at any given time. This is often quantified as a probability: either the ratio of successful operations to total operations or, in time terms, uptime / (uptime + downtime). Uptime signifies when a system is up and running (a.k.a. operational), while downtime, as suggested by its name, represents when a system is not. Whichever metric is used, the result of this calculation is a percentage, such as 99.99%, colloquially known as “four nines,” as illustrated in Table 10-1.
| Availability level | Allowed downtime per year | Allowed downtime per month | Allowed downtime per day |
|---|---|---|---|
| 90% (“one nine”) | 36 days | 3 days | 2.4 hours |
| 99.9% (“three nines”) | 8.76 hours | 43.5 minutes | 1.44 minutes |
| 99.999% (“five nines”) | 5.26 minutes | 25.9 seconds | 0.87 seconds |

Table 10-1. Availability expressed as allowed downtime per year, per month, and per day
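If you would like to sanity-check Table 10-1 yourself, the arithmetic behind it is straightforward. Here is a minimal sketch in Python (no monitoring tooling involved); it assumes a 365-day year and a 30-day month, which is why its per-month figure for “three nines” lands at 43.2 minutes rather than the 43.5 minutes above.

```python
# Sketch: turning an availability target into an allowed-downtime budget,
# using the uptime / (uptime + downtime) relationship described above.
# Assumes a 365-day year and a 30-day month.

PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "day": 24 * 3600,
}

def allowed_downtime_seconds(availability: float) -> dict:
    """Seconds of allowed downtime per period for a target such as 0.999."""
    return {period: secs * (1 - availability) for period, secs in PERIOD_SECONDS.items()}

if __name__ == "__main__":
    for label, target in [("one nine", 0.90), ("three nines", 0.999), ("five nines", 0.99999)]:
        budget = allowed_downtime_seconds(target)
        print(f"{label:>11}: {budget['year'] / 3600:8.2f} h/year, "
              f"{budget['month'] / 60:7.1f} min/month, {budget['day']:6.2f} s/day")
```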
Mathematically, those “nines,” seen in the first column, do not translate directly into difficulty. However, do not underestimate their significance; the more “nines” there are, the more challenging things become across the board for a system. Why? Because availability doesn't just indicate how much time a system needs to be functional; it also dictates the time allowed for it to be non-operational. Non-operational time in this context includes both planned and unplanned downtime.
You can probably imagine, or gather from Table 10-1, the sheer amount of work required to design, develop, deploy, and manage a “five nines” product that allows only 5.26 minutes of unavailability in the entire year! This substantial workload will also explicitly and implicitly impact the product's carbon emissions.
Availability is just one of the many signals that DevOps and SREs care about. Have you heard the magical tale of the four golden signals, or as we like to call them, the four horsemen of metric-based monitoring? They represent the foundation of modern monitoring.
In software engineering, monitoring involves collecting, visualizing, and analyzing metrics on an application or its system. It's a way for the application’s owner to ensure that the system is performing as intended. In other words, we use metrics, or quantifiable measurements, to describe various aspects of software, its network, and its infrastructure.
The four golden signals, as detailed in Google’s definitive Site Reliability Engineering book, are “latency, traffic, errors, and saturation.”
Latency refers to how long it takes for a product to process a request (i.e., the time taken for a user to send a request and receive a response).
Traffic measures the amount of demand that the software experiences (i.e., the total number of requests users have sent in a given period).
Errors focus on the unsuccessful requests the system is handling, indicating the rate at which it returns error responses to its users.
Saturation refers to the extent to which a computing resource (e.g., CPU or Memory) is being utilized at a given time.
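To make these four signals a little more concrete before we head to Glastonbury, here is a minimal sketch of how they might be instrumented for a single endpoint using the open source prometheus_client Python library. The metric names, the simulated 5% error rate, and the fake CPU reading are our own illustrative inventions, not a standard.

```python
# Sketch: instrumenting the four golden signals for one endpoint.
# Uses the open source prometheus_client library; every name below is illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Traffic: total requests received")
ERRORS = Counter("app_request_errors_total", "Errors: requests that failed")
LATENCY = Histogram("app_request_latency_seconds", "Latency: time taken to serve a request")
SATURATION = Gauge("app_cpu_utilisation_ratio", "Saturation: fraction of CPU in use")

def handle_request() -> None:
    REQUESTS.inc()
    with LATENCY.time():                       # records how long the block takes
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.05:             # pretend roughly 5% of requests fail
            ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8000)                    # exposes /metrics for Prometheus to scrape
    while True:
        SATURATION.set(random.random())        # stand-in for a real CPU reading
        handle_request()
```

Prometheus would then scrape http://localhost:8000/metrics and turn these raw counts into the latency, traffic, error, and saturation graphs an SRE cares about.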
<Sidebar> We've just introduced you to a lot of SRE jargon. Let's use the analogy of buying a Glastonbury Festival ticket (for non-UK folks, imagine it’s like trying to purchase a Taylor Swift concert ticket) to make sense of it all.
Picture this: You've successfully joined the online ticket queue. Everyone, including you, attempting to secure a Glasto ticket contributes to the traffic (or demand) that the ticket platform is experiencing. Now, by some miracle, you've made it through the queue and are inputting your registration details and paying for the tickets. However, the system keeps returning errors because of your continuous entry of incorrect information, leading to a spike in the platform's error rate due to poorly formatted user input—bad luck. Due to your rubbish typing coupled with a bad UI, you failed to get your ticket in time.
In a surprising turn of events, your less over-excited friend successfully inputs her details, secures a ticket, and receives a confirmation response from the website. The time taken for the platform to process the payment request and return a successful response to its customer is the latency.
In 2023, over 20,000 Glasto tickets sold out within 60 minutes. To prepare for the ever-increasing demand, each year the Glasto team makes sure there are more resources than ever ready to handle the workload, so the system never runs too close to full resource saturation. </Sidebar>
How would this scenario look from a carbon metric perspective? Before we get to that, let’s discuss why people get so worked up by those arbitrarily defined quantities.
If there is one thing that everyone, from designers and developers to testers, managers, and even salespeople, can agree on, it is the desire for customers' happiness. But how do we ensure everyone is on the same page and agrees on what it takes to make this happen? Cue service-level metrics!
According to Google, service-level metrics should be the main drivers behind business objectives, which include user satisfaction. Let’s spend the next couple of breaths going over the details so we can agree on the definitions and discuss how carbon emissions can join the pack of service levels.
First up, the Service Level Indicator (SLI) is a direct measurement (i.e., quantity) describing a characteristic of a system. You might have already picked up that any of the four golden signals can be used as an indicator in this context. So, if our SLI is request latency, it means that we are calculating how long it takes for the system to return a response to a request.
Next, the Service Level Objective (SLO) is about how compliant we are with a particular SLI over a specific timeframe. This is commonly expressed as SLI over a period of time, for example, request latency for the past 28 days. SLOs are regarded as the fountain of happiness we want all our engineering teams to drink from. We see them as the cornerstone of SRE principles.
We can also use SLOs to drive decision-making. For instance, during the planning of the next sprint, if we realize we're not going to meet our SLO for the current month, we can take a step back and figure out why availability is suffering before pushing out new features.
Lastly, Service Level Agreements (SLAs) are the formal agreements between a service provider and a service user. We can think of them as a collection of SLOs, such as uptime, latency, error rate, etc.
So, we can see that service-level metrics are a useful tool to steer data-driven decisions. They not only give service providers the opportunity to make promises that can be tracked and measured precisely, and to deliver on them, but they also provide us with a framework that can be leveraged by many other disciplines in software engineering, specifically sustainable computing.
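Before we get carried away, here is a minimal, framework-free sketch of how an SLI and an SLO fit together in code. The 28-day window, the example numbers, and the “low-carbon” objective at the end are hypothetical, included only to hint at where carbon could join the pack of service levels.

```python
# Sketch: checking SLO compliance from raw "good vs. total" event counts.
# The targets, windows, and numbers here are illustrative only.
from dataclasses import dataclass

@dataclass
class Slo:
    name: str      # which SLI this objective is about
    target: float  # e.g. 0.999 means 99.9% of events must be "good"

def sli(good_events: int, total_events: int) -> float:
    """An SLI expressed as the ratio of good events to all events."""
    return good_events / total_events if total_events else 1.0

def is_compliant(slo: Slo, good_events: int, total_events: int) -> bool:
    return sli(good_events, total_events) >= slo.target

# Availability over the past 28 days: successful requests vs. all requests.
availability = Slo(name="availability_28d", target=0.999)
print(is_compliant(availability, good_events=998_870, total_events=1_000_000))  # False

# A hypothetical green objective: share of work done while the grid was low carbon.
low_carbon = Slo(name="low_carbon_energy_28d", target=0.90)
print(is_compliant(low_carbon, good_events=930, total_events=1_000))  # True
```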
Imagine we have it all figured out, just like what we talked about in Chapter 9: we have a Prometheus-style real-time carbon metric that is ready to be used by anyone and everyone. Hurray! We can now throw this metric over the fence to our SRE friends so they can start tracking it alongside the rest of the signals.
<Sidebar>A Prometheus-style metric is currently one of the most popular choices for metrics-based monitoring. By following the style, we are collecting data that is readable, flexible, and usable. For example, we can easily deduce that http_requests_total is a metric that counts the total number of HTTP requests received by a particular system. For more juicy details on why the Prometheus monitoring ecosystem rules, check out Prometheus: Up & Running. (Spoiler alert: PromQL, Prometheus's functional query language, is one of the main stars of the show because it lets its users select and aggregate data in real time.)</Sidebar>
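Sticking with that daydream for a moment, such a carbon metric could be exposed in exactly the same Prometheus style. Below is a quick sketch; the metric name carbon_emissions_grams_total, its labels, and the numbers are made up for illustration, and no standard exporter provides them today.

```python
# Sketch: exposing a hypothetical real-time carbon metric in Prometheus style.
# The metric name and labels are invented for illustration.
from prometheus_client import Counter, start_http_server

CARBON = Counter(
    "carbon_emissions_grams_total",
    "Estimated operational carbon emitted by this service, in grams of CO2e",
    ["service", "region"],
)

def record_emissions(service: str, region: str, grams: float) -> None:
    CARBON.labels(service=service, region=region).inc(grams)

if __name__ == "__main__":
    start_http_server(9100)  # a real service would keep running so Prometheus can scrape it
    record_emissions("checkout", "eu-west-1", 0.42)
    # A PromQL query an SRE might then run over the scraped data:
    #   sum by (service) (rate(carbon_emissions_grams_total[5m]))
    # i.e. grams of CO2e emitted per second, per service, over the last five minutes.
```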
Remember our favorite character, Chuck? Let's envision yet another typical day in Production Land. Chuck the Cheerful, who has promised his clients a “one nine” service both for uptime and renewable energy use, is savoring the last rays of sunshine for the year while diligently working to perform more data manipulation using clean energy. Suddenly, thanks to climate change, heavy rain begins to fall, causing Chuck to lose access to his only clean energy source: solar power.
Chuck is a stickler for best practices and follows SRE principles. He, therefore, has a system in place to measure and monitor the carbon emissions from his data manipulation, ensuring that he doesn’t compromise his commitment to his users in terms of carbon emissions for this particular service.
Faced with having to switch to non-renewable energy sources, such as coal or oil, Chuck recalls that he still has error budget left for the month, precisely up to 3 days in this case, during which he can utilize non-renewable energy. Unfortunately, he has no error budget left for failing to provide any service at all, because he experienced a major outage the week before.
<Sidebar>An error budget describes the leeway left within an SLO over a given period of time. In other words, if your availability target is “three nines,” you have up to 43.2 minutes per month for any planned or unplanned downtime. </Sidebar>
Chuck thus switches to non-renewables to maintain all his SLOs. Remember, this is a chapter about monitoring. If Chuck hadn’t fallen over last week, he could have turned himself off and saved the carbon. C’est la vie. Error budgets work both ways. They are still the right way to handle carbon emissions.
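Since error budgets are doing much of the heavy lifting in Chuck's story, here is a minimal sketch of the underlying arithmetic, using his “one nine” targets over a 30-day month. Treating time spent on non-renewable energy as budget spend is our illustrative extension, not yet an established practice.

```python
# Sketch: how much error budget is left in a 30-day month, for a classic
# availability SLO and for a hypothetical renewable-energy SLO like Chuck's.
MONTH_HOURS = 30 * 24

def error_budget_hours(target: float) -> float:
    """Hours per month the SLO allows you to be 'bad' (down, or on dirty energy)."""
    return round(MONTH_HOURS * (1 - target), 2)

def remaining_budget_hours(target: float, bad_hours_so_far: float) -> float:
    return error_budget_hours(target) - bad_hours_so_far

# Chuck's "one nine" (90%) renewable-energy SLO: 72 hours (3 days) of
# non-renewable running allowed per month, none of it spent yet.
print(remaining_budget_hours(0.90, bad_hours_so_far=0.0))   # 72.0

# His "one nine" uptime SLO after last week's outage ate the whole budget.
print(remaining_budget_hours(0.90, bad_hours_so_far=72.0))  # 0.0
```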
What we have described here is simply slotting carbon metrics in alongside what the industry is already doing very well: production monitoring. Again, it's paramount for green software practitioners not to duplicate efforts but to create an ecosystem where we can efficiently and effectively monitor environment-related metrics.
Let's rewind to the early 2000s in Production Land, when Chuck's dad, Charles, lived peacefully with many neighbors whom he was friendly with but didn't interact with much. During that time, Charles and his neighbors were known as monoliths: self-contained applications that didn't need each other to fulfill their one true purposes in their Land.
Fast-forward to 2023, Chuck not only has to interact with every neighbor on his street, but he sometimes has to reach out to residents at the other end of the city to complete a simple task for users. Chuck and his many neighbors are what we call microservices that make up a distributed system.
Despite the numerous benefits offered by this new consortium of Chuck and Chuck's friends compared to Charles's world, it also introduces unprecedented complexity in determining precisely which one of them is the problematic one when an issue occurs. This intricacy in debugging problems, identifying the source of an issue, and understanding the “unknown-unknowns” in a distributed system is why observability came about.
Observability and monitoring are often mentioned in the same sentence, yet they are two distinct concepts. Let's spend a moment discussing why this showdown was highly anticipated and how, in reality, one doesn't replace the other but complements it.
Even though Chuck and his friends' brand-new world of modern software systems (a.k.a. highly distributed architectures) boasts numerous perks, such as flexibility, scalability, etc., which we probably don't need to write about again, an important negative characteristic that is often overlooked is the complexity it brings.
The Rumsfeld Matrix is a handy tool for determining the uncertainty of an issue. For example, in the SRE space, we summarize the difference between monitoring and observability as helping identify “known” and “unknown” bugs, respectively.
We see monitoring as just enough for back-in-the-day monolithic applications because pinpointing where and why things have gone wrong in a self-contained fashion was relatively straightforward. For instance, we could easily collect the CPU usage of an application, plot a graph, and set up an alert to monitor its performance, preventing the software from becoming overwhelmed.
This issue of an overworked CPU is a well-established “known-unknown” problem that many monolithic products face. Generally, it’s a relatively painless bug to solve because CPU usage is a metric we know how to monitor, and the problem itself is well understood.
However, bugs can become tricky very quickly as we move through the quadrants (a.k.a. facing more intertwined microservices) to things we are neither conscious of nor comprehend (a.k.a. “unknown-unknowns”). Consider this: what metrics and dashboards do you need to set up to detect an issue that only affects the Pixel 7A in Taiwan if you are a mobile app developer supporting the past five generations of Pixel phones in over 30 countries?
This is where traditional monitoring falls short: to have meaningful metrics, dashboards, and alerts, you need to predict what could go wrong and where it might go wrong. Basically, it's a murder mystery without any clues! If you ask anyone who has held a pager before, they will tell you that dealing with the same incident more than once in a distributed system is a rare occurrence.
Observability originated from control theory. Specifically, it is defined as a measure of how well the software system’s internal states can be understood from its external outputs. This emerging topic allows us to identify not only that something is broken but also where and why it's broken. Neat, right?
Observability is not only about the three pillars of telemetry: traces, metrics, and logs. Observability starts with collecting telemetry, but simply collecting it will not help us understand our system and make it observable. A system is truly observable only when we can use the collected telemetry to explain what is going on inside it based on the observations we have made from the outside.
So monitoring is what will allow us to know that someone in Production Land is facing network issues, but it’s observability that will allow us to know it’s Chuck’s next-door neighbor’s bathroom that needs a firewall resync.
So what do you think? Is green software ready for observability? How should we ensure carbon metrics move with this hot new topic?
We did ask you in the previous section, When a Carbon Metric is Ready, to dream about when we had it all figured out! In this section, we want you to continue this fantasy of ours, where we not only have standardized real-time metrics but also traces that can correlate events happening in intricate distributed systems. These correlations will help us pinpoint which event triggered which component to emit the most energy and, therefore, allow us to treat this bottleneck effectively and directly.
A truly observable system will not only help DevOps and SREs debug and troubleshoot an outage but also assist green software advocates in staying compliant with their sustainability SLOs in the most efficient ways. We want to know quickly which events are the culprit behind the high electricity usage!
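To round off the fantasy, here is a minimal sketch of what that could look like with the OpenTelemetry Python API: spans annotated with energy and carbon attributes so trace tooling could later point at the most emission-heavy step of a request. The carbon.* attribute names are invented for illustration (they are not part of any semantic convention), the energy figure is a stand-in, and exporter/SDK setup is omitted.

```python
# Sketch: annotating trace spans with hypothetical energy/carbon attributes.
# Attribute names are made up; exporter and SDK configuration are omitted,
# so the spans below go to a no-op tracer unless an SDK is installed and configured.
from opentelemetry import trace

tracer = trace.get_tracer("green-tracing-demo")

def estimate_energy_joules() -> float:
    # Stand-in for a real measurement (e.g. RAPL counters or a cloud estimate).
    return 12.5

def handle_checkout() -> None:
    with tracer.start_as_current_span("checkout") as span:
        with tracer.start_as_current_span("query-inventory"):
            pass  # the slow, power-hungry work would live here
        span.set_attribute("carbon.energy_joules", estimate_energy_joules())
        span.set_attribute("carbon.grid_intensity_gco2_per_kwh", 250)

if __name__ == "__main__":
    handle_checkout()
```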
We hope we kept our promise and that this chapter is short and sweet.
We consider this chapter as “Monitoring 101 with a Touch of Green,” an introductory guide to monitoring for those less familiar with the field, with our favorite character, Chuck, leading the way as we explore this space through the lens of green software.
Monitoring and observability have been the foundation of modern production system management for quite some time. This is because what's Rick without his sidekick, Morty? We are looking at significant downtime without either of them.
Borrowing from the Ponemon Institute's eye-opening statistics, the average cost of downtime is a staggering $9,000 per minute, which adds up to well over $500,000 for an hour of downtime. We, therefore, need this entertaining duo to help Chuck shorten his outage recovery time, saving us, his overlords, a significant amount of money and, naturally, reducing carbon emissions.