Reliability, observability, monitoring


Reliability

We can define service reliability as the service doing what it is supposed to do. Being available to users and customers is of course very important, but a service can easily be available without being reliable. Even if your service is up and reachable, it isn’t reliable if it returns errors for every request, or for too many of them.

To know where you stand, you need to measure the reliability of your service: that it’s up, available, and isn’t returning too many errors. That’s where a service level indicator (SLI) comes in. An SLI is a measurement that captures as much of the user journey as possible. A good SLI is an accurate measurement of the reliability of your service, and an approach to reliability based on service level objectives (SLOs) needs good SLIs in order to be effective.
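As a rough illustration of what such a measurement can look like, here is a minimal sketch of an availability SLI computed as the share of requests that did not fail. The `Request` record, the `availability_sli` function, and the rule that only 5xx responses count as failures are assumptions made for this example; a real SLI should reflect whatever "good" means for your own users.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    """Hypothetical record of one served request."""
    status_code: int
    latency_ms: float


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Return the fraction of requests that did not end in a server error.

    Assumption for this sketch: any status code below 500 counts as "good".
    Returns None when there was no traffic, since the ratio is undefined.
    """
    if not requests:
        return None
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


# Four sample requests, one of which failed with a server error.
sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(500, 30.0),
    Request(404, 20.0),
]
sli = availability_sli(sample)
print(f"Availability SLI: {sli:.2%}")
```

The service in this sample was reachable for every request, yet the SLI shows that a quarter of them failed, which is the gap between "available" and "reliable" described above.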

Monitoring

Monitoring is about collecting metrics from a production system to understand what’s really going on in our software and systems. That data is useful for debugging and for triggering alerts when something needs attention.

The challenge of monitoring is separating the signal (the few critical things that need attention) from the noise (the many false alerts).

Suppose you’ve already invested in monitoring tools like Datadog, Prometheus, or New Relic. You certainly need them, but you also need SLOs. Monitoring is a significant source of the data that an SLO-based approach to software reliability relies on. In that sense, monitoring and SLOs are complementary: they go hand in hand. SLOs complete the picture by focusing attention on what truly matters to customers.
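To make the relationship concrete, the sketch below checks request counts, as a monitoring tool might report them, against an SLO target. The function name `slo_report`, the 99.9% target, and the sample counts are all assumptions for illustration. The "error budget" it reports is a common SRE term for the number of failures the target still allows; it is not something defined earlier in this article.

```python
def slo_report(good: int, total: int, slo_target: float) -> dict:
    """Compare measured request counts against an SLO target.

    good       -- number of requests that met the SLI's definition of "good"
    total      -- number of requests served in the measurement window
    slo_target -- target fraction of good requests, strictly between 0 and 1
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("need total > 0 and 0 <= good <= total")

    sli = good / total
    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        # Share of the tolerated failures still unused; negative when overspent.
        "budget_remaining": 1 - actual_bad / allowed_bad,
    }


# Hypothetical window: 10,000 requests, 5 failures, 99.9% target.
report = slo_report(good=9_995, total=10_000, slo_target=0.999)
print(report["slo_met"], round(report["budget_remaining"], 2))
```

Alerting on whether the target is at risk, rather than on every individual failure, is one way to keep the signal and drop the noise.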

Observability

The concept of observability is related to monitoring but is not the same thing. Observability is a measure of how well we can understand the internal state of a system by looking solely at its outputs. In other words, it is how well we can deduce internal causes by observing external symptoms. Those outputs include things like system stats, the availability of a service, its dependencies, and errors. The more observable our infrastructure is, the more success we will have in diagnosing and fixing problems as they arise.

For SREs, the goal is to implement and monitor those measurements so they can have more effective conversations about the state of the product and where to focus their efforts.