Use Case of SLO Configuration
In this section, we walk through creating an SLO for a sample service using
A Typical Example of a Latency SLO for a RESTful Service
First, we want to pick an appropriate Service Level Indicator to measure the latency of responses from a RESTful service. In this example, let's assume our service runs in the NGINX web server, and we're going to use a threshold-based approach to define acceptable behavior. For example, we want the service to respond within a certain amount of time.
There are many ways to measure application performance. In this case we're giving an example of server-side measurement at the application layer (NGINX). However, it might be advantageous for your application to measure this metric differently.
For example, you might choose to measure performance at the client, or at the load balancer, or somewhere else. Your choice depends on what you are trying to measure or improve, as well as what data is currently available as usable metrics for the SLI.
The threshold approach uses a single query, and we set thresholds or breaking points on the results from that query to define the boundaries of acceptable behavior. In the SLO YAML, we specify the indicator like this:
In this example, we’re using Prometheus, but the concepts are similar for other metrics stores. We recommend running the query against your Prometheus instance and reviewing the resulting data, so you can verify that the query returns what you expect and that you understand the units (whether it's returning latencies as milliseconds or fractions of a second, for example). In our case, the query returns values mostly between 60 and 150 milliseconds, with occasional outliers.
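To make the unit check concrete, here is a minimal sketch that parses the JSON body of a Prometheus instant query (the `/api/v1/query` endpoint) and prints the range of values in milliseconds. The sample payload is hand-written for illustration, not real measurement data, and the `unit_scale` parameter assumes the query returns seconds; both are assumptions rather than anything taken from the service in this walkthrough.

```python
import json


def extract_values(payload: str) -> list[float]:
    """Pull sample values out of a Prometheus instant-query JSON response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return [float(series["value"][1]) for series in data["result"]]


def summarize(values: list[float], unit_scale: float = 1.0) -> dict:
    """Report min, max, and count after scaling (e.g. seconds -> milliseconds)."""
    scaled = sorted(v * unit_scale for v in values)
    return {"min": scaled[0], "max": scaled[-1], "count": len(scaled)}


# Hand-written sample response (illustrative values only).
SAMPLE = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"instance":"a"},"value":[1700000000,"0.072"]},'
    '{"metric":{"instance":"b"},"value":[1700000000,"0.141"]}]}}'
)

values = extract_values(SAMPLE)
# Assumption: the query returns seconds, so scale by 1000 to read milliseconds.
summary = summarize(values, unit_scale=1000.0)
print(summary)
```

If the scaled range looks implausible for the service (for example, tens of thousands of milliseconds), the query is probably already returning milliseconds and the scale factor should be 1.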
Choosing a Time Window
When creating an SLO, we need to choose whether we want a rolling or calendar-aligned window:
Calendar-aligned windows are best suited for SLOs that are intended to map to business metrics that are measured on a calendar-aligned basis, such as every calendar month, or every quarter.
Rolling windows are better for tracking the recent user experience of a service (say, over the past week or month).
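The difference between the two window types comes down to how the window boundaries are computed. The sketch below is illustrative only: the function names are hypothetical and it shows the date arithmetic, not how any particular SLO platform implements windows.

```python
from datetime import datetime, timedelta, timezone


def rolling_window(now: datetime, days: int) -> tuple[datetime, datetime]:
    """Window that always covers the last `days` days, ending at `now`."""
    return now - timedelta(days=days), now


def calendar_month_window(now: datetime) -> tuple[datetime, datetime]:
    """Window aligned to the calendar month that contains `now`."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Jump safely into the next month, then snap back to its first day.
    end = (start + timedelta(days=32)).replace(day=1)
    return start, end


now = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
rolling = rolling_window(now, 28)
calendar = calendar_month_window(now)
print("rolling: ", rolling[0].isoformat(), "->", rolling[1].isoformat())
print("calendar:", calendar[0].isoformat(), "->", calendar[1].isoformat())
```

The rolling window moves forward with every evaluation, while the calendar-aligned window stays fixed until the month ends and then starts over.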
For our RESTful service, we will be using a rolling time window to measure the recent user experience. This will help us make decisions about the risk of changes, releases, and how best to invest our engineering resources on a week-to-week or sprint-to-sprint basis. We want the "recent" period that we're measuring to trail long enough to smooth out noise.
We’ll go with a 28-day window, which has the advantage of always containing the same mix of weekdays and weekend days (20 and 8) as it rolls:
- count: 28
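The weekday/weekend property of a 28-day window is easy to verify. The following sketch (helper names are hypothetical) counts weekend days for windows starting on each day of the week and compares 28 days against 30 days.

```python
from datetime import date, timedelta


def weekend_days(start: date, length: int) -> int:
    """Count Saturdays and Sundays in a window of `length` days from `start`."""
    return sum(
        1 for i in range(length) if (start + timedelta(days=i)).weekday() >= 5
    )


def weekend_counts(length: int) -> list[int]:
    """Distinct weekend-day counts across windows starting on each weekday."""
    first = date(2024, 1, 1)  # a Monday; any seven consecutive start days work
    counts = {weekend_days(first + timedelta(days=s), length) for s in range(7)}
    return sorted(counts)


print(weekend_counts(28))  # [8]
print(weekend_counts(30))  # [8, 9, 10]
```

Because 28 is a multiple of 7, the window always holds exactly four of each day of the week. A 30-day window can hold anywhere from 8 to 10 weekend days depending on where it starts, which adds noise unrelated to service behavior.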
Choosing a Budgeting Method
There are two budgeting methods to choose from: Time Slices and Occurrences.
In the Time Slices method, what we count (the objective we measure) is how many good minutes were achieved (minutes where our system is operating within defined boundaries), compared to the total number of minutes in the window.
This is useful for some scenarios, but it has a disadvantage when we're looking at the “recent” user experience, as we are with this SLO. The disadvantage is that a bad minute that occurs during a low-traffic period (say, in the middle of the night for most of our users, when they are unlikely to even notice a performance issue) would penalize the SLO the same amount as a bad minute during peak traffic times.
The Occurrences budgeting method is well suited to this situation. With this method, we count good attempts (in this example, requests that are within defined boundaries) against the count of all attempts (i.e., all requests, including requests that perform outside of the defined boundaries). Since total attempts are fewer during low-traffic periods, it automatically adjusts to lower traffic volumes.
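The difference between the two methods can be sketched with a synthetic day of traffic. Everything here is an assumption made for illustration: the traffic volumes, the ten-minute incident, and the `slice_target` parameter that decides whether a minute counts as good.

```python
def build_day(bad_minutes_at_night: int, bad_minutes_at_peak: int) -> list[tuple[int, int]]:
    """Return (total_requests, good_requests) per minute for a synthetic day."""
    day = []
    for i in range(720):  # quiet half of the day: 20 requests per minute
        bad = i < bad_minutes_at_night
        day.append((20, 0 if bad else 20))
    for i in range(720):  # busy half of the day: 2000 requests per minute
        bad = i < bad_minutes_at_peak
        day.append((2000, 0 if bad else 2000))
    return day


def time_slices(day: list[tuple[int, int]], slice_target: float = 0.95) -> float:
    """Good minutes divided by all minutes."""
    good_minutes = sum(1 for total, good in day if good / total >= slice_target)
    return good_minutes / len(day)


def occurrences(day: list[tuple[int, int]]) -> float:
    """Good requests divided by all requests."""
    return sum(good for _, good in day) / sum(total for total, _ in day)


night_incident = build_day(bad_minutes_at_night=10, bad_minutes_at_peak=0)
peak_incident = build_day(bad_minutes_at_night=0, bad_minutes_at_peak=10)

for label, day in (("night", night_incident), ("peak", peak_incident)):
    print(label, round(time_slices(day), 5), round(occurrences(day), 5))
```

With Time Slices, both incidents cost the same ten bad minutes. With Occurrences, the night incident affects 200 requests while the peak incident affects 20,000, so the peak incident consumes far more of the budget.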
We’ll go with the Occurrences method:
Let’s assume we've talked to our product and support teams and can establish the following thresholds:
- The service has to respond fast enough that users don't see any lag in the web applications that use the service. Our Product Manager thinks that 100ms (1/10th of a second) is a reasonable threshold for what qualifies as okay latency. We want to try to hit that 95% of the time, so we code the first threshold like this:
- budgetTarget: 0.95
This threshold requires that 95% of requests are completed within 100ms.
You can name each threshold however you want. We recommend naming them how a user of the service (or how another service that uses this service) might describe the experience at a given threshold. Typically, we use names that are descriptive adjectives of the experience when the threshold is not met. When this threshold is violated, we can say that the user's experience is "Laggy."
- Some requests fall outside of that 100ms range. We want to make an allowance for that, but we also want to set other thresholds so that we know that even in its worst moments our service is performing acceptably, and/or that its worst moments are brief. Let's define another threshold. The threshold above allows 5% of requests to run longer than 100ms. We want most of that 5% (say, 80% of it, which is 4% of all requests) to still return within a quarter of a second (250ms). That means 99% of requests should return within 250ms (95% + 4%), so we’ll add a threshold like this:
- budgetTarget: 0.99
This threshold requires that 99% of requests are completed within 250ms.
- While that covers the bulk of requests, even within the 1% of requests that we allow to exceed 250ms, the vast majority of them should complete within half a second (500ms). So, let’s add the following threshold:
- budgetTarget: 0.999
This threshold requires that 99.9% of requests are completed within 500ms.
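To see how the three thresholds work together under the Occurrences method, here is a minimal sketch that evaluates a batch of latencies against each one. The latency sample is made up for illustration, the helper names are hypothetical, and the comparison assumes "within" means less than or equal to the limit.

```python
# (limit in milliseconds, target fraction of requests) from the text above.
THRESHOLDS = [(100, 0.95), (250, 0.99), (500, 0.999)]


def derived_target(previous_target: float, share_of_remainder: float) -> float:
    """Target that covers a share of what the previous target left over."""
    return previous_target + share_of_remainder * (1 - previous_target)


def evaluate(latencies_ms: list[float], thresholds: list[tuple[float, float]]) -> list[dict]:
    """Compare the achieved fraction of good requests against each target."""
    total = len(latencies_ms)
    report = []
    for limit_ms, target in thresholds:
        good = sum(1 for v in latencies_ms if v <= limit_ms)  # assumption: <=
        report.append({
            "limit_ms": limit_ms,
            "target": target,
            "achieved": good / total,
            "met": good / total >= target,
            # Requests allowed to miss the limit, minus those that did.
            "budget_remaining": round((1 - target) * total) - (total - good),
        })
    return report


# 80% of the 5% left over by the first threshold gives the second target.
print(round(derived_target(0.95, 0.80), 4))  # 0.99

# Made-up sample of 10,000 requests (not real measurements).
sample = [80] * 9600 + [200] * 280 + [400] * 105 + [900] * 15
report = evaluate(sample, THRESHOLDS)
for row in report:
    print(row)
```

In this sample the 100ms threshold is met (96% against a 95% target), while the 250ms and 500ms thresholds are missed, which would point to a problem in the slow tail rather than in typical requests.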
In sum, our SLO definition for this example use case looks like this:
- apiVersion: n9/v1alpha
nrql: SELECT average(duration) FROM SyntheticRequest WHERE monitorId = '339adbc4-01e4-4517-88cf-ece25cb66156'
- displayName: ok
- displayName: laggy
- displayName: poor
- count: 1