Use case of SLO configuration
In this section, we walk through creating an SLO for a sample service using sloctl.
Remember that you can test and adjust your SLO's targets and view updated error budgets and error budget burn down with the SLI Analyzer.
Once you've determined the right target using your existing data, you can easily create a new SLO.
A typical example of a latency SLO for a RESTful service
First, we want to pick an appropriate service level indicator to measure the latency of responses from a RESTful service. In this example, let's assume our service runs on the NGINX web server, and we're going to use a threshold-based approach to define acceptable behavior: for example, we want the service to respond within a certain amount of time.
There are many ways to measure application performance. In this case, we're giving an example of server-side measurement at the application layer (NGINX). However, it might be advantageous for your application to measure this metric differently.
For example, you might choose to measure performance at the client, or at the load balancer, or somewhere else. Your choice depends on what you are trying to measure or improve, as well as what data is currently available as usable metrics for the SLI.
The threshold approach uses a single query, and we set thresholds or breaking points on the results from that query to define the boundaries of acceptable behavior. In the SLO YAML, we specify the indicator like this:
indicator:
  metricSource:
    name: my-prometheus-instance
    project: default
  rawMetric:
    prometheus:
      promql: server_requestMsec{job="nginx"}
In this example, we're using Prometheus, but the concepts are similar for other metrics stores. We recommend running the query against your Prometheus instance and reviewing the resulting data, so you can verify that the query returns what you expect and that you understand the units (whether it's returning latencies as milliseconds or fractions of a second, for example). This query returns values mostly between 60 and 150 milliseconds, with occasional outliers.
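If your latency metric happened to be exported in different units, you could normalize it in the query itself so that the thresholds we define later can stay in milliseconds. The sketch below is only an illustration: the metric name request_duration_seconds is hypothetical (not part of this example's setup), and it assumes the promql field accepts any valid PromQL expression, including arithmetic:
indicator:
  metricSource:
    name: my-prometheus-instance
    project: default
  rawMetric:
    prometheus:
      # hypothetical gauge exported in seconds, scaled to milliseconds
      promql: request_duration_seconds{job="nginx"} * 1000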
Choosing a time window
When creating an SLO, we need to choose whether we want a rolling or calendar-aligned window:
- Calendar-aligned windows are best suited for SLOs that are intended to map to business metrics measured on a calendar-aligned basis, such as every calendar month or every quarter.
- Rolling windows are better for tracking the recent user experience of a service (say, over the past week or month).
For our RESTful service, we'll use a rolling time window to measure the recent user experience. This will help us make decisions about the risk of changes and releases, and about how best to invest our engineering resources on a week-to-week or sprint-to-sprint basis. We want the "recent" period we're measuring to be long enough to smooth out noise.
We'll go with a 28-day window, which has the advantage of containing an equal number of weekend days and weekdays as it rolls:
timeWindows:
  - count: 28
    isRolling: true
    period:
      begin: "2020-12-01T00:00:00Z"
    unit: Day
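For comparison, if we had chosen a calendar-aligned window instead (say, one calendar month), the time window block would look something like the sketch below. This is an assumption about the calendar-aligned form of the schema (a calendar block with a start time and time zone in place of isRolling); check the current YAML reference for the exact field names:
timeWindows:
  - count: 1
    unit: Month
    # assumed calendar-aligned form: windows align to calendar months,
    # anchored at this start time in the given time zone
    calendar:
      startTime: "2020-12-01 00:00:00"
      timeZone: America/New_York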
Choosing a budgeting method
There are two budgeting methods to choose from: Time Slices and Occurrences.
In the Time Slices method, the objective we measure is how many good minutes were achieved (minutes in which our system operates within the defined boundaries) compared to the total number of minutes in the window.
This is useful for some scenarios but has a disadvantage when looking at the "recent" user experience, as with this SLO. The disadvantage is that a bad minute that occurs during a low-traffic period (say, in the middle of the night for most of our users, when they are unlikely even to notice a performance issue) would penalize the SLO the same amount as a bad minute during peak traffic times.
The Occurrences budgeting method is well suited to this situation. With this method, we count good attempts (in this example, requests that complete within the defined boundaries) against the count of all attempts (i.e., all requests, including those that perform outside the defined boundaries). Since there are fewer total attempts during low-traffic periods, the method automatically adjusts to lower traffic volumes.
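To make that concrete: a bad minute at 3 a.m. that sees only 20 requests burns roughly 20 occurrences of error budget, while the same bad minute at peak traffic with 20,000 requests burns about 20,000. Under Time Slices, each would cost exactly one bad minute regardless of traffic.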
We'll go with the Occurrences method:
budgetingMethod: Occurrences
Establishing thresholds
Letโs assume we've talked to our product and support teams and can establish the following thresholds:
- The service has to respond fast enough that users don't see any lag in the web applications that use the service. Our Product Manager thinks that 100ms (1/10th of a second) is a reasonable threshold for what qualifies as okay latency. We want to try to hit that 95% of the time, so we code the first threshold like this:
- budgetTarget: 0.95
  displayName: Laggy
  value: 100
  op: lte
This threshold requires that 95% of requests are completed within 100ms.
You can name each threshold as you like. We recommend naming them the way a user of the service (or another service that calls it) might describe the experience at a given threshold. Typically, we use names that are descriptive adjectives of the experience when the threshold is not met. When this threshold is violated, we can say that the user's experience is "Laggy."
- Some requests fall outside of that 100ms range, and we want to make an allowance for that. Still, we also want to set other thresholds so we know that even in its worst moments our service is performing acceptably, and that those worst moments are brief. Let's define another threshold. In the threshold above, we allow 5% of requests to run longer than 100ms. We want most of that 5% (say, 80% of those remaining queries) to still return within 1/4th of a second (250ms). That means 99% of the queries should return within 250ms (95% + 4%), so we'll add a threshold like this:
- budgetTarget: 0.99
  displayName: Slow
  value: 250
  op: lte
This threshold requires that 99% of requests are completed within 250ms.
- While that covers the bulk of requests, even within the 1% of requests we allow to exceed 250ms, the vast majority should complete within half a second (500ms). So, let's add the following threshold:
- budgetTarget: 0.999
  displayName: Painful
  value: 500
  op: lte
This threshold requires that 99.9% of requests are completed within 500ms.
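Taken together, the three thresholds are simply entries in one list in the SLO spec. The sketch below just stacks the snippets from above under an objectives key; depending on the schema version you're using, the exact key and field names can differ (the complete example at the end of this section, for instance, uses target rather than budgetTarget):
objectives:
  - budgetTarget: 0.95
    displayName: Laggy
    value: 100
    op: lte
  - budgetTarget: 0.99
    displayName: Slow
    value: 250
    op: lte
  - budgetTarget: 0.999
    displayName: Painful
    value: 500
    op: lte
Over the 28-day rolling window, these targets amount to error budgets of 5% of requests allowed to exceed 100ms, 1% allowed to exceed 250ms, and 0.1% allowed to exceed 500ms.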
Putting it all together, here's what a complete SLO definition looks like. This one comes from a different service (an admin page load measured with New Relic), but the structure is the same:
- apiVersion: n9/v1alpha
  kind: SLO
  metadata:
    displayName: adminpageload
    name: adminpageload
    project: external
  spec:
    alertPolicies: []
    budgetingMethod: Occurrences
    description: ""
    indicator:
      metricSource:
        name: cooperlab
        project: default
      rawMetric:
        newRelic:
          nrql: SELECT average(duration) FROM SyntheticRequest WHERE monitorId = '339adbc4-01e4-4517-88cf-ece25cb66156'
    objectives:
      - displayName: ok
        op: lt
        tag: external.adminpageload.70d000000
        target: 0.98
        value: 70
      - displayName: laggy
        op: lt
        tag: external.adminpageload.85d000000
        target: 0.99
        value: 85
      - displayName: poor
        op: lt
        tag: external.adminpageload.125d000000
        value: 125
    service: venderportal
    timeWindows:
      - count: 1
        isRolling: true
        period:
          begin: "2021-03-08T06:46:08Z"
          end: "2021-03-08T07:46:08Z"
        unit: Hour
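With the definition saved to a file (for example, slo.yaml), you can create the SLO by running sloctl apply -f slo.yaml, and then watch the error budget and burn down update as data comes in.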