SLO error budget
It's helpful to first examine how a standard SLO budget is determined to understand how the composite SLO budget is calculated. For standard SLOs, reliability is the ratio of successful events to total events or good minutes to total minutes over a specific time window. If an SLO has only successful events, its reliability over time would appear as follows:
Reliability
Reliability is a ratio, meaning the number of successful events can never exceed the total number of events. As a result, its value will always fall between 0 and 1, or can be expressed as a percentage.
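The ratio described above can be sketched in a few lines. This is only an illustration: the function name `reliability` and the event counts are hypothetical, and rejecting an empty window is an assumption rather than documented behavior.

```python
def reliability(good: int, total: int) -> float:
    """Return the ratio of good events to total events, a value in [0, 1]."""
    if total <= 0:
        # Assumption: a window with no events has no defined reliability.
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    return good / total


# Hypothetical counts: 9,990 successful requests out of 10,000.
ratio = reliability(good=9_990, total=10_000)
print(f"{ratio:.4f} as a ratio, {ratio:.2%} as a percentage")
```

The same function works for good minutes and total minutes, since only the ratio matters.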
A realistic SLO that has some bad events (some error) will have reliability somewhere below 100%:
Error budget target
Once reliability has been measured, we can set a budget target, which represents the minimum acceptable level of reliability. This target acts as a threshold on the Y-axis of the reliability chart. For example, let's say we set our SLO's target at 75%:
Reliability above the target is acceptable, while reliability below the target is not.
This means our error budget is the remaining 25% (calculated as 100% - 75%), representing the maximum share of errors we are willing to tolerate. To track how much of the error budget we've consumed, we rescale the Y-axis of the reliability chart. In this context, 75% reliability corresponds to 0% of the budget remaining, while 100% reliability represents 100% of the budget remaining.
It's the same graph as before, but with different Y-axis labels. Here, we can see that the entire budget has been consumed, and it turns negative at the point where reliability falls below the target. When viewing the reliability graph, selecting a target for your standard SLO becomes straightforward.
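The Y-axis rescaling can be written as a linear mapping in which the target lands on 0% and 100% reliability lands on 100%. The sketch below assumes that mapping; the name `budget_remaining` is hypothetical and the 75% target comes from the example above.

```python
def budget_remaining(reliability: float, target: float) -> float:
    """Rescale reliability so that `target` maps to 0.0 and 1.0 maps to 1.0."""
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    return (reliability - target) / (1 - target)


# Sample reliability values against the 75% target used in the text.
for r in (1.00, 0.90, 0.75, 0.70):
    print(f"reliability {r:.0%} -> budget remaining {budget_remaining(r, 0.75):.0%}")
```

A reliability of 70% yields a negative value, which matches the chart turning negative once reliability falls below the target.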
Use-case example 1
Suppose you're tracking HTTP requests over a 28-day rolling period. You might set a reliability target of 99.9% for that service, meaning you expect at least 99.9% of all requests to be successful within any 28-day period.
Use-case example 2
Imagine you ping a service every minute and are focused on the time the service is available to clients during a calendar month. In this case, you would use the Timeslices method with a 1-month, calendar-aligned time window. After consulting with your stakeholders, you agree to accept up to 1 hour of downtime per calendar month. This expectation translates to a reliability target of approximately 99.86%, calculated as (720h - 1h) / 720h ≈ 99.86% (where 30 days equals 720 hours, with a 1-hour error budget).
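The arithmetic in this example can be checked with a short sketch. The helper name `target_from_downtime` is hypothetical, and the 30-day month is the same simplification used in the text.

```python
def target_from_downtime(window_hours: float, allowed_downtime_hours: float) -> float:
    """Convert an allowed downtime within a time window into a reliability target."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    if not 0 <= allowed_downtime_hours <= window_hours:
        raise ValueError("allowed downtime must fit inside the window")
    return (window_hours - allowed_downtime_hours) / window_hours


# 30 days * 24 hours = 720 hours, with 1 hour of accepted downtime.
target = target_from_downtime(window_hours=30 * 24, allowed_downtime_hours=1)
print(f"{target:.2%}")  # 99.86%
```

Because calendar months differ in length, the same 1-hour allowance corresponds to a slightly different percentage in a 28-day or 31-day month.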
Error budget of a composite SLO
Knowing how the budget is calculated for standard SLOs, let’s look at how it’s calculated for composite SLOs. The composite budget is calculated based on the reliability over a given time window of multiple other SLOs.
Let’s take a look at several standard SLOs with 100% reliability:
Each component's reliability is expressed as a percentage on a scale from 0% to 100%.
Reliability is always calculated over a specific time period. When assessing the reliability of component SLOs aggregated into a composite SLO, the relevant time period is the composite’s time window, which may differ from the time windows configured for individual SLOs.
The reliability of a composite SLO is also measured on a scale from 0% to 100%, but it reflects the combined reliability of all its components. This can be visualized as a stacked area chart:
Now, let’s take a look at a more realistic example where each component has some error and reliability below 100%:
The reliability of a composite SLO composed of these SLOs would look like this:
The reliabilities of the components are “stacked” and normalized to 100%. In Nobl9, this result is presented without coloring the individual components:
The composite SLO’s target is also just a reliability threshold. It’s a point selected on the Y-axis that indicates the lowest acceptable reliability.
Let’s assume the target for our example composite is set to 75%:
The remaining budget of the composite SLO is the same as its reliability, but with the Y-axis rescaled so that the target sits at 0%. This reflects that, by definition, we accept reliability below 100% but not lower than the target.
This is the same chart as above, but with the Y-axis scaled. The peaks and valleys appear steeper, but that's simply a result of stretching the diagram vertically.
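To make the stacking and rescaling concrete, the sketch below treats the equal-weight composite reliability at each point in time as the mean of the component reliabilities, then rescales it against the 75% target. The component values are invented for illustration, and the function names are hypothetical. It is a simplified reading of the description above, not the product's implementation.

```python
def composite_series(components: dict[str, list[float]]) -> list[float]:
    """Equal-weight composite reliability at each time point (mean of components)."""
    series = list(components.values())
    if not series:
        raise ValueError("at least one component is required")
    return [sum(point) / len(series) for point in zip(*series)]


def to_budget(reliability: float, target: float) -> float:
    """Rescale reliability so that `target` maps to 0.0 and 1.0 maps to 1.0."""
    return (reliability - target) / (1 - target)


# Hypothetical reliability of four components at four points in time.
components = {
    "SLO A": [1.00, 0.90, 0.60, 0.80],
    "SLO B": [1.00, 1.00, 0.80, 0.90],
    "SLO C": [1.00, 0.80, 0.70, 0.90],
    "SLO D": [1.00, 0.90, 0.50, 1.00],
}

composite = composite_series(components)
budget = [to_budget(r, target=0.75) for r in composite]

for r, b in zip(composite, budget):
    print(f"composite reliability {r:.0%} -> budget remaining {b:.0%}")
```

With these numbers, a reliability swing from 90% to 65% becomes a budget swing from 60% to -40%, which is why the rescaled chart looks steeper.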
Error budget of a composite SLO with weighted components
So far we've considered a scenario where all components were weighted equally:
SLO | SLO A | SLO B | SLO C | SLO D |
---|---|---|---|---|
Weights | 1 | 1 | 1 | 1 |
Normalized Weights | 1 / (1 + 1 + 1 + 1) = 25% | 1 / (1 + 1 + 1 + 1) = 25% | 1 / (1 + 1 + 1 + 1) = 25% | 1 / (1 + 1 + 1 + 1) = 25% |
All normalized weights in our example are equal to 25%, which indicates that in the chart of the “composite’s reliability without errors,” each component contributes 25% to the composite’s overall reliability:
The maximum share of a composite SLO’s reliability that a given component can burn equals that component's normalized weight.
This implies that a single component SLO, unless it’s the only component in a composite SLO, cannot bring the reliability of the composite SLO down to 0%.
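A quick worst-case check illustrates this limit. The sketch assumes the composite reliability is the weighted average of component reliabilities, as the stacked charts suggest. The function name and the scenario (one component fully failing while the others stay at 100%) are hypothetical.

```python
def composite_reliability(reliabilities: list[float], weights: list[float]) -> float:
    """Weighted average of component reliabilities using normalized weights."""
    if len(reliabilities) != len(weights):
        raise ValueError("each component needs exactly one weight")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w / total * r for r, w in zip(reliabilities, weights))


weights = [1, 1, 1, 1]  # equal weights, so each normalized weight is 25%

# Worst case for one component: it drops to 0% while the others stay at 100%.
worst_case = composite_reliability([0.0, 1.0, 1.0, 1.0], weights)
print(f"composite reliability: {worst_case:.0%}")  # 75%
```

The composite loses exactly the failing component's 25% share and no more, so it cannot reach 0% unless every component fails.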
Let’s take a look at how our example changes when we assign different weights to different components:
SLO | SLO A | SLO B | SLO C | SLO D |
---|---|---|---|---|
Weights | 8 | 4 | 1 | 2 |
Normalized Weights | 8 / (8 + 4 + 1 + 2) = 53% | 4 / (8 + 4 + 1 + 2) = 27% | 1 / (8 + 4 + 1 + 2) = 7% | 2 / (8 + 4 + 1 + 2) = 13% |
When all component SLOs have 100% reliability but different weights, the reliability of the composite SLO now looks like this:
Note how the thickness of different bands corresponds to their component's normalized weight. The larger the normalized weight, the thicker the band.
An important thing to note is that we can assign different weights and still get the same values of normalized weights:
SLO | SLO A | SLO B | SLO C | SLO D |
---|---|---|---|---|
Weights | 24 | 12 | 3 | 6 |
Normalized Weights | 24 / (24 + 12 + 3 + 6) = 53% | 12 / (24 + 12 + 3 + 6) = 27% | 3 / (24 + 12 + 3 + 6) = 7% | 6 / (24 + 12 + 3 + 6) = 13% |
The absolute weight of a single component doesn’t matter. What matters is the ratio of that weight to the weights of other components.
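The normalization in the tables above can be reproduced with exact fractions, which also shows that scaling every weight by the same factor changes nothing. The function name `normalize` is hypothetical, and the weights are the ones from the two tables.

```python
from fractions import Fraction


def normalize(weights: dict[str, int]) -> dict[str, Fraction]:
    """Divide each weight by the sum of all weights, keeping exact fractions."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: Fraction(weight, total) for name, weight in weights.items()}


first = normalize({"SLO A": 8, "SLO B": 4, "SLO C": 1, "SLO D": 2})
second = normalize({"SLO A": 24, "SLO B": 12, "SLO C": 3, "SLO D": 6})

# Both weight sets reduce to the same fractions: 8/15, 4/15, 1/15, 2/15.
print(first == second)
for name, share in first.items():
    print(f"{name}: {share} is about {float(share):.0%}")
```

`Fraction` is used here only so the equality check is exact. Plain floating-point division would give the same rounded percentages.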
If we take the same component SLO data as before, with some errors, but apply the new weights, our composite SLO’s reliability would look like this:
Now, let’s set a 75% target for that composite SLO:
The top 25% is our error budget, so the budget burned chart will look like this:
We can notice a few things about our newly calculated composite budget:
- It is different from the case when all weights were equal.
- The shape of the budget somewhat resembles the shape of SLO A’s budget. That is because SLO A has significantly more weight than the other components. The other components also contribute to the burn of the composite SLO’s budget, but not as much.
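Both observations can be reproduced with a small comparison. The sketch assumes the composite reliability is the weighted average of component reliabilities and uses the 75% target from the text. The snapshot values and function names are invented for illustration.

```python
def composite(reliabilities: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of component reliabilities using normalized weights."""
    total = sum(weights[name] for name in reliabilities)
    return sum(weights[name] / total * r for name, r in reliabilities.items())


def budget_remaining(reliability: float, target: float) -> float:
    """Rescale reliability so that `target` maps to 0.0 and 1.0 maps to 1.0."""
    return (reliability - target) / (1 - target)


# Hypothetical snapshot: SLO A is struggling, the other components are healthy.
snapshot = {"SLO A": 0.60, "SLO B": 0.95, "SLO C": 0.95, "SLO D": 0.95}

equal_weights = {"SLO A": 1, "SLO B": 1, "SLO C": 1, "SLO D": 1}
custom_weights = {"SLO A": 8, "SLO B": 4, "SLO C": 1, "SLO D": 2}

equal = composite(snapshot, equal_weights)
weighted = composite(snapshot, custom_weights)

print(f"equal weights:  reliability {equal:.2%}, budget {budget_remaining(equal, 0.75):.1%}")
print(f"custom weights: reliability {weighted:.2%}, budget {budget_remaining(weighted, 0.75):.1%}")
```

With equal weights the composite keeps a comfortable share of its budget. With a weight of 8 on SLO A, the same component data leaves the budget close to exhausted, because SLO A's errors count for more than half of the composite.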