Reliability and error budget calculations
This guide answers common questions about system reliability and error budgets. We'll explore how reliability and error budgets recover after an incident, and whether incoming good traffic can reduce the impact of a recent incident.
Reliability and error budget
Reliability is the ratio of good events to total events within the time window of your SLO. For example, if your SLO is configured with a rolling time window of 28 days, you can sum all points received in the last 28 days (total_count) and all good points among them (good_count), and then calculate the reliability, expressed as a percentage, with the following simple equation:
reliability = good_count / total_count
Considering this, we can see that:
0% means no events inside the SLO's time window are good
100% means all events inside the time window are good
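The ratio above can be sketched in a few lines of Python. This is only an illustration: the function name `reliability` is hypothetical, and raising an error for an empty time window is an assumption, since the text does not say how a window without events is handled.

```python
def reliability(good_count: int, total_count: int) -> float:
    """Return reliability as a percentage of good events in the time window."""
    if total_count == 0:
        # Assumption: with no events the ratio is undefined, so we refuse to guess.
        raise ValueError("no events in the time window, reliability is undefined")
    if not 0 <= good_count <= total_count:
        raise ValueError("good_count must be between 0 and total_count")
    return 100.0 * good_count / total_count


print(reliability(0, 500))    # 0.0: no good events in the window
print(reliability(500, 500))  # 100.0: every event in the window is good
```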
The error budget is the portion of reliability between your SLO's target and 100%.
So, for instance, if we aimed for 90% reliability, 10% (= 100% - 90%) of this reliability would be our error budget.
Our SLO, however, has a target of 99.99%, which means that its error budget comprises 0.01% of its reliability:
- unit: Day
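As a rough sketch of the relationship between target and error budget, the snippet below derives the budget from a target. The function names are hypothetical. `allowed_bad_events` is an added illustration that follows from the reliability ratio, not a formula quoted from this guide.

```python
def error_budget_percent(target_percent: float) -> float:
    """Error budget is whatever lies between the target and 100%."""
    if not 0.0 < target_percent <= 100.0:
        raise ValueError("target must be in the range (0, 100]")
    return 100.0 - target_percent


def allowed_bad_events(total_count: int, target_percent: float) -> float:
    """Illustrative: how many bad events fit in the budget for a given total."""
    return total_count * error_budget_percent(target_percent) / 100.0


for target in (90.0, 99.99):
    print(f"target {target}% -> error budget {error_budget_percent(target):.2f}%")

# With 1,000,000 events in the window, a 99.99% target leaves room for ~100 bad ones.
print(f"{allowed_bad_events(1_000_000, 99.99):.0f}")
```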
Taking this into consideration, we can now move forward to address the following questions:
When will reliability recover after a bad event?
Both reliability and error budget recover 28 days after a bad event, once its points drop out of the rolling time window. This is the configuration of our SLO:
- unit: Day
Now, we can see three spikes of bad events that Nobl9 received for our SLO:
The first incident happened on 12 July at around 08:00. Before it, both our reliability and our error budget were at 100%.
100%. However, we started to burn the budget when the number of good points was lower than the total points. We can see that on the chart below (where the blue line represents reliability). This situation lasted for around 4 hours. The reliability steadily decreased over time for this duration and eventually finished on
The budget will recover 28 days after this incident (these points will be dropped out of the time window), and we’ll recover the burnt Error budget. This increase will be equal to the value lost on 12 July, that is:
The recovery will happen on 9 Aug.
The second incident happened on 14 July at around 10:00, so, given the 28-day rolling time window, the error budget we lost will recover on 11 Aug. Similarly, the third incident happened on 19 July at around 09:00, so the error budget we lost then will recover on 16 Aug.
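The recovery dates above are simply each incident's start plus the window length. A minimal sketch, assuming the year 2023 (the guide gives no year) and a hypothetical helper named `recovery_time`:

```python
from datetime import datetime, timedelta

# Rolling time window length taken from the SLO described in the text.
WINDOW = timedelta(days=28)


def recovery_time(incident_start: datetime, window: timedelta = WINDOW) -> datetime:
    """Moment when the first bad points of an incident leave the rolling window."""
    return incident_start + window


# The year is an assumption; only day, month, and hour come from the text.
incidents = [
    datetime(2023, 7, 12, 8, 0),
    datetime(2023, 7, 14, 10, 0),
    datetime(2023, 7, 19, 9, 0),
]

for start in incidents:
    print(start.date(), "->", recovery_time(start).date())
# 2023-07-12 -> 2023-08-09
# 2023-07-14 -> 2023-08-11
# 2023-07-19 -> 2023-08-16
```

Because an incident lasts for some time, the budget is presumably regained gradually, over a span as long as the incident itself, starting at the computed moment.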
Does incoming good traffic reduce the impact of a bad event in the past?
In short, the answer is: yes. But let's go back to the definition of reliability (good_count/total_count) to understand what's happening here. Let's also split total_count into good_count and bad_count, because the total number of events is always equal to the sum of all good and all bad events.
Remember that our SLO is configured with a rolling time window (28-day length). It means we're always looking at the last 28 days from now and counting total_count over that period. As the time window rolls, some points fall out every minute, and some new points can fall into it. With this SLO configuration, it's natural for the total number of events to change in the process.
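The rolling behavior described above can be sketched with a queue of points. This is not Nobl9's implementation: the class `RollingWindow`, the sample numbers, and the choice to treat a point exactly 28 days old as expired are all assumptions made for illustration. Points are expected to arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingWindow:
    """Keeps running good/total counts for points inside a rolling time window."""

    def __init__(self, length: timedelta):
        self.length = length
        self.points = deque()  # entries are (timestamp, good, total)
        self.good_count = 0
        self.total_count = 0

    def add(self, ts: datetime, good: int, total: int) -> None:
        """Add a new point, then drop every point that has left the window."""
        self.points.append((ts, good, total))
        self.good_count += good
        self.total_count += total
        cutoff = ts - self.length
        # Assumption: a point exactly `length` old counts as expired.
        while self.points and self.points[0][0] <= cutoff:
            _, old_good, old_total = self.points.popleft()
            self.good_count -= old_good
            self.total_count -= old_total

    def reliability(self) -> float:
        return 100.0 * self.good_count / self.total_count


start = datetime(2023, 7, 1)  # hypothetical date
window = RollingWindow(timedelta(days=28))
window.add(start, good=8, total=10)                         # point with 2 bad events
window.add(start + timedelta(days=10), good=10, total=10)
print(window.reliability())  # 90.0
window.add(start + timedelta(days=28), good=10, total=10)   # first point expires
print(window.reliability())  # 100.0
```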
Underlying logic – example
The total changes because an expired point can carry a different value than the new incoming point. In such a scenario, the reliability and error budget change even though no new bad events arrived. Consider this example:
Old point (expired): good=20
New point (incoming): good=50
The number of good points in the time window: good_count=200, excluding the two points mentioned above (old and new)
The number of bad events in the time window: bad_count=40 (the number of bad events did not change because only good points arrived)
Let's calculate the reliability before the arrival of the new point:
(200 + 20) / (200 + 20 + 40) = 220 / 260 ≈ 84.62%
When the new point arrives, good=20 expires, and we now take good=50 into account instead. The number of bad events does not change (it is still equal to 40). The reliability changes to the following:
(200 + 50) / (200 + 50 + 40) = 250 / 290 ≈ 86.21%
Our reliability increased even though the number of bad events in this time window did not change. It changed because the number of good events increased, so the proportion of good/total events also increased.
Conversely, if the old good point (expired) were greater than the new good point (incoming), the reliability would decrease because the proportion of good/total would decrease.
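The arithmetic of this example, in both directions, can be checked with a short script. The function names are hypothetical; the numbers are the ones used in the example above.

```python
def reliability_percent(good_count: int, bad_count: int) -> float:
    """Reliability with total_count split into good_count + bad_count."""
    return 100.0 * good_count / (good_count + bad_count)


def swap_point(other_good: int, bad_count: int,
               expired_good: int, incoming_good: int) -> tuple[float, float]:
    """Reliability before and after one good point expires and another arrives."""
    before = reliability_percent(other_good + expired_good, bad_count)
    after = reliability_percent(other_good + incoming_good, bad_count)
    return before, after


# Expired point smaller than the incoming one: reliability rises.
before, after = swap_point(other_good=200, bad_count=40, expired_good=20, incoming_good=50)
print(f"{before:.2f}% -> {after:.2f}%")  # 84.62% -> 86.21%

# Expired point larger than the incoming one: reliability falls.
before, after = swap_point(other_good=200, bad_count=40, expired_good=50, incoming_good=20)
print(f"{before:.2f}% -> {after:.2f}%")  # 86.21% -> 84.62%
```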
This behavior is a natural part of SLO calculations for every configuration in which the number of events in the time window varies, such as occurrences. You can also see similar behavior in any calendar-aligned SLO that has just started a new calendar-aligned time window.