Reliability and error budget calculations
This guide answers questions about system reliability and error budgets. We'll explore the following:
- How do reliability and error budgets recover after setbacks?
- Can incoming good traffic reduce the impact of a recent incident?
Reliability and error budget: what's it all about?
Before diving deeper into these areas, let's take a look at the definitions of reliability and error budget. These values are closely related.
Reliability
Reliability is the ratio of good events to total events within the time window of your SLO. For example, if your SLO is configured with a rolling time window of 28 days, you can sum all incoming points from the last 28 days (total_count) and all incoming good points (good_count), and then calculate the reliability with the following equation:
Reliability = (# Good events)/(# Total events) within the SLO's time window
Considering this, we can see that:
- Reliability = 0% means no good events are inside the SLO's time window
- Reliability = 100% means all events inside the time window are good
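The equation above can be sketched in a few lines of Python. This is only an illustration: the function name and the error raised for an empty window are assumptions, since the text does not say how a window with no events is treated.

```python
def reliability(good_count: int, total_count: int) -> float:
    """Return the ratio of good events to total events in the time window."""
    if total_count == 0:
        # Assumption: with no events the ratio is undefined, so we refuse to guess.
        raise ValueError("reliability is undefined when no events were recorded")
    if not 0 <= good_count <= total_count:
        raise ValueError("good_count must be between 0 and total_count")
    return good_count / total_count


print(reliability(0, 500))    # 0.0 -> no good events in the window
print(reliability(500, 500))  # 1.0 -> every event in the window is good
```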
Error budget
Error budget represents the percentage of reliability above your SLO's target, that is, the gap between the target and 100%.
So, for instance, if we aimed for 90%, then 10% (=100%-90%) of this reliability would be our error budget:
objectives:
  - countMetrics:
      good:
        prometheus:
          promql: "1"
      incremental: true
      total:
        prometheus:
          promql: "1"
    displayName: available1
    name: objective-1
    target: 0.9
    value: 1
In our example, the SLO's target is 99.99%. It means that our example SLO's error budget comprises 0.01% of its reliability:
...
timeWindows:
  - unit: Day
    count: 28
    isRolling: true
objectives:
  - countMetrics:
      good:
        prometheus:
          promql: "1"
      incremental: true
      total:
        prometheus:
          promql: "1"
    displayName: available1
    name: objective-1
    target: 0.99
    value: 1
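The relationship between a target and its error budget (100% minus the target) can be checked with a short sketch. The helper name is hypothetical, and the requirement that the target lies strictly between 0 and 1 is an assumption made for this example. Exact fractions are used so that decimal targets do not pick up floating-point noise.

```python
from fractions import Fraction


def error_budget(target: str) -> Fraction:
    """Return 1 - target as an exact fraction; target is a string such as "0.9"."""
    parsed = Fraction(target)
    if not 0 < parsed < 1:
        # Assumption: a usable target is strictly between 0 and 1.
        raise ValueError("target must be strictly between 0 and 1")
    return 1 - parsed


for target in ("0.9", "0.9999"):
    budget = error_budget(target)
    print(f"target {float(Fraction(target)):.2%} -> error budget {float(budget):.2%}")
# target 90.00% -> error budget 10.00%
# target 99.99% -> error budget 0.01%
```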
When will reliability recover after a bad event?
For rolling time windows, the reliability recovers once the full time window duration has passed since the bad event occurred:
the time of the bad event + the time window duration. For example, 12 July, 12:54 + 28 days = 9 August, 12:54
This is because the rolling time window continuously moves forward, and data points are registered and dropped on a first-in, first-out (FIFO) basis. As a result, the bad event falls out as soon as the time window rolls over that time. After this happens, the old bad event is no longer considered in calculations. If good points replace the bad ones, the reliability improves.
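The first-in, first-out behavior can be illustrated with a small sketch. It is not how Nobl9 is implemented: the class, the one-point-per-day traffic, the year, and the rule that a point exactly one window old is already outside the window are all assumptions made for the example.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingWindow:
    """Keeps (timestamp, good, total) points, oldest first."""

    def __init__(self, duration: timedelta) -> None:
        self.duration = duration
        self.points = deque()

    def add(self, timestamp: datetime, good: int, total: int) -> None:
        """Register a new point, then drop every point the window rolled past."""
        self.points.append((timestamp, good, total))
        cutoff = timestamp - self.duration
        # Assumption: a point exactly one window old is already outside.
        while self.points and self.points[0][0] <= cutoff:
            self.points.popleft()

    def reliability(self) -> float:
        good = sum(point[1] for point in self.points)
        total = sum(point[2] for point in self.points)
        return good / total


window = RollingWindow(timedelta(days=28))
bad_event = datetime(2023, 7, 12, 12, 54)  # the year is an assumption
window.add(bad_event, good=0, total=1)

# One good point per day for the next 27 days: the bad point is still inside.
for day in range(1, 28):
    window.add(bad_event + timedelta(days=day), good=1, total=1)
before_recovery = window.reliability()  # 27 good points out of 28

# The point arriving exactly 28 days later pushes the bad event out.
recovery_time = bad_event + timedelta(days=28)
window.add(recovery_time, good=1, total=1)

print(recovery_time)         # 2023-08-09 12:54:00
print(before_recovery < 1)   # True
print(window.reliability())  # 1.0
```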
Rolling time window use case
Since we set the time window to 28 days in our example, both reliability and error budget will recover 28 days after any bad event. This is the configuration of our SLO:
...
timeWindows:
  - unit: Day
    count: 28
    isRolling: true
Now, we can see three spikes of bad events reported to our SLO in Nobl9:
The first incident happened on 12 July at around 08:00. Before it, our reliability and error budget were both at 100%. We started to burn the error budget once the number of good points became lower than the total points. This situation lasted for around four hours. The reliability steadily decreased during this time and eventually settled at 99.9981362405808%.
Image 3 illustrates this scenario. The blue line represents reliability:
The time window rolls further, and after 28 days, the bad points are dropped out of it:
On 9 August, 28 days after 12 July, the reliability regains the value lost on 12 July, that is:
The second incident happened on 14 July at around 10:00. The time window is still 28 days, so after this period, the bad points will be swept out of it, and the error budget we lost will recover. This recovery will happen on 11 August at around 10:00.
Similarly, the third incident happened on 19 July at around 09:00, so the error budget will recover on 16 August at around 09:00.
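The three recovery times follow from adding the 28-day window to each incident's start time. The sketch below reproduces that arithmetic. The year is an assumption, since the text gives only days and months, and the dictionary keys are illustrative labels.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)
YEAR = 2023  # assumption: the text does not state a year

incidents = {
    "first": datetime(YEAR, 7, 12, 8, 0),
    "second": datetime(YEAR, 7, 14, 10, 0),
    "third": datetime(YEAR, 7, 19, 9, 0),
}

# Each incident leaves the rolling window one full window length after it started.
recoveries = {name: start + WINDOW for name, start in incidents.items()}

for name, when in recoveries.items():
    print(name, when.isoformat(sep=" ", timespec="minutes"))
# first 2023-08-09 08:00
# second 2023-08-11 10:00
# third 2023-08-16 09:00
```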
For calendar-aligned time windows, the reliability recovers upon the following:
- New good points outweigh that bad event during the same time window. This applies only to the Occurrences error budget calculation method.
- The next time window starts.
Does incoming good traffic reduce the impact of a bad event in the past?
In short, the answer is yes. This scenario can happen when the SLO is configured with the Occurrences error budget calculation method, regardless of the time window type.
This method considers the total number of events within the time window period. The more events are registered within the time window, the less each single event weighs.
This is explained by the reliability definition: (# Good events)/(# Total events). Knowing that the total is always equal to the sum of good and bad, then:
Reliability = (# Good events)/(# Good + Bad events)
As the total number of events grows, each additional good data point raises the reliability, even if bad data points are present in the time window as well.
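The effect of extra good traffic can also be shown with a short derivation. The symbols are introduced here for illustration only: G and B are the good and bad events already in the window, and g is the number of additional good events.

```latex
\[
\frac{G+g}{G+B+g} - \frac{G}{G+B}
  = \frac{(G+g)(G+B) - G\,(G+B+g)}{(G+B+g)(G+B)}
  = \frac{g\,B}{(G+B+g)(G+B)} > 0
  \quad \text{for } g > 0,\ B > 0 .
\]
```

So whenever the window holds at least one bad event, any additional good events strictly increase the reliability. With G = 80, B = 20, and g = 100, the gain is 2000/20000 = 0.1.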
Occurrences use case
Let's imagine we started a 1-day calendar-aligned period.
- Until 1:00, we registered the following points: bad=20 and good=80
  - This gave us reliability 80/(80+20)=80/100=0.8
- During the next hour, we registered additional good=100
  - At 2:00, reliability was 180/(80+20+100)=180/200=0.9
As a result, reliability increased by 0.1 because of the overall number of incoming points.
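The same numbers can be reproduced with a few lines of Python. This is a minimal sketch in which the function and variable names are illustrative:

```python
def reliability(good: int, bad: int) -> float:
    """Reliability = good / (good + bad)."""
    return good / (good + bad)


good, bad = 80, 20                # points registered until 1:00
at_one = reliability(good, bad)

good += 100                       # additional good points during the next hour
at_two = reliability(good, bad)

print(at_one)                     # 0.8
print(at_two)                     # 0.9
print(round(at_two - at_one, 3))  # 0.1
```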
Another example with the rolling time window describes the situation when expired and new incoming data points have different values. The reliability and error budget will change, although no bad events are registered.
- Assume we had good=220 and bad=40 data points
  - It gave us reliability 220/(220+40)=220/260=0.846
- The time window rolled. It resulted in the following:
  - Expired point: good=20
  - Registered new point: good=50
  - No bad points have arrived or expired
  - Reliability changed: (220-20+50)/(220-20+50+40)=250/290=0.862
As a result, reliability increased by 0.016 because of the overall number of incoming points.
Similarly, if the expired good count is greater than the newly arrived good count, the reliability will decrease because of the reduced good/total proportion.
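Both directions of the rolling-window example can be checked with a short sketch. The helper names are illustrative, and the values in the decreasing case (50 good points expired, 20 arrived) are hypothetical, chosen only to mirror the example above.

```python
def reliability(good: int, bad: int) -> float:
    """Reliability = good / (good + bad)."""
    return good / (good + bad)


def roll(good: int, expired_good: int, new_good: int) -> int:
    """Good count after the window drops expired points and takes in new ones."""
    return good - expired_good + new_good


good, bad = 220, 40
before = reliability(good, bad)

# Case from the text: 20 good points expire, 50 good points arrive.
increased = reliability(roll(good, expired_good=20, new_good=50), bad)

# Hypothetical reverse case: 50 good points expire, 20 good points arrive.
decreased = reliability(roll(good, expired_good=50, new_good=20), bad)

print(round(before, 3))              # 0.846
print(round(increased, 3))           # 0.862
print(round(increased - before, 3))  # 0.016
print(round(decreased, 3))           # 0.826
```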