
Reliability and error budget calculations


This guide will answer questions about system reliability and error budgets. We'll explore how reliability and error budgets recover after setbacks. We’ll also discuss whether good incoming traffic can reduce the impact of a recent incident.

Reliability and error budget

Before diving deeper into these areas, let’s have a look at the definitions of Reliability and Error budget. These values are closely related.

Reliability

Reliability is a ratio of good to total events within the Time Window of your SLO. For example, if your SLO is configured with a rolling time window with a length of 28 days, you can sum all incoming points from the last 28 days (total_count) and all incoming good points (good_count) and then calculate the reliability with the following simple equation:

    reliability = good_count / total_count

Image 1: Simplified equation for calculating reliability

Considering this, we can see that:

  • Reliability = 0% means no good events are inside the SLO's time window

  • Reliability = 100% means all events inside the time window are good
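The definition above can be sketched in a few lines of Python. This is a minimal illustration, not Nobl9's implementation: the function name `reliability` and the decision to raise an error for an empty time window are assumptions made for the example.

```python
def reliability(good_count: int, total_count: int) -> float:
    """Return reliability as a percentage: good events / total events."""
    if total_count == 0:
        # Assumption: an empty time window has no defined reliability.
        raise ValueError("no events in the time window")
    return 100.0 * good_count / total_count


# The two boundary cases from the list above:
print(reliability(0, 500))    # no good events in the window
print(reliability(500, 500))  # every event in the window is good
```

The sample counts (500 events) are arbitrary; only the ratio matters.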

Error budget

The error budget is the margin between 100% reliability and your SLO's target: the share of events that can be bad while the SLO is still met.

For instance, if we aim for 90%, the remaining 10% (= 100% - 90%) is our error budget:

    objectives:
      - countMetrics:
          good:
            prometheus:
              promql: "1"
          incremental: true
          total:
            prometheus:
              promql: "1"
        displayName: available1
        name: objective-1
        target: 0.9
        value: 1

Our SLO, however, has a target of 99% (target: 0.99), which means that its error budget is 1%:

    ...
    timeWindows:
      - unit: Day
        count: 28
        isRolling: true
    objectives:
      - countMetrics:
          good:
            prometheus:
              promql: "1"
          incremental: true
          total:
            prometheus:
              promql: "1"
        displayName: available1
        name: objective-1
        target: 0.99
        value: 1
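As a rough sketch of how the `target` field maps to an error budget, the snippet below derives the budget from the target and estimates how much of it is left for a given reliability. The function names are hypothetical, and the remaining-budget formula assumes the budget is spent in proportion to the share of bad events; it is an illustration, not Nobl9's exact calculation.

```python
import math


def error_budget(target: float) -> float:
    """Total error budget as a fraction, e.g. target 0.9 -> 0.1."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return 1.0 - target


def remaining_budget(reliability: float, target: float) -> float:
    """Unspent share of the budget: 1.0 = untouched, below 0 = overspent.

    Assumption: budget is consumed in proportion to the share of bad events.
    """
    return (reliability - target) / error_budget(target)


# Targets taken from the two configurations above.
print(f"{error_budget(0.9):.2%}")
print(f"{error_budget(0.99):.2%}")

# With a 0.9 target and 0.95 reliability, about half the budget is left.
print(math.isclose(remaining_budget(0.95, 0.9), 0.5))
```

Both inputs are fractions (0.99 rather than 99%) to match the `target` field in the YAML.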

Taking this into consideration, we can now move forward to address the following questions:

When will reliability recover after a bad event?

Both reliability and error budget recover 28 days after a bad event, because that is when the bad points drop out of the rolling time window. This is the configuration of our SLO:

    ...
    timeWindows:
      - unit: Day
        count: 28
        isRolling: true

Now, we can see three spikes of bad events reported to our SLO in Nobl9:

Image 2: Spikes in error budget burn

The first incident happened on 12 July at around 08:00. Before it, both our reliability and our error budget were at 100%. We started to burn the budget as soon as the number of good points fell below the number of total points. We can see that on the chart below (where the blue line represents reliability). This situation lasted for around 4 hours. Reliability decreased steadily throughout and eventually settled at 99.9981362405808%.

Image 3: Decrease of reliability in the SLO

The budget will recover 28 days after this incident, when these points drop out of the time window and we regain the burnt error budget. Reliability will then increase by the value lost on 12 July, that is:

    100% - 99.9981362405808% = 0.0018637594192%

Image 4: Reliability recovery

The recovery will happen on 9 Aug.

The second incident happened on 14 July at around 10:00, so, given the 28-day rolling time window, the error budget we lost will recover on 11 Aug. Similarly, the third incident happened on 19 July at around 09:00, so the error budget we lost then will recover on 16 Aug.
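The recovery dates above follow from adding the window length to each incident's start time. A minimal sketch, assuming a placeholder year (the guide does not state one) and using the approximate start times quoted in the text:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=28)  # rolling time window from the SLO configuration
YEAR = 2023                  # hypothetical: the year is not given in the guide

# Approximate incident start times from the text above.
incidents = {
    "first": datetime(YEAR, 7, 12, 8, 0),
    "second": datetime(YEAR, 7, 14, 10, 0),
    "third": datetime(YEAR, 7, 19, 9, 0),
}


def recovery_time(incident_start: datetime) -> datetime:
    """Moment when the first bad points of an incident leave the window."""
    return incident_start + WINDOW


for name, start in incidents.items():
    print(name, recovery_time(start).date())
```

Because an incident lasts for some time (around 4 hours for the first one), its bad points leave the window gradually over the same span, so the recovery begins at the computed moment rather than completing instantly.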

Does incoming good traffic reduce the impact of a bad event in the past?

In short, the answer is: yes. But let's go back to the definition of reliability (good_count/total_count) to understand what's happening here. Let's also split total_count into good_count + bad_count, because the total number of events is always equal to the sum of all good and all bad events.

Remember that our SLO is configured with a rolling time window (28-day length). It means we're always looking at the last 28 days from now and counting good_count, bad_count, and total_count. As the time window rolls, some points fall out every minute, and some new points can fall into it. It's natural for this SLO’s configuration that the total number of events will change in this process.
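The rolling behavior described above can be sketched with a small in-memory window. This is an illustrative model only: the class name `RollingWindow`, the `(timestamp, good, total)` point layout, and the choice to expire points that are exactly 28 days old are all assumptions, not a description of how Nobl9 stores data.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingWindow:
    """Keeps only the points that fall inside the last `length` of time."""

    def __init__(self, length: timedelta = timedelta(days=28)):
        self.length = length
        self.points = deque()  # (timestamp, good, total), oldest first

    def add(self, ts: datetime, good: int, total: int) -> None:
        """Add a new point and drop every point that has left the window."""
        self.points.append((ts, good, total))
        # Assumption: a point exactly `length` old is already expired.
        while self.points and self.points[0][0] <= ts - self.length:
            self.points.popleft()

    def reliability(self):
        """Reliability in percent, or None if the window is empty."""
        good = sum(p[1] for p in self.points)
        total = sum(p[2] for p in self.points)
        return 100.0 * good / total if total else None


t0 = datetime(2023, 7, 12, 8, 0)  # hypothetical timestamp
window = RollingWindow()
window.add(t0, good=80, total=100)                        # point with bad events
window.add(t0 + timedelta(days=27), good=100, total=100)
print(window.reliability())                               # both points counted
window.add(t0 + timedelta(days=28), good=100, total=100)
print(window.reliability())                               # first point expired
```

In this sketch both the good and the total counts change whenever a point enters or leaves the window, which is why the ratio can move even when no new bad events arrive.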

Underlying logic – example

Reliability and error budget can change even when no new bad events arrive, because a point that expires may have a different value than the point that replaces it. Consider this example:

  • Old point (expired): good=20

  • New point (incoming): good=50

  • The number of good events in the time window, excluding the two points mentioned above (old and new): good_count=200

  • The number of bad events in the time window: bad_count=40 (this number does not change because only good points arrive)

Let's calculate the reliability before the arrival of new points:

  • (200 + 20) / (200 + 20 + 40) = 220/260 = 84.6153846154%

When the new point arrives, good=20 expires, and we now take good=50 into account instead of good=20. The number of bad events does not change (still equal to 40). The reliability changes to the following:

  • (200 + 50) / (200 + 50 + 40) = 250/290 = 86.2068965517%
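The two calculations above can be reproduced with a short script. The variable names are chosen for this illustration; the numbers are the ones from the example.

```python
def reliability(good_count: int, bad_count: int) -> float:
    """Reliability in percent, with total_count = good_count + bad_count."""
    return 100.0 * good_count / (good_count + bad_count)


other_good = 200  # good events in the window, excluding the old and new points
bad = 40          # bad events in the window (unchanged)
old_point = 20    # good events in the point that expires
new_point = 50    # good events in the point that arrives

before = reliability(other_good + old_point, bad)  # 220 / 260
after = reliability(other_good + new_point, bad)   # 250 / 290

print(f"before: {before:.4f}%")
print(f"after:  {after:.4f}%")
```

Swapping `old_point` and `new_point` gives the opposite result: reliability drops from 250/290 to 220/260 with the same number of bad events.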

Summary

Our reliability increased even though the number of bad events in this time window did not change. It changed because the number of good events increased, so the proportion of good/total events also increased.

Conversely, if the old (expired) point contained more good events than the new (incoming) point, the reliability would decrease because the proportion of good/total events would decrease.

This behavior is a natural part of SLO calculations for all configurations with a dynamic number of events, such as the occurrences budgeting method. You can see similar behavior in calendar-aligned SLOs when a new calendar-aligned time window starts.