Reliability and error budget calculations

This guide answers questions about system reliability and error budgets. We'll explore the following:

How do reliability and error budgets recover after setbacks?
Can good incoming traffic reduce the impact of a recent incident?

Reliability and error budget

Before diving deeper into these areas, let’s take a look at the definitions of reliability and error budget. These terms are closely related.

Reliability

Reliability is a ratio of good to total events within the time window of your SLO.

For example, if your SLO's time window is 28 days, your service reliability is the ratio of all good incoming events from the last 28 days to all incoming events from the same period:

Reliability = (# good events)/(# total events)
within the SLO's time window

Considering that total = good + bad:

Reliability = (# good events) / (# good + bad events)
within the SLO's time window
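
The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a product API: the function name `reliability` and the choice to report 100% when no events have arrived yet are assumptions made for the example.

```python
def reliability(good: int, bad: int) -> float:
    """Ratio of good events to total events within the SLO's time window."""
    total = good + bad
    if total == 0:
        # Assumption for this sketch: with no events, nothing has gone wrong yet.
        return 1.0
    return good / total


# 9,500 good and 500 bad events in the window give 95% reliability.
print(f"{reliability(good=9_500, bad=500):.2%}")
```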

Error budget

Error budget is the share of events (or time) that may be bad without breaching your SLO's target, that is, 100% minus the target.

For instance, if we aim for 95% reliability, our error budget is 5% (100% - 95%):

A fragment of an SLO YAML definition with target set to 95%
  objectives:
    - displayName: Good response (200)
      value: 200
      name: ok
      target: 0.95
      rawMetric:
        query:
          generic:
            query: >-
              SINCE N9FROM UNTIL N9TO FROM a1: entities(aws:postgresql:123)
              FETCH a1.metrics("infra:database.cpu.utilization",
              "aws-cloudwatch"){timestamp, value} LIMITS
              metrics.granularityDuration(PT1M)
      op: lte
      primary: true
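
The relationship between the target and the error budget can be sketched as follows. The helper names are hypothetical, and the "remaining budget" formula (the unspent fraction of the allowed unreliability) is a common convention assumed here for illustration; the source text only defines the budget itself as 100% minus the target.

```python
def error_budget(target: float) -> float:
    """Allowed unreliability: 100% minus the SLO target."""
    return 1.0 - target


def budget_remaining(reliability: float, target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    Assumed convention: spent budget is the observed unreliability
    divided by the allowed unreliability.
    """
    allowed = 1.0 - target
    spent = 1.0 - reliability
    return 1.0 - spent / allowed


print(f"{error_budget(0.95):.0%}")            # budget for a 95% target
print(f"{budget_remaining(0.98, 0.95):.0%}")  # share of that budget still left
```

With a 95% target and 98% observed reliability, 2 of the 5 allowed percentage points are spent, so roughly 60% of the budget remains.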

Reliability and error budget recovery

As a rule of thumb, both reliability and the error budget recover as a bad event's impact on calculations decreases.

In particular, reliability and error budget recovery depends on the time window type and budgeting method combination:

Budgeting method	Calendar-aligned time windows	Rolling time windows
Time slices	Once the next time window starts	As soon as a bad minute rolls out of the window and a good minute rolls in
Occurrences	On every good event	As soon as a good event replaces a bad event, or when the frequency of good events increases

Why it happens

Calendar-aligned & Time slices
Only bad minutes impact calculations, since this budgeting method already considers all future (unknown) minutes as good. As a result, reliability and error budget can worsen or remain the same within the same time window:
• Reliability and error budget always restore to 100% once the next time window starts.
• Reliability and error budget never recover partially.

Reliability = (good + window length - (good + bad)) / window length
= (good + window length - total) / window length
= (good + future) / window length
within the active time window, counted in minutes
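
A minimal sketch of the calendar-aligned time slices formula, assuming a 28-day window and a hypothetical series of minute counts observed as the window progresses. It illustrates why reliability can only fall or stay flat: every good minute that arrives was already counted as good while it was still in the future.

```python
def time_slices_reliability(good: int, bad: int, window_minutes: int) -> float:
    """(good + future) / window length, with future minutes assumed good."""
    future = window_minutes - (good + bad)
    return (good + future) / window_minutes


WINDOW = 28 * 24 * 60  # minutes in a 28-day window

# Hypothetical (good, bad) minute counts at successive points in the window.
snapshots = [(0, 0), (600, 0), (600, 30), (5_000, 30), (5_000, 45), (20_000, 45)]
values = [time_slices_reliability(g, b, WINDOW) for g, b in snapshots]

for (g, b), r in zip(snapshots, values):
    print(f"good={g:>6} bad={b:>3} reliability={r:.4%}")
```

Adding good minutes between snapshots leaves the value unchanged, while each new bad minute lowers it until the next window starts.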

Calendar-aligned & Occurrences
Events accumulate over the fixed time window. As more good events happen, they improve the ratio of good to total events. As a result, reliability and error budget increase:
• Reliability and error budget always restore to 100% once the next time window starts.
• Reliability and error budget can recover partially.

Reliability = good / (good + bad) = good / total
within the active time window, counted in data points
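
Partial recovery within a calendar-aligned window can be sketched with hypothetical event counts. The numbers below are illustrative only; the point is that bad events stay in the denominator until the window resets, so good traffic dilutes them without removing them.

```python
def occurrences_reliability(good: int, bad: int) -> float:
    """good / total for the events accumulated so far in the window."""
    total = good + bad
    return good / total if total else 1.0


good, bad = 900, 100                      # state right after an incident
before = occurrences_reliability(good, bad)

good += 1_000                             # only good events arrive afterwards
after = occurrences_reliability(good, bad)

print(f"before: {before:.1%}, after: {after:.1%}")
```

Reliability climbs from 90% to 95% here, but it cannot reach 100% again while the 100 bad events remain in the same window.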

Rolling & Time slices / Occurrences + Constant or increasing event density
The time window continuously advances, adding new events or minutes and dropping the old ones on a first-in, first-out (FIFO) basis. When the window moves past a bad event, that event is no longer included in calculations. As good events or minutes replace bad ones, reliability and error budget recover.
Reliability and error budget can restore to 100% if:
• All events or minutes in a time window are good.
• This condition lasts long enough to compensate for previous depletion.

Reliability = good / (good + bad) = good / total
within the time window, counted in data points
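
The first-in, first-out behavior can be sketched with a fixed-size `collections.deque`. This is a simplification that models the window as a fixed number of data points (which matches time slices, or occurrences with constant event density); the event sequence is hypothetical.

```python
from collections import deque


def rolling_reliability(events, window_size):
    """Reliability after each new data point in a FIFO window of fixed size."""
    window = deque(maxlen=window_size)  # the oldest point is dropped when full
    history = []
    for is_good in events:
        window.append(is_good)
        history.append(sum(window) / len(window))
    return history


# Five good points, a two-point incident, then five good points again.
events = [True] * 5 + [False] * 2 + [True] * 5
history = rolling_reliability(events, window_size=5)
print(history)
```

Reliability stays depressed while the bad points are still inside the window and returns to 100% only once both have rolled out and been replaced by good ones.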

Rolling & Occurrences + Decreasing event density
As the time window advances and drops old (frequent) events, the relative impact of the remaining events increases. If good events are dropped, each remaining bad event counts more heavily towards the error budget. Even if new good events arrive, the error budget continues dropping because of the increased influence of the remaining bad events.
Reliability and error budget can restore if:
• All events in a time window are good.
• This condition lasts long enough to compensate for the impact of the remaining bad events.
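
The effect of decreasing event density can be sketched with timestamped events and a time-based window. All values are hypothetical: a busy period of 10 good events per time unit, a short incident, and then a quiet period with a single good event per time unit.

```python
def reliability_at(now, events, window):
    """good / total over events with a timestamp in (now - window, now]."""
    in_window = [ok for ts, ok in events if now - window < ts <= now]
    return sum(in_window) / len(in_window) if in_window else 1.0


WINDOW = 10  # window length in arbitrary time units

events = [(t, True) for t in range(1, 6) for _ in range(10)]  # busy: t = 1..5
events += [(6, False)] * 5                                    # incident at t = 6
events += [(t, True) for t in range(7, 16)]                   # quiet: t = 7..15

r10 = reliability_at(10, events, WINDOW)
r12 = reliability_at(12, events, WINDOW)
r15 = reliability_at(15, events, WINDOW)
r16 = reliability_at(16, events, WINDOW)

print(f"t=10: {r10:.1%}  t=12: {r12:.1%}  t=15: {r15:.1%}  t=16: {r16:.1%}")
```

Between t=10 and t=15 only good events arrive, yet reliability keeps falling because the window sheds ten old good events per step and gains just one. It recovers only at t=16, when the bad events themselves leave the window.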