Reliability and error budget calculations
This guide answers questions about system reliability and error budgets. We'll explore the following:
- How do reliability and error budgets recover after setbacks?
- Can good incoming traffic reduce the impact of a recent incident?
Reliability and error budget
Before diving deeper into these areas, let's take a look at the definitions of reliability and error budget. These terms are closely related.
Reliability
Reliability is a ratio of good to total events within the time window of your SLO.
For example, if your SLO's time window is 28 days, your service reliability is the ratio of all good incoming events from the last 28 days to all incoming events from the same period:
Reliability = (# good events)/(# total events)
within the SLO's time window
Considering that total = good + bad:
Reliability = (# good events) / (# good + bad events)
within the SLO's time window
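As a minimal sketch of the ratio above, the following Python function computes reliability from event counts. The function name and the handling of an empty window (treated as 100%) are assumptions made for illustration, not part of the product's documented behavior.

```python
def reliability(good: int, bad: int) -> float:
    """Ratio of good events to total events within the SLO's time window."""
    total = good + bad
    if total == 0:
        # Assumption: with no events at all, nothing has gone wrong yet.
        return 1.0
    return good / total


# 9,900 good and 100 bad events in the window
print(f"{reliability(good=9_900, bad=100):.2%}")  # prints 99.00%
```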
Error budget
The error budget is the margin between 100% reliability and your SLO's target, that is, the share of events that can be bad before the target is breached.
For instance, if we aim for 95% reliability (target: 0.95 in the objective below), our error budget is 5% (100% - 95%):
objectives:
  - displayName: Good response (200)
    value: 200
    name: ok
    target: 0.95
    rawMetric:
      query:
        generic:
          query: >-
            SINCE N9FROM UNTIL N9TO FROM a1: entities(aws:postgresql:123)
            FETCH a1.metrics("infra:database.cpu.utilization",
            "aws-cloudwatch"){timestamp, value} LIMITS
            metrics.granularityDuration(PT1M)
    op: lte
    primary: true
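The arithmetic behind the 5% figure can be sketched in a few lines of Python. Both function names are hypothetical, and the "allowed bad events" helper is only an illustration of what the budget means for a given event count. It is not a description of how the platform computes remaining budget.

```python
def error_budget(target: float) -> float:
    """Share of events allowed to be bad for a given SLO target (0.95 -> 0.05)."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    return 1.0 - target


def allowed_bad_events(target: float, total_events: int) -> float:
    """Illustrative only: bad events the budget tolerates out of total_events."""
    return error_budget(target) * total_events


print(f"{error_budget(0.95):.0%}")                # prints 5%
print(f"{allowed_bad_events(0.95, 10_000):.0f}")  # prints 500
```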
Reliability and error budget recovery
As a rule of thumb, both reliability and the error budget recover as a bad event's impact on calculations decreases.
In particular, reliability and error budget recovery depends on the time window type and budgeting method combination:
Budgeting method | Calendar-aligned time windows | Rolling time windows |
---|---|---|
Time slices | Once the next time window starts | As soon as a bad minute rolls out of the window and a good minute rolls into it |
Occurrences | On every good event | As soon as a good event replaces a bad event, or when the frequency of good events increases |
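The calendar-aligned column of the table can be illustrated with a short Python sketch. It assumes, as this guide describes for the Time slices method, that minutes not yet elapsed in the window count as good. The function names and the sample numbers are hypothetical.

```python
WINDOW_MINUTES = 28 * 24 * 60  # a 28-day calendar-aligned window


def time_slices_reliability(good_minutes: int, bad_minutes: int,
                            window_minutes: int = WINDOW_MINUTES) -> float:
    """(good + future) / window length, where future minutes count as good."""
    future = window_minutes - (good_minutes + bad_minutes)
    return (good_minutes + future) / window_minutes


def occurrences_reliability(good_events: int, bad_events: int) -> float:
    """good / (good + bad) over the events seen so far in the window."""
    return good_events / (good_events + bad_events)


# Time slices: 1,000 extra good minutes change nothing within the window.
before = time_slices_reliability(good_minutes=10_000, bad_minutes=60)
after = time_slices_reliability(good_minutes=11_000, bad_minutes=60)
print(before == after)

# Occurrences: 1,000 extra good events raise the ratio (partial recovery).
print(occurrences_reliability(9_000, 60) < occurrences_reliability(10_000, 60))
```

Under these assumptions, the Time slices value stays flat until the window resets, while the Occurrences value improves with every good event.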
Why it happens
- Calendar-aligned & Time slices
- Only bad minutes impact calculations since this budgeting method already considers all future (unknown) minutes as good. As a result, reliability and error budget can worsen or remain the same within the same time window.
  - Reliability and error budget always restore to 100% once the next time window starts.
  - Reliability and error budget never recover partially.
- Reliability = (# good minutes + window length - (# good + bad minutes)) / window length
  = (# good minutes + window length - # total minutes) / window length
  = (# good minutes + # future minutes) / window length
  With no bad minutes, this equals 100%.
- Calendar-aligned & Occurrences
- Events accumulate over the fixed time window. As more good events happen, they improve the ratio of good to total events. As a result, reliability and error budget increase.
  - Reliability and error budget always restore to 100% once the next time window starts.
  - Reliability and error budget can recover partially.
- Reliability = (# good events) / (# good + bad events) = (# good events) / (# total events)
  With no bad events, this equals 100%.
- Rolling & Time slices / Occurrences + Constant or increasing event density
- The time window continuously advances, adding new events or minutes and dropping the old ones on a first-in, first-out (FIFO) basis. When the window moves past a bad event, that event is no longer included in calculations. As good events or minutes replace bad ones, reliability and error budget recover.
Reliability and error budget can restore to 100% if:
  - All events or minutes in a time window are good.
  - This condition lasts long enough to compensate for previous depletion.
- Reliability = (# good events or minutes) / (# good + bad events or minutes) = (# good events or minutes) / (# total events or minutes)
  With no bad events or minutes in the window, this equals 100%.
- Rolling & Occurrences + Decreasing event density
- As the time window advances and drops old (frequent) events, the relative impact of remaining events increases.
If good events are dropped, each remaining bad event counts more heavily towards the error budget.
Even if new good events arrive, the error budget continues to drop because of the increased influence of the remaining bad events.
Reliability and error budget can restore if:
  - All events in a time window are good.
  - This condition lasts long enough to compensate for the impact of the remaining bad events.
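The decreasing-density case is the least intuitive one, so here is a small Python simulation of it. It is a sketch under stated assumptions: a 10-minute rolling window, the Occurrences method, events keyed by arrival minute, and an empty window treated as 100%. The `rolling_reliability` helper and the traffic pattern are invented for illustration.

```python
from typing import List, Tuple

Event = Tuple[int, bool]  # (minute the event arrived, whether it was good)


def rolling_reliability(events: List[Event], now: int, window: int) -> float:
    """Good/total ratio over events with now - window < minute <= now."""
    in_window = [good for minute, good in events if now - window < minute <= now]
    if not in_window:
        return 1.0  # assumption: an empty window counts as fully reliable
    return sum(in_window) / len(in_window)


events: List[Event] = []
for minute in range(0, 5):        # dense good traffic: 20 events per minute
    events += [(minute, True)] * 20
events += [(5, False)] * 5        # incident: 5 bad events in minute 5
for minute in range(6, 16):       # sparse good traffic: 1 event per minute
    events.append((minute, True))

for now in (5, 9, 12, 14, 15):
    print(now, f"{rolling_reliability(events, now, window=10):.2%}")
```

In this run, reliability is about 95% right after the incident, falls to about 64% by minute 14 even though only good events arrive (the dense good events are rolling out while the bad ones remain), and returns to 100% at minute 15, once the bad events leave the window.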