Increasing SLO sensitivity to incidents
This guide shows how to configure an SLO to make it more sensitive to incidents. First, let's look at how an SLO behaved during an incident.
Burn rate and error budget
The two most important signals an SLO provides are the error budget and the current burn rate. Before discussing appropriate alerting mechanisms, let's examine these signals during an incident.
When analyzing the SLO during an active incident, we can observe that approximately 5% of the error budget has been consumed. During this period, the burn rate has remained mostly below 1x.
A burn rate consistently under 1x throughout the entire time window indicates that the SLO is performing well and will not exhaust the entire error budget before the time window ends. For instance, an average burn rate of 0.75x over the entire time window implies that 25% of the budget will remain at the end of the period.
Another example: a burn rate of 3x over the entire time window means that we will end up with -200% of our budget at the end of the period. This is equivalent to achieving 97% reliability for an SLO with a target of 99%.
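The arithmetic behind these two examples can be written down as a short sketch. The function names below are illustrative only and are not part of any product API; the sketch assumes the burn rate is averaged over the whole time window.

```python
def remaining_budget(avg_burn_rate: float) -> float:
    """Fraction of the error budget left when the time window ends.

    A burn rate of 1x spends exactly the whole budget over the window,
    so the remainder is 1 minus the average burn rate.
    """
    return 1.0 - avg_burn_rate


def achieved_reliability(target: float, avg_burn_rate: float) -> float:
    """Reliability implied by burning at avg_burn_rate for the whole window."""
    error_budget = 1.0 - target  # e.g. 1% for a 99% target
    return 1.0 - avg_burn_rate * error_budget


print(f"{remaining_budget(0.75):.0%}")            # 25%
print(f"{remaining_budget(3.0):.0%}")             # -200%
print(f"{achieved_reliability(0.99, 3.0):.2%}")   # 97.00%
```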
Play around with the chart's slider to see how the burn rate impacts your error budget.
SLI configuration
Considering this, regardless of the incident duration, you were never at risk of burning through the budget because the burn rate stayed below 1x for most of the time. By understanding and validating the input signals, we can adjust this SLO to be more responsive to incidents.
Let's break down the SLI:
Query for good events:
SELECT count(*) FROM requests WHERE statusCode <> 500;
Query for total events:
SELECT count(*) FROM requests;
We define a good event as any instance where statusCode != 500. This approach seems sound in theory. However, in this case, even if some 500 errors occur during the incident, there are significantly more responses with different status codes. For example:
At 19:33:30, we had 6 bad events and 142 good events.
At 12:29, we had 1 bad event and 112 good events.
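To see how strongly the good events dilute the bad ones, the share of bad events in each sample can be computed directly. This is a minimal sketch; `bad_event_share` is a hypothetical helper name, and the only inputs are the two counts quoted above.

```python
def bad_event_share(good: int, bad: int) -> float:
    """Share of bad events among all events in a sample."""
    total = good + bad
    return bad / total if total else 0.0


# (good, bad) counts taken from the two samples above.
samples = {"19:33:30": (142, 6), "12:29": (112, 1)}

for timestamp, (good, bad) in samples.items():
    print(f"{timestamp}: {bad_event_share(good, bad):.2%} bad events")
# 19:33:30: 4.05% bad events
# 12:29: 0.88% bad events
```

Even in the middle of the incident, bad events make up only a few percent of the traffic, which is why an SLO that counts individual events burns its budget slowly.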
Configuring budgeting method
Both of the above cases occurred during the incident. To make the SLO more sensitive to incidents, increase its capability to burn the budget during every minute of an incident. There are several ways to achieve this, but based on how this SLI behaves during an incident, we recommend applying the time slices method instead of the occurrences method.
The time slices method divides the SLO time window into good minutes and bad minutes. For example, any minute with a status code 500 event can be marked as a bad minute. This can be achieved by setting a 100% time slice allowance. You can learn more about this method in our documentation.
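The classification described above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the `Minute` type, the function names, and the per-minute counts are hypothetical, and the sketch assumes that a minute is good only when its share of good events reaches the allowance, with empty minutes treated as good.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    good: int
    bad: int


def is_good_minute(minute: Minute, allowance: float = 1.0) -> bool:
    """A minute is good when its share of good events meets the allowance.

    With allowance=1.0 (100%), a single bad event makes the minute bad.
    """
    total = minute.good + minute.bad
    if total == 0:
        return True  # assumption: minutes without traffic are not bad
    return minute.good / total >= allowance


def bad_minute_share(minutes: list[Minute], allowance: float = 1.0) -> float:
    """Share of bad minutes, the quantity that burns a time slices budget."""
    if not minutes:
        return 0.0
    bad = sum(1 for m in minutes if not is_good_minute(m, allowance))
    return bad / len(minutes)


# Hypothetical per-minute counts: two incident minutes, two clean minutes.
minutes = [Minute(142, 6), Minute(112, 1), Minute(120, 0), Minute(130, 0)]
print(f"{bad_minute_share(minutes):.0%} of minutes are bad")  # 50% of minutes are bad
```

In this made-up data, only 7 of 511 events are bad, yet half of the minutes count as bad, so every incident minute consumes budget regardless of how much healthy traffic surrounds the errors.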