Increasing SLO sensitivity to incidents

This guide shows how to configure an SLO to make it more sensitive to incidents. First, let’s take a look at how an SLO behaved in the course of an incident:

Image 1: An overview of SLO behavior during an incident

Burn rate and error budget

The two most important signals that SLO outputs provide are the error budget and the current burn rate. Before discussing appropriate alerting mechanisms, let’s examine these signals during an incident.

When analyzing the SLO during an active incident, we can observe that approximately 5% of the error budget has been consumed. During this period, the burn rate has remained mostly below 1x.

A burn rate consistently under 1x throughout the entire time window indicates that the SLO is performing well and will not exhaust the entire error budget before the time window ends. For instance, an average burn rate of 0.75x over the entire time window implies that 25% of the budget will remain at the end of the period.

Burn rate: 1x

Another example: a burn rate of 3x over the entire time window means that we will end up with -200% of our budget at the end of the period. This is equivalent to achieving 97% reliability for an SLO with a target of 99%.

Burn rate: 3x

Tip: Play around with the chart's slider to see how the burn rate impacts your error budget.
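
The arithmetic behind the two examples above can be checked with a minimal sketch. It assumes a burn rate that stays constant over the whole time window, and the function names are hypothetical illustrations, not part of any product API.

```python
def remaining_budget(burn_rate: float) -> float:
    """Fraction of the error budget left at the end of the time window,
    assuming the burn rate is constant over the whole window."""
    return 1.0 - burn_rate


def achieved_reliability(target: float, burn_rate: float) -> float:
    """Reliability reached when the budget (1 - target) is consumed
    burn_rate times over the window."""
    return 1.0 - burn_rate * (1.0 - target)


# 0.75x and 3x are the burn rates discussed above; 99% is the example target.
for rate in (0.75, 1.0, 3.0):
    left = remaining_budget(rate)
    reliability = achieved_reliability(0.99, rate)
    print(f"{rate}x: remaining budget {left:.0%}, reliability {reliability:.2%}")
```

A burn rate of exactly 1x is the break-even point: the budget reaches zero just as the time window ends.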

SLI configuration

Given these signals, the SLO was never at risk of burning through its budget, regardless of how long the incident lasted, because the burn rate stayed below 1x most of the time. By understanding and validating the input signals, we can adjust this SLO to be more responsive to incidents.

Let’s break down the SLI:

Query for good events:

SELECT count(*) FROM requests WHERE statusCode <> 500;

Query for total events:

SELECT count(*) FROM requests;

We define a good event as any instance where statusCode != 500. This approach seems sound in theory. However, in this case, even though some 500 errors occur during the incident, there are far more responses with other status codes, so the ratio of good events to total events barely drops. For example:

At 19:33:30 we had 6 bad events and 142 good events:

Image 4: Good event in SLI metric

At 12:29 we had 1 bad event and 112 good events:

Image 5: Good event in SLI metric
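
To see why these data points barely move the SLO, here is a small sketch that computes the good-to-total ratio for the two samples above, in the way a ratio-based SLI counts them. The helper name `good_ratio` is hypothetical and used only for illustration.

```python
def good_ratio(good: int, bad: int) -> float:
    """Share of good events among all events in a sample."""
    total = good + bad
    return good / total


# (good events, bad events) taken from the two data points above.
samples = {"19:33:30": (142, 6), "12:29": (112, 1)}

for timestamp, (good, bad) in samples.items():
    print(f"{timestamp}: {good_ratio(good, bad):.1%} good events")
```

Even in the worse of the two samples, bad events are a small share of the total, so each sample consumes only a proportionally small part of the budget.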

Configuring budgeting method

Both of the above cases occurred during the incident. To make the SLO more sensitive to incidents, increase its capability to burn the budget during every minute of an incident. There are several ways to achieve this, but based on how this SLI behaves during an incident, we recommend applying the time slices method instead of the occurrences method.

The time slices method divides the SLO time window into good minutes and bad minutes. For example, any minute that contains at least one event with status code 500 can be marked as a bad minute. This can be achieved by setting a 100% time slice allowance. You can learn more about this method in our documentation.

Image 6: Time slice configuration
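
The effect of a 100% time slice allowance can be sketched as follows. This is a simplified illustration, not the product's implementation: it assumes that each tuple holds the event counts of one minute, that a minute without traffic counts as good, and it adds one hypothetical error-free minute for contrast.

```python
def is_good_minute(good: int, total: int, allowance: float = 1.0) -> bool:
    """A minute is good when its share of good events meets the allowance.

    With allowance=1.0 (100%), a single bad event makes the minute bad.
    """
    if total == 0:
        return True  # assumption: a minute with no traffic counts as good
    return good / total >= allowance


# (good events, total events) per minute. The first two reuse the counts
# from the incident samples; the third is a hypothetical error-free minute.
minutes = [(142, 148), (112, 113), (120, 120)]

bad_minutes = sum(not is_good_minute(good, total) for good, total in minutes)
print(f"{bad_minutes} of {len(minutes)} minutes are bad")
```

Under the occurrences method, the same data is still mostly good events, while under time slices every minute touched by a 500 error is counted as bad in full, which is what makes the SLO react faster to an incident.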

Useful resources

For a more in-depth look, consult additional resources: