Fast and slow burn
The burn rate measures how fast you use up your error budget. When your burn rate goes above 1, your service won't meet its SLO in the future if the current error rate continues.
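To make the definition concrete, here is a minimal Python sketch. It assumes the commonly used formulation in which the burn rate is the observed error rate divided by the error rate the SLO allows, and a hypothetical 28-day time window; the function names are illustrative and not part of Nobl9.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate (1 - SLO target)."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_rate / allowed


def hours_to_exhaustion(rate: float, time_window_hours: float) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return time_window_hours / rate


# Assumed example: 99.9% target, 0.5% of requests currently failing.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
# Assumed 28-day time window, expressed in hours.
remaining = hours_to_exhaustion(rate, time_window_hours=28 * 24)
print(f"burn rate: {rate:.2f}, full budget gone in about {remaining:.1f} h")
```

A burn rate of exactly 1 spends the whole budget precisely at the end of the time window, which is why any sustained value above 1 means the SLO will be missed.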
However, the burn rate can change drastically several times over a given period, which makes it tricky to tell whether you're going to exhaust your budget. That's why we often look at the overall exhaustion characteristics and distinguish between fast and slow burn.
Nobl9 allows you to configure alert policies based on burn rate characteristics using the Entire / Remaining Error budget would be exhausted and The average error budget burn rate conditions.
The following guide provides an overview of slow and fast burn rate conditions.
Fast burn
Fast burn alert conditions can detect short but significant spikes in burn rate over a brief timeframe (usually 30m or less). Use this condition to react quickly to momentary outages or issues with your services that require immediate attention.
Here's an example of a fast burn configuration for the Average Burn Rate is condition:
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: fast-burn
  project: default
spec:
  alertMethods: []
  conditions:
    - alertingWindow: 5m
      measurement: averageBurnRate
      value: 5
      op: gte
  coolDown: 5m
  description: "Fast Burn Policy that triggers when the average burn rate is greater than or equal to 5x for at least 5 minutes"
  severity: Medium
Fast burn chart
The first line in the chart below represents a budget with several burn rate spikes (one medium, one small, one big). A fast burn policy can be used here to alert on some of those spikes. In the example below, the alert threshold is set to ~5, which is enough for the big spike to trigger an alert, but not the medium or small spikes.
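The chart's behavior can be approximated with a short Python sketch. It assumes the condition compares the trailing average of per-minute burn rates over the alerting window against the threshold; the series values and the helper name are made up for illustration and do not come from Nobl9.

```python
from collections import deque


def alerting_minutes(burn_rates, window, threshold):
    """Return the minutes at which the trailing-window average is >= threshold."""
    recent = deque(maxlen=window)  # holds only the last `window` samples
    fired = []
    for minute, rate in enumerate(burn_rates):
        recent.append(rate)
        if len(recent) == window and sum(recent) / window >= threshold:
            fired.append(minute)
    return fired


# Hypothetical per-minute burn rates: a medium, a small, and a big spike,
# each lasting 5 minutes, separated by quiet periods.
series = (
    [0.5] * 10 + [4.0] * 5      # medium spike at minutes 10-14
    + [0.5] * 10 + [2.0] * 5    # small spike at minutes 25-29
    + [0.5] * 10 + [8.0] * 5    # big spike at minutes 40-44
    + [0.5] * 10
)

fired = alerting_minutes(series, window=5, threshold=5)
print("minutes where the fast burn condition holds:", fired)
```

With these invented numbers, only the big spike pushes the 5-minute average to 5 or above; the medium spike peaks at an average of 4 and stays silent.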
Slow burn
Slow burn conditions can be used to detect a gradual budget burn that occurs over a prolonged timeframe, usually exceeding 30 minutes.
This approach is handy in detecting problems that do not require immediate attention but must be addressed in due time.
As a rule, the threshold for slow burn should be smaller than that for fast burn conditions.
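One way to see why a smaller threshold still matters for slow burn is to estimate how much of the budget each condition represents. The sketch below assumes uniform traffic and a hypothetical 28-day time window, so the consumed share is roughly the average burn rate multiplied by the alerting window and divided by the time window; the function name is illustrative.

```python
def budget_consumed(avg_burn_rate: float, alerting_window_minutes: float,
                    slo_window_days: float = 28) -> float:
    """Approximate fraction of the total error budget spent during the window."""
    slo_window_minutes = slo_window_days * 24 * 60
    return avg_burn_rate * alerting_window_minutes / slo_window_minutes


# Thresholds and windows taken from the fast and slow burn examples in this guide.
fast = budget_consumed(avg_burn_rate=5, alerting_window_minutes=5)
slow = budget_consumed(avg_burn_rate=2, alerting_window_minutes=30)

print(f"fast burn (5x over 5m):  {fast:.4%} of the budget")
print(f"slow burn (2x over 30m): {slow:.4%} of the budget")
```

Under these assumptions, the slow burn condition corresponds to more budget spent than the fast burn condition, even though its threshold is lower, because the burn is sustained for longer.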
Here's an example of a slow burn configuration for the Average Burn Rate is condition:
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: slow-burn
  project: default
spec:
  alertMethods: []
  conditions:
    - alertingWindow: 30m
      measurement: averageBurnRate
      value: 2
      op: gte
  coolDown: 5m
  description: "Slow Burn Policy that triggers when the average burn rate over last 30m is greater than or equal to 2x"
  severity: Medium
Slow burn chart
The first line in the chart below represents a budget with several burn rate spikes (one medium, one small, one big). A slow burn policy can be used here to alert when the burn over a more extended period (including all those spikes) is significant enough.
The main difference between slow and fast burn is that a single event can trigger a fast burn, whereas a slow burn typically has to persist for longer before it kicks in.
We recommend aiming for a longer lastsFor value with a slow burn condition to prevent it from being triggered too easily.
Multi-window multi-burn
The combination of the two conditions above gives us multi-window, multi-burn.
Keep in mind that you can only define such a condition using the alerting window (alertingWindow) parameter.
Use such a configuration when you want to be alerted only if the steady burn over a long period is significant for your SLO and the budget is still burning right now (that is, the fast burn part of the policy detects a momentary spike).
An example use case for multi-window, multi-burn is an ongoing outage that requires attention when you've already been burning budget for a longer period.
Multi-window, multi-burn conditions prevent alerts when the slow burn over a long period is significant but your SLO is already recovering its budget.
- apiVersion: n9/v1alpha
  kind: AlertPolicy
  metadata:
    name: fast-burn
    project: default
  spec:
    alertMethods: []
    conditions:
      - alertingWindow: 15m
        measurement: averageBurnRate
        op: gte
        value: 5
      - alertingWindow: 6h
        measurement: averageBurnRate
        op: gte
        value: 2
    coolDown: 5m
    description: "Multi-window, multi-burn policy that triggers when your service requires attention and prevents alerting when you're currently recovering budget"
    severity: Medium
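The logic of this policy can be sketched in Python as follows. The sketch assumes that both conditions must hold at the same time and that each one compares a trailing average of per-minute burn rates with its threshold; the helper names and the sample series are hypothetical.

```python
def trailing_average(values, window):
    """Average of the last `window` samples, or None if there is not enough data."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def multi_window_alert(burn_rates, short_window=15, short_threshold=5,
                       long_window=360, long_threshold=2):
    """True only when both the 15m and the 6h averages meet their thresholds."""
    short_avg = trailing_average(burn_rates, short_window)
    long_avg = trailing_average(burn_rates, long_window)
    if short_avg is None or long_avg is None:
        return False
    return short_avg >= short_threshold and long_avg >= long_threshold


# Hypothetical 6-hour histories of per-minute burn rates.
ongoing_outage = [2.5] * 345 + [6.0] * 15   # long steady burn, still spiking now
recovering = [3.0] * 345 + [0.2] * 15       # long burn was high, now recovering
brief_spike = [0.5] * 345 + [6.0] * 15      # spike now, but no long-term burn

for name, series in [("ongoing outage", ongoing_outage),
                     ("recovering", recovering),
                     ("brief spike", brief_spike)]:
    print(name, "->", multi_window_alert(series))
```

Only the ongoing outage satisfies both windows. The recovering series would satisfy a 6h slow burn condition on its own, which is the alert the combined policy is meant to suppress.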