Fast and slow burn
The burn rate measures how fast you use up your error budget. When your burn rate goes above 1, your service won't meet its SLO if the current error rate continues.
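As a rough illustration of the definition above, the burn rate can be computed as the observed error rate divided by the error rate the SLO allows. The sketch below assumes a simple ratio-based SLO; the function name `burn_rate` and its parameters are hypothetical and are not part of any Nobl9 API.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget runs out exactly at the end of the
    SLO time window; values above 1.0 mean it runs out earlier.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target  # the error budget as a fraction
    return error_rate / allowed_error_rate


# 50 failed requests out of 10,000 against a 99.9% target:
# the observed error rate (0.5%) is five times the allowed one (0.1%).
print(round(burn_rate(50, 10_000, 0.999), 2))  # prints 5.0
```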
However, the burn rate can change drastically several times over a given period, which makes it tricky to establish whether you're going to exhaust your budget. That's why we often look at the overall exhaustion characteristics and distinguish between fast and slow burn. The following guide provides an overview of those use cases.
Nobl9 allows you to configure alert policies based on burn rate characteristics, using the Entire / Remaining Error budget would be exhausted and The average error budget burn rate conditions.
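To make the link between burn rate and budget exhaustion concrete, here is a minimal sketch of how time to exhaustion can be estimated. It assumes the burn rate stays constant and that, at a burn rate of 1, the whole budget lasts exactly one SLO time window. The name `time_to_exhaustion` is hypothetical, and this is not necessarily how Nobl9 evaluates its exhaustion conditions.

```python
from datetime import timedelta
from typing import Optional


def time_to_exhaustion(
    remaining_budget: float, burn_rate: float, slo_window: timedelta
) -> Optional[timedelta]:
    """Estimate when the error budget runs out at a constant burn rate.

    remaining_budget is a fraction of the full budget (1.0 = untouched).
    Returns None when nothing is being burned.
    """
    if burn_rate <= 0:
        return None
    seconds = slo_window.total_seconds() * remaining_budget / burn_rate
    return timedelta(seconds=seconds)


# A full budget on a 28-day window, burning at 5x, lasts 28 / 5 = 5.6 days.
print(time_to_exhaustion(1.0, 5.0, timedelta(days=28)))
```

The higher the burn rate, the sooner the budget is gone, which is why short, intense spikes and long, mild burns call for different alerting conditions.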
Fast burn alerting conditions detect short but significant spikes in burn rate over a brief period (usually 30 minutes or less). Use them to react quickly to momentary outages or other service issues that require immediate attention.
Here's an example of a fast burn configuration for the Average Burn Rate is condition:
- lastsFor: 5m
description: "Fast Burn Policy that triggers when the average burn rate is greater than or equal to 5x for at least 5 minutes"
Fast Burn chart
The first line in the chart below represents a budget with several burn rate spikes (one medium, one small, one big). A fast burn policy can be used here to alert on some of those spikes. In the example below, the alert threshold is set to ~5, which is enough for the big spike to trigger an alert, but not the medium or small spikes.
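The behavior described for the chart can be sketched in a few lines. This is a simplified model, not Nobl9's actual evaluation logic: it assumes one burn rate sample per minute and fires once the value has stayed at or above the threshold for `lasts_for` consecutive samples. The function name and the sample series are made up for illustration.

```python
from typing import Optional, Sequence


def first_sustained_breach(
    samples: Sequence[float], threshold: float, lasts_for: int
) -> Optional[int]:
    """Return the index at which the threshold has been met for
    lasts_for consecutive samples, or None if that never happens."""
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value >= threshold else 0
        if run >= lasts_for:
            return i
    return None


# Hypothetical per-minute burn rates: a medium, a small, and a big spike.
series = (
    [0.5] * 10 + [3.0] * 6      # medium spike, minutes 10-15
    + [0.5] * 10 + [2.0] * 3    # small spike, minutes 26-28
    + [0.5] * 10 + [8.0] * 7    # big spike, minutes 39-45
    + [0.5] * 10
)

# Threshold 5 held for 5 minutes: only the big spike qualifies.
print(first_sustained_breach(series, threshold=5.0, lasts_for=5))  # prints 43
```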
Slow burn conditions can be used to detect a gradual budget burn that occurs over a prolonged timeframe, usually exceeding 30 minutes.
This approach is handy in detecting problems that do not require immediate attention but must be addressed in due time.
As a rule, the threshold for slow burn should be smaller than that for fast burn conditions.
Here's an example of a slow burn configuration for the Average Burn Rate is condition:
- lastsFor: 30m
description: "Slow Burn Policy that triggers when the average burn rate over last 30m is greater than or equal to 2x"
Slow Burn chart
The first line in the chart below represents the budget that has several burn rate spikes (one medium, one small, one big). A slow burn policy can be used here to alert when the burn over a more extended period (including all those spikes) is significant enough for Nobl9 to trigger an alert.
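A slow burn check can be modeled as an average over a longer trailing window. The sketch below is again a simplification under stated assumptions (one sample per minute, a plain arithmetic mean, hypothetical names and data) rather than Nobl9's implementation.

```python
from typing import Optional, Sequence


def first_window_breach(
    samples: Sequence[float], threshold: float, window: int
) -> Optional[int]:
    """Return the first index at which the mean of the trailing `window`
    samples reaches the threshold, or None if it never does."""
    for i in range(window - 1, len(samples)):
        trailing = samples[i - window + 1 : i + 1]
        if sum(trailing) / window >= threshold:
            return i
    return None


# Hypothetical per-minute burn rates: a medium, a small, and a big spike.
series = (
    [0.5] * 10 + [4.0] * 8     # medium spike, minutes 10-17
    + [0.5] * 4 + [3.0] * 4    # small spike, minutes 22-25
    + [0.5] * 4 + [8.0] * 4    # big spike, minutes 30-33
    + [0.5] * 10
)

# Average of at least 2x over the last 30 minutes: the spikes add up
# and the condition is first met when the big spike begins.
print(first_window_breach(series, threshold=2.0, window=30))  # prints 30
```

In this series no spike lasts five minutes at 5x or more, so a fast burn condition like the one shown earlier would stay silent, while the accumulated burn still crosses the lower slow burn threshold.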
The main difference between slow and fast burn is that a single short event can trigger a fast burn alert, whereas a slow burn alert typically requires the burn to persist over a longer period before it kicks in.
We recommend aiming for a higher lastsFor threshold with a slow burn condition to prevent it from being triggered too easily.