Alerting—general use case
This guide explores the practical implementation of Nobl9 alerting mechanisms based on service level objectives. We'll review a specific configuration of an Alert Policy, see its lifecycle, and focus on how firing alerts are tied to the cooldown period.
Assumptions
Let's assume you've attached an alert policy to an SLO with the following condition:
The average error budget burn rate is greater than or equal to 3, this condition lasts for 10 minutes, and the cooldown period is set to 15 minutes.
Here's the YAML configuration for this use case:
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: trigger-alert-immediately
  displayName: # string, optional
  project: # string, optional - if not defined, Nobl9 returns a default value for this field
spec:
  description: # string, optional
  severity: Medium
  cooldown: 15m
  conditions:
    - measurement: averageBurnRate
      value: 3
      op: gte
      lastsFor: 10m
  alertMethod:
    - name: discord-notification
      project: # string, optional - if not defined, Nobl9 returns a default value for this field
Overview of the alert policy's lifecycle
10:21AM
A single HTTP request with higher latency causes the SLO to burn its error budget.
A spike with a 5x burn rate appears on the burn rate graph.
Fortunately, it is resolved almost immediately.
At 10:23, the burn rate is back to 0x.
No alert is triggered because the alert policy requires the burn rate condition to last at least 10 minutes.
11:15AM
Our SLO's error budget starts to burn again.
Because of higher traffic, more HTTP requests impact the error budget.
The budget begins to burn at a rate of 3.7x.
At 11:25, an alert is triggered because the alert condition is satisfied.
11:29AM
Our SLO's error budget stops burning.
The burn rate is 0x.
When the alert condition stops being satisfied, the cooldown period starts to be measured.
During the cooldown period, no new alerts are triggered.
If the burn rate stays at 0x for another 15 minutes, the alert will be resolved, and a new alert can be triggered.
11:33AM
The burn rate spikes again (5x).
The spike lasts until 11:35, and then the burn rate returns to 0x.
The alert policy condition is satisfied again, so the cooldown counter stops.
No new alert is triggered because the previous alert has not been resolved yet (the cooldown period was reset).
At 11:35, the cooldown period starts to be measured again.
11:50AM
The burn rate is still 0x, and the cooldown period has elapsed.
The alert is resolved.
11:58AM
The burn rate is 5x, lasting for the next 5 minutes.
A new alert is triggered.
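The timeline from 10:21 through the 11:50 resolution can be replayed with a small simulation. This is a minimal sketch under stated assumptions, not Nobl9's implementation: it evaluates the policy once per minute, treats the burn rate as a single reading per minute, and uses hypothetical names (`hhmm`, `burn_rate_at`, `simulate`).

```python
from typing import Callable, List, Tuple


def hhmm(text: str) -> int:
    """Convert "HH:MM" to minutes since midnight."""
    hours, minutes = map(int, text.split(":"))
    return hours * 60 + minutes


# Burn-rate periods from the walkthrough: (start, end exclusive, burn rate).
SPIKES = [
    (hhmm("10:21"), hhmm("10:23"), 5.0),
    (hhmm("11:15"), hhmm("11:29"), 3.7),
    (hhmm("11:33"), hhmm("11:35"), 5.0),
]


def burn_rate_at(minute: int) -> float:
    """Return the assumed burn rate for a given minute (0x outside spikes)."""
    for start, end, rate in SPIKES:
        if start <= minute < end:
            return rate
    return 0.0


def simulate(
    rate_at: Callable[[int], float],
    start: int,
    end: int,
    threshold: float = 3.0,  # value: 3, op: gte
    lasts_for: int = 10,     # lastsFor: 10m
    cooldown: int = 15,      # cooldown: 15m
) -> List[Tuple[int, str]]:
    """Simplified model of the trigger / cooldown / resolve cycle."""
    events: List[Tuple[int, str]] = []
    satisfied_since = None  # minute the condition became continuously true
    quiet_since = None      # minute the cooldown started being measured
    alert_active = False
    for minute in range(start, end + 1):
        if rate_at(minute) >= threshold:
            if satisfied_since is None:
                satisfied_since = minute
            quiet_since = None  # any satisfied minute resets the cooldown
            if not alert_active and minute - satisfied_since >= lasts_for:
                alert_active = True
                events.append((minute, "triggered"))
        else:
            satisfied_since = None
            if alert_active:
                if quiet_since is None:
                    quiet_since = minute
                if minute - quiet_since >= cooldown:
                    alert_active = False
                    quiet_since = None
                    events.append((minute, "resolved"))
    return events


events = simulate(burn_rate_at, hhmm("10:00"), hhmm("11:55"))
for minute, kind in events:
    print(f"{minute // 60:02d}:{minute % 60:02d} {kind}")
# prints:
# 11:25 triggered
# 11:50 resolved
```

In this model the 10:21 spike never reaches `lasts_for`, and the 11:33 spike only clears `quiet_since`, which is why the resolution moves from 11:44 to 11:50.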
The diagram below illustrates the lifecycle of the alert policy described above:
Key takeaways
New alerts are not triggered during the cooldown period. The cooldown period is reset when an alert condition is satisfied, even briefly.
If all alert conditions become satisfied again during the cooldown period, the cooldown period is reset. It starts being measured again as soon as any of the conditions stops being satisfied.
All conditions must be met for an alert policy with multiple conditions to trigger alerts. However, for the cooldown period to start being measured, at least one condition must no longer be fulfilled.
Avoid adding alert policies with conditions that are always satisfied, for example, Average burn rate >= 0. This condition is satisfied both when the budget is burning and when it is not, so the cooldown period never starts being measured.
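The last two takeaways can be expressed as simple checks. The sketch below is illustrative only: `Condition`, `can_trigger`, `cooldown_can_start`, and `is_always_satisfied` are hypothetical helpers, the second measurement name `burnedBudget` and its value are assumptions added for the example, and the always-satisfied check assumes a measurement that is never negative.

```python
import operator
from typing import Dict, List, NamedTuple


class Condition(NamedTuple):
    measurement: str
    op: str
    value: float


# Only the "gte" operator appears in the configuration above.
OPS = {"gte": operator.ge}


def is_met(cond: Condition, readings: Dict[str, float]) -> bool:
    """Check one condition against the current readings."""
    return OPS[cond.op](readings[cond.measurement], cond.value)


def can_trigger(conds: List[Condition], readings: Dict[str, float]) -> bool:
    """An alert can be triggered only when every condition is met."""
    return all(is_met(c, readings) for c in conds)


def cooldown_can_start(conds: List[Condition], readings: Dict[str, float]) -> bool:
    """The cooldown is measured once at least one condition is no longer met."""
    return any(not is_met(c, readings) for c in conds)


def is_always_satisfied(cond: Condition) -> bool:
    """Flag conditions such as 'averageBurnRate >= 0' (assumes values >= 0)."""
    return cond.op == "gte" and cond.value <= 0


conditions = [
    Condition("averageBurnRate", "gte", 3),
    Condition("burnedBudget", "gte", 0.5),  # assumed second condition
]
print(can_trigger(conditions, {"averageBurnRate": 3.7, "burnedBudget": 0.6}))
print(cooldown_can_start(conditions, {"averageBurnRate": 0.0, "burnedBudget": 0.6}))
print(is_always_satisfied(Condition("averageBurnRate", "gte", 0)))
```

A check like `is_always_satisfied` could run as a lint step before applying a policy, since such a condition would keep the cooldown from ever being measured.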