Understand alert conditions
Alert conditions are rules that determine when Nobl9 triggers an alert. Each rule corresponds to a specific condition type, defined as a function of time, burn rate, or remaining error budget. You can create up to three conditions for each alert policy, and they form the basis of error budget monitoring and effective incident response. Nobl9 offers the following condition types:
- Remaining error budget would be exhausted in the near or distant future. The exhaustion time prediction becomes more sensitive as your remaining budget decreases: once your SLO has no error budget left, even the slightest burn triggers an alert.
- Entire error budget would be exhausted in the near or distant future. This prediction is based on the allocation of your entire error budget and depends only on the current burn rate. Use it to define alerts based on time rather than the burn rate function, and to keep the remaining budget value from influencing the prediction.
- Average error budget burn rate is greater than or equal to the threshold and lasts for a defined period. This condition helps catch burn rate spikes regardless of how much budget has already been burned.
- Remaining error budget is below the threshold. This is the most straightforward configuration: it alerts you when the error budget drops to a specific level, regardless of how quickly or slowly you get there.
- Budget drop measures the percentage decrease in the error budget, as shown on the Remaining Error Budget chart. It can be used as an alternative to the average burn rate condition.
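In the sloctl YAML representation, each condition type corresponds to a measurement in the conditions section of an AlertPolicy. The snippet below is a minimal sketch of that mapping, assuming the standard AlertPolicy format; the operators and values are illustrative, and a real policy combines at most three conditions:

```yaml
# Sketch: how the condition types above map to AlertPolicy conditions.
# Shown together for illustration only; verify measurement names and
# operators against your Nobl9 version (budget drop is omitted here).
conditions:
  - measurement: timeToBurnBudget        # remaining error budget would be exhausted
    op: lt
    value: 4h
  - measurement: timeToBurnEntireBudget  # entire error budget would be exhausted
    op: lt
    value: 24h
  - measurement: averageBurnRate         # average burn rate exceeds the threshold
    op: gte
    value: 2
  - measurement: burnedBudget            # remaining error budget is below the threshold
    op: gte
    value: 0.8                           # i.e., less than 20% of the budget remains
```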
Prediction-based conditions
Let's walk through prediction-based conditions using Remaining error budget would be exhausted as an example.
An SRE team uses the Pod readiness transitions SLO to monitor pod state changes, since frequent transitions between ready and not ready states indicate potential instability. The team configured the Remaining error budget would be exhausted alerting policy to get notified when the SLO is about to run out of error budget, so they can take preemptive action. They set the cooldown period to 15 minutes to ensure that transient stabilization events, like pods recovering post-deployment, don't continuously trigger new alerts.
The alerting policy is configured as follows:
- Alerting window: 10 minutes
- Alert condition: Remaining error budget would be exhausted in 4 hours
- Cooldown period: 15 minutes
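A minimal sketch of this policy in sloctl-style YAML, assuming the standard AlertPolicy format (the name, project, and description are hypothetical; verify field names and values against your Nobl9 version):

```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: pod-readiness-budget-exhaustion   # hypothetical name
  project: default                        # hypothetical project
spec:
  description: Remaining error budget would be exhausted within 4 hours
  severity: High
  coolDown: 15m                           # cooldown period
  conditions:
    - measurement: timeToBurnBudget       # remaining error budget would be exhausted
      op: lt
      value: 4h
      alertingWindow: 10m
  alertMethods: []                        # attach an alert method, e.g., Slack or PagerDuty
```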
The cooldown period ensures that if a similar metric spike momentarily reappears, for example because rollback propagation is delayed, another alert isn't triggered unless the instability persists for an extended period.
After 15 minutes of stable metrics, the alert automatically resolves, giving the team confidence that the service is back to normal. No repeated alerts are triggered unless conditions worsen significantly after the cooldown.
Threshold-based conditions
Let's walk through threshold-based conditions using Average error budget burn rate exceeds the threshold as an example.
An SRE team uses the API server response latency SLO to measure the ratio of good responses (latency under 200 ms) to total responses. They configure the Average error budget burn rate alerting policy to detect sharp spikes in latency.
The alerting policy is configured as follows:
- Alerting window: 5 minutes
- Alert condition: Burn rate exceeds 20x
- Cooldown period: 10 minutes
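A corresponding sketch in the same sloctl-style YAML, again assuming the standard AlertPolicy format with a hypothetical name and project:

```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: api-latency-fast-burn             # hypothetical name
  project: default                        # hypothetical project
spec:
  description: Average error budget burn rate exceeds 20x over 5 minutes
  severity: High
  coolDown: 10m                           # cooldown period
  conditions:
    - measurement: averageBurnRate        # average burn rate exceeds the threshold
      op: gte
      value: 20                           # 20x the expected burn rate
      alertingWindow: 5m
  alertMethods: []                        # attach an alert method, e.g., Slack or PagerDuty
```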
If latency issues reappear after the cooldown period, for example due to sustained high traffic or an unforeseen bottleneck, a new alert is triggered for continued monitoring.