Understand alert conditions
Alert conditions are rules that determine when Nobl9 triggers an alert. Each rule corresponds to a specific condition type, defined as a function of time, burn rate, or remaining error budget. You can create up to three conditions for each alert policy, and they form the basis of error budget monitoring and effective incident response. Nobl9 offers the following condition types:
- Remaining error budget would be exhausted in the near or distant future. The exhaustion time prediction becomes more sensitive as your remaining budget decreases: once your SLO has no error budget left, even the slightest burn triggers an alert.
- Entire error budget would be exhausted in the near or distant future. This prediction is based on the allocation of your entire error budget and depends only on the current burn rate. Use it to define alerts based on time rather than the burn rate function, and to keep the remaining budget value from influencing the prediction.
- Average error budget burn rate is greater than or equal to the threshold and lasts for a defined period. This condition helps catch burn rate spikes regardless of how much budget has already been burned.
- Remaining error budget is below the threshold. This is the most straightforward configuration: it alerts you when the error budget drops to a specific level, regardless of how quickly or slowly you get there.
- Budget drop measures the percentage decrease in the error budget, as shown on the Remaining Error Budget chart. It can be used as an alternative to the average burn rate condition.
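In the sloctl YAML representation, each condition type corresponds to a measurement in the conditions section of an AlertPolicy. The snippet below is a minimal sketch of that mapping, assuming the standard AlertPolicy format; the operators and values are illustrative, and a real policy combines at most three conditions:

```yaml
# Sketch: how the condition types above map to AlertPolicy conditions.
# Shown together for illustration only; verify measurement names and
# operators against your Nobl9 version (budget drop is omitted here).
conditions:
  - measurement: timeToBurnBudget        # remaining error budget would be exhausted
    op: lt
    value: 4h
  - measurement: timeToBurnEntireBudget  # entire error budget would be exhausted
    op: lt
    value: 24h
  - measurement: averageBurnRate         # average burn rate exceeds the threshold
    op: gte
    value: 2
  - measurement: burnedBudget            # remaining error budget is below the threshold
    op: gte
    value: 0.8                           # i.e., less than 20% of the budget remains
```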
Prediction-based conditions
Let's walk through prediction-based conditions using Remaining error budget would be exhausted as an example.
An SRE team uses the Pod readiness transitions SLO to monitor pod state changes, since frequent transitions between ready and not ready states indicate potential instability. The team configured the Remaining error budget would be exhausted alerting policy to get notified when the SLO is about to run out of error budget, so they can take preemptive action. They set the cooldown period to 15 minutes to ensure that transient stabilization events, like pods recovering post-deployment, don't continuously trigger new alerts.
The alerting policy is configured as follows:
- Alerting window: 10 minutes
- Alert condition: Remaining error budget would be exhausted in 4 hours
- Cooldown period: 15 minutes
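A minimal sketch of this policy in sloctl-style YAML, assuming the standard AlertPolicy format (the name, project, and description are hypothetical; verify field names and values against your Nobl9 version):

```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: pod-readiness-budget-exhaustion   # hypothetical name
  project: default                        # hypothetical project
spec:
  description: Remaining error budget would be exhausted within 4 hours
  severity: High
  coolDown: 15m                           # cooldown period
  conditions:
    - measurement: timeToBurnBudget       # remaining error budget would be exhausted
      op: lt
      value: 4h
      alertingWindow: 10m
  alertMethods: []                        # attach an alert method, e.g., Slack or PagerDuty
```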
The cooldown period ensures that if a similar metric spike momentarily reappears, for example because rollback propagation is delayed, another alert isn't triggered unless the instability persists for an extended period.
After 15 minutes of stable metrics, the alert automatically resolves, giving the team confidence that the service is back to normal. No repeated alerts are triggered unless conditions worsen significantly after the cooldown.
Threshold-based conditions
Let's walk through threshold-based conditions using Average error budget burn rate exceeds the threshold as an example.
An SRE team uses the API server response latency SLO to measure the ratio of good responses (latency under 200 ms) to total responses. They configure the Average error budget burn rate alerting policy to detect sharp spikes in latency.
The alerting policy is configured as follows:
- Alerting window: 5 minutes
- Alert condition: Burn rate exceeds 20x
- Cooldown period: 10 minutes
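A corresponding sketch in the same sloctl-style YAML, again assuming the standard AlertPolicy format with a hypothetical name and project:

```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: api-latency-fast-burn             # hypothetical name
  project: default                        # hypothetical project
spec:
  description: Average error budget burn rate exceeds 20x over 5 minutes
  severity: High
  coolDown: 10m                           # cooldown period
  conditions:
    - measurement: averageBurnRate        # average burn rate exceeds the threshold
      op: gte
      value: 20                           # 20x the expected burn rate
      alertingWindow: 5m
  alertMethods: []                        # attach an alert method, e.g., Slack or PagerDuty
```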
If latency issues reappear after the cooldown period, for example due to sustained high traffic or an unforeseen bottleneck, a new alert is triggered for continued monitoring.