Observation model for Nobl9 alerting
Nobl9 alerting logic operates alongside the SLO calculations. If an SLO functions properly, generating results such as reliability burn down and remaining error budget, the alerting system will be operational. However, if there are any issues with the SLI or data source that could affect SLO calculations, Nobl9 won't trigger or resolve alerts. If you want to receive notifications when the SLO stops receiving data, you can set up anomaly notifications for your SLOs.
Some definitions
The observation model (or alerting logic) refers to a system used by Nobl9 to monitor specific parameters (conditions) and trigger alerts when predefined thresholds are met.
You can learn more about SLO inputs in the SLO Inputs section and SLO calculations in the SLO Calculations section.
Alert policy configuration
You can configure an alert policy using one of two parameters: alertingWindow or lastsFor.
Here's an example of an alert policy with the alertingWindow parameter:
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: fast-burn
  project: default
spec:
  alertMethods: []
  conditions:
    - alertingWindow: 30m
      measurement: averageBurnRate
      value: 20
      op: gte
  coolDown: 5m
  description: "Fast Burn Policy that triggers when the average burn rate based on the last 30m is greater than 20x"
  severity: Medium
Here's an example of an alert policy with the lastsFor parameter:
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: fast-burn
  project: default
spec:
  alertMethods: []
  conditions:
    - lastsFor: 15m
      measurement: averageBurnRate
      value: 20
      op: gte
  coolDown: 5m
  description: "Fast Burn Policy that triggers when the 5-min average burn rate is greater than 20x for at least 15m"
  severity: Medium
Though averageBurnRate is the measurement in both YAMLs above, the burn rate calculations for those two policies will differ.
To learn more about the differences, see types of burn rate.
Differences between alertingWindow and lastsFor
You can't use both parameters in a single alert policy. You must choose one or the other.
alertingWindow has a minimum value of 5 minutes, while lastsFor has no minimum value. If you set neither parameter, the alert policy implicitly uses the lastsFor parameter with a default value of 0 minutes, for backward compatibility reasons.
While it's just a single parameter inside an alert policy, the choice between alertingWindow and lastsFor can have a significant impact on the alerting logic.
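As a minimal sketch of the default described above, the two conditions fragments below should behave the same way. The explicit `0m` duration is an assumption about the accepted format, modeled on the `15m` and `5m` values used elsewhere on this page.

```yaml
# Variant 1: neither alertingWindow nor lastsFor is set,
# so the policy implicitly falls back to the lastsFor model.
conditions:
  - measurement: averageBurnRate
    value: 20
    op: gte
---
# Variant 2: the same condition with the default written out.
# Assumption: "0m" is the explicit spelling of the 0-minute default.
conditions:
  - lastsFor: 0m
    measurement: averageBurnRate
    value: 20
    op: gte
```

Setting both alertingWindow and lastsFor is not an option, so a policy has to pick one of the two.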
Lasts for
In this observation model, each alert policy is re-evaluated with each incoming point using a 5-minute rolling window. Each minute in this window contributes to the overall evaluation based on the error rate in that minute.
If the error rate is very high for one minute but low for the next four minutes, the evaluated burn rate will be lower than if the error rate were high for all five minutes. This configuration option is best used with the burnedBudget alert condition, which doesn't depend on the static 5-minute evaluation window.
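For illustration, a lastsFor policy built on the burnedBudget condition might look like the sketch below. The policy name, the description, and the threshold are hypothetical, and the value is assumed to be a fraction of the error budget (0.8 meaning 80%); check the alert condition reference before relying on it.

```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: budget-almost-burned   # hypothetical name
  project: default
spec:
  alertMethods: []
  conditions:
    - lastsFor: 15m
      measurement: burnedBudget
      value: 0.8               # assumption: fraction of the error budget (80%)
      op: gte
  coolDown: 5m
  description: "Triggers when at least 80% of the error budget has been burned for 15m"
  severity: Medium
```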
Alerting window
In this observation model, each alert policy is re-evaluated using a dynamic window that you configure. An alert policy with alertingWindow: 30m will produce different results than one with alertingWindow: 5m.
You can think of an alerting window as the period that's being observed by the alert policy. Then, based on the defined alert conditions, Nobl9 runs a set of calculations to determine whether an alert should be triggered for that observed range.
There are two main families of calculations in Nobl9, which also correlate to two different families of alert conditions: burn rate and exhaustion.
Which observation model should I choose?
While both burn rate and exhaustion conditions are available for both observation models, we recommend using the alertingWindow variant.
alertingWindow allows for more precise alerting and gives you more control over the alerting logic by letting you configure the window that's being observed by the alert policy.
With alertingWindow, you don't have to wait for the lastsFor duration to elapse to trigger the alert. This is particularly important for fast-burn policies where every minute counts.
alertingWindow also suits slow-burn alert policies. It allows the observed range to widen and detect slow but ongoing burn rate changes. It is harder to achieve this result using the lastsFor method because the context of errors is lost after 5 minutes, so tracking gradual exhaustion over time is less accurate.
Another advantage of alertingWindow is the ability to set up multi-window, multi-burn-rate alert policies on its basis. It's impossible to set such policies using the lastsFor model.
Using multiple alertingWindow parameters in a single alert policy effectively allows the simultaneous tracking of more than one burn rate evaluation value. When using the lastsFor model, the burn rate evaluation is always based on the last 5 minutes, no matter the value of lastsFor.
You can learn more about multi-window, multi-burn-rate alerts in the Google SRE Workbook.
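A multi-window, multi-burn-rate policy could be sketched as follows. The policy name and description are hypothetical. The 1h and 5m windows with a 14.4x burn rate follow the example pairing in the Google SRE Workbook, and the sketch assumes that every condition in a policy must hold at the same time for the alert to trigger and that `1h` is an accepted duration format.

```yaml
apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: multi-window-fast-burn   # hypothetical name
  project: default
spec:
  alertMethods: []
  conditions:
    # Long window: confirms the burn is significant.
    - alertingWindow: 1h
      measurement: averageBurnRate
      value: 14.4
      op: gte
    # Short window: confirms the burn is still happening.
    - alertingWindow: 5m
      measurement: averageBurnRate
      value: 14.4
      op: gte
  coolDown: 5m
  description: "Triggers when the average burn rate is at least 14.4x over both the last 1h and the last 5m"
  severity: Medium
```

The short window keeps the alert from lingering long after the burn has stopped, while the long window filters out brief spikes.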