Observation model for Nobl9 alerting

Nobl9 alerting logic operates alongside the SLO calculations. If an SLO functions properly, generating results such as reliability burn down and remaining error budget, the alerting system will be operational. However, if there are any issues with the SLI or data source that could affect SLO calculations, Nobl9 won't trigger or resolve alerts. If you want to receive notifications when the SLO stops receiving data, you can set up anomaly notifications for your SLOs.

Some definitions
The observation model (or alerting logic) refers to a system used by Nobl9 to monitor specific parameters (conditions) and trigger alerts when predefined thresholds are met.

note

You can learn more about SLO inputs in the SLO Inputs section and SLO calculations in the SLO Calculations section.

Alert policy configuration

You can configure an alert policy using one of two parameters: alertingWindow or lastsFor.

Here's an example of an alert policy with the alertingWindow parameter:

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
    name: fast-burn
    project: default
spec:
    alertMethods: []
    conditions:
    - alertingWindow: 30m
      measurement: averageBurnRate
      value: 20
      op: gte
    coolDown: 5m
    description: "Fast Burn Policy that triggers when the average burn rate based on the last 30m is greater than 20x"
    severity: Medium

Here's an example of an alert policy with the lastsFor parameter:

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
    name: fast-burn
    project: default
spec:
    alertMethods: []
    conditions:
    - lastsFor: 15m
      measurement: averageBurnRate
      value: 20
      op: gte
    coolDown: 5m
    description: "Fast Burn Policy that triggers when the 5-min average burn rate is greater than 20x for at least 15m"
    severity: Medium

caution

Although averageBurnRate is the measurement in both examples above, the burn rate is calculated differently for the two policies.

To learn more about the differences, see types of burn rate.

Differences between `alertingWindow` and `lastsFor`

You can't use both parameters in a single alert policy. You must choose one or the other.

alertingWindow has a minimum value of 5 minutes, while lastsFor has no minimum value. For backward compatibility, if you set neither parameter, the alert policy implicitly uses lastsFor with a default value of 0 minutes.
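
As a sketch of the backward-compatibility rule above, the condition below sets neither parameter. Based on the description in this section, it would be evaluated as if lastsFor: 0m were set. The policy name and threshold are hypothetical and only serve as an illustration.

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
    name: implicit-lasts-for   # hypothetical name
    project: default
spec:
    alertMethods: []
    conditions:
    # Neither alertingWindow nor lastsFor is set here,
    # so lastsFor: 0m is implied (per the rule described above).
    - measurement: averageBurnRate
      value: 20
      op: gte
    coolDown: 5m
    severity: Medium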

Choose your observation model carefully

While it's just a single parameter inside an alert policy, the choice between alertingWindow and lastsFor can have a significant impact on the alerting logic.

Lasts for

In this observation model, each alert policy is re-evaluated with each incoming data point using a 5-minute rolling window. Each minute in the window contributes to the overall evaluation according to the error rate in that minute.

If the error rate is very high for one minute but low for the next four, the evaluated burn rate will be lower than if the error rate were high for all five minutes. The lastsFor option is best used with the burnedBudget alert condition, which doesn't depend on the static 5-minute evaluation window.
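
A minimal sketch of that pairing might look like the policy below. The policy name, the 10m duration, and the threshold are hypothetical. It also assumes that burnedBudget takes its value as a fraction of the error budget (0.8 for 80%); verify this against the alert condition reference before use.

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
    name: budget-mostly-burned   # hypothetical name
    project: default
spec:
    alertMethods: []
    conditions:
    # Assumption: value is a fraction of the error budget (0.8 = 80%).
    - lastsFor: 10m
      measurement: burnedBudget
      value: 0.8
      op: gte
    coolDown: 5m
    severity: Medium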


Alerting window

In this observation model, each alert policy is re-evaluated using a dynamic window that you configure. An alert policy with alertingWindow: 30m will produce different results than one with alertingWindow: 5m.

You can think of an alerting window as the period observed by the alert policy. Based on the defined alert conditions, Nobl9 runs a set of calculations over that observed range to determine whether an alert should be triggered.

There are two main families of calculations in Nobl9, which correspond to two families of alert conditions: burn rate conditions and exhaustion conditions.

Which observation model should I choose?

While both burn rate and exhaustion conditions are available for both observation models, we recommend using the alertingWindow variant.

More precise alerting

alertingWindow allows for more precise alerting and gives you more control over the alerting logic by allowing you to configure the window that's being observed by the alert policy.

Immediate alert notifications

One of the benefits of using this model is that alerts are triggered immediately when specified conditions are met based on the observed range. There is no need to wait for the lastsFor duration to trigger the alert. This is particularly important for fast-burn policies where every minute counts.

Better detection of slow burn

We also recommend using alertingWindow for slow-burn alert policies. A wider observed range makes it possible to detect slow but ongoing changes in burn rate. This is harder to achieve with the lastsFor method because the context of errors is lost after 5 minutes, so gradual exhaustion over time is detected less accurately.
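
A slow-burn policy along these lines could be sketched as follows. The policy name, the 6h window, and the 6x threshold are illustrative values loosely based on the Google SRE Workbook, not Nobl9 recommendations. Check that the window falls within the range alertingWindow accepts.

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
    name: slow-burn   # hypothetical name
    project: default
spec:
    alertMethods: []
    conditions:
    # Illustrative values: a wide window with a low burn rate threshold.
    - alertingWindow: 6h
      measurement: averageBurnRate
      value: 6
      op: gte
    coolDown: 5m
    description: "Slow Burn Policy that triggers when the average burn rate based on the last 6h is at least 6x"
    severity: Low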

Multi-window multi-burn alerting

Another benefit of using alertingWindow is that it lets you set up multi-window, multi-burn-rate alert policies. Such policies can't be set up with the lastsFor model.

More flexibility

Having multiple alertingWindow parameters in a single alert policy effectively allows the simultaneous tracking of more than one burn rate evaluation value. When using the lastsFor model, the burn rate evaluation is always based on the last 5 minutes, no matter the value of lastsFor.
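
As an illustration, a policy with two alertingWindow conditions might look like the sketch below. The 1h and 5m windows with a 14.4x threshold follow the fast-burn pairing from the Google SRE Workbook and are not Nobl9-specific defaults. The sketch assumes that all conditions in a policy must be met for the alert to trigger, so the short window confirms that the burn seen in the long window is still ongoing.

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
    name: fast-burn-multi-window   # hypothetical name
    project: default
spec:
    alertMethods: []
    conditions:
    # Long window: a significant share of the budget burned over the last hour.
    - alertingWindow: 1h
      measurement: averageBurnRate
      value: 14.4
      op: gte
    # Short window: the burn is still happening right now.
    - alertingWindow: 5m
      measurement: averageBurnRate
      value: 14.4
      op: gte
    coolDown: 5m
    severity: High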

tip

You can learn more about multi-window, multi-burn-rate alerts in the Google SRE Workbook.
