Data anomaly detection

Data anomaly detection is crucial for maintaining reliable SLO monitoring. When the data stream from your data source deviates from expected patterns, your SLOs cannot be calculated properly, creating blind spots in your observability.

Nobl9 offers two ways to detect data anomalies:

  • Manually configured detection of no data. Available to everyone: you set up notifications for your SLO that trigger when it stops reporting data for a duration you set.
  • Automatic data anomaly detection. This is an advanced feature for the Nobl9 Enterprise Edition. With it, four data anomaly types are detected automatically. The auto-detection feature is enabled for all SLOs and uses centralized defaults that can be customized on demand.

Once a data anomaly is detected, Nobl9 creates an SLO annotation with details about the anomaly and a link to the affected SLO.

Manual configuration and auto-detection operate seamlessly alongside each other.

This article describes the data anomalies in detail and suggests troubleshooting steps for each anomaly type.

No data manual configuration

With manually configured No data anomaly detection, you set up notifications for missing data. Notifications are sent through one of the supported alert methods, so ensure you have access to at least one alert method.

To configure data anomaly detection in the Nobl9 Web application, select No data anomaly alert in Step 5 of the SLO wizard. Then, specify how long your SLO must wait for data before sending the notification and select your preferred alert method:

(Screenshot: Setting up a no data notification in the Nobl9 Web application)
  • You can add up to five alert methods per SLO for your manual no data anomaly notifications
  • To receive notifications for no data anomalies, you must have access to both the SLO and the alert method it uses
  • Query parameters, like query delay, can affect when Nobl9 sends notifications for missing data and the duration of the corresponding annotations. This creates a difference in timestamps: SLO charts use the time from the data source, while notifications and annotations are based on when Nobl9's query confirms the anomaly.
(Screenshot: The anomaly annotation is closed after the data stream resumes)

Example for query delay = 5 minutes and alert after = 10 minutes:
Data point                            | Time in a data source | Time in Nobl9 | No data anomaly detection time
Last data point before no data period | 13:00:00              | 13:05:00      | 13:15:00
First data point after no data period | 14:00:00              | 14:05:00      | 14:15:00
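
As a rough illustration of the arithmetic behind this table, the sketch below adds the query delay and the alert-after duration to the timestamp of the last data point. The variable names and the fixed date are illustrative only and are not part of any Nobl9 API.

```python
from datetime import datetime, timedelta

# Illustrative values matching the table above (the date is arbitrary).
query_delay = timedelta(minutes=5)    # delay before Nobl9 queries the data source
alert_after = timedelta(minutes=10)   # configured "alert after" duration for no data

last_point_at_source = datetime.fromisoformat("2024-01-01T13:00:00")

# When Nobl9's query observes the data point.
time_in_nobl9 = last_point_at_source + query_delay   # 13:05:00

# When the no data anomaly is detected and the notification is sent.
detection_time = time_in_nobl9 + alert_after          # 13:15:00

print(time_in_nobl9.time(), detection_time.time())    # 13:05:00 13:15:00
```

The same offsets apply to the second row: the data point that arrives at 14:00:00 in the data source is observed by Nobl9 at 14:05:00, and the anomaly annotation is closed at 14:15:00.
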
Refer to troubleshooting steps to address no data anomalies

Data anomaly auto-detection

Automatic data anomaly detection is available exclusively in the Nobl9 Enterprise Edition

The following data anomaly types are detected:

Data anomaly         | Description                                                                                                | Default waiting time  | Default cooldown
No data              | No data is being reported by an SLO objective for one week                                                | 1 week                | No cooldown
Constant burn        | An SLO objective is constantly burning its error budget for an unusually long time                        | 1 week                | 30 minutes
No burn              | An SLO objective is not burning its error budget for an unusually long time                               | 8 weeks               | No cooldown
Incremental mismatch | A ratio SLO, configured with the incremental data count method, has received a non-incremental data point | Triggered immediately | 1 day
Customization capabilities

Contact Nobl9 Support to set a custom waiting time or cooldown interval for your organization.

No data

This data anomaly type employs a longer-term monitoring mechanism than a manually configured one. It identifies SLOs that have ceased reporting data for an extended period, with a default of one week, to catch silent, persistent failures such as misconfigured queries that return no data or data source downtimes.
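
A minimal sketch of this longer-term check, assuming nothing more than a timestamp comparison; the function name and signature are illustrative and not part of Nobl9.

```python
from datetime import datetime, timedelta

NO_DATA_WAITING_TIME = timedelta(weeks=1)  # default waiting time for the auto-detected No data anomaly

def is_no_data_anomaly(last_data_point: datetime, now: datetime) -> bool:
    """Illustrative check: has the objective stopped reporting for longer than the waiting time?"""
    return now - last_data_point >= NO_DATA_WAITING_TIME

print(is_no_data_anomaly(datetime(2024, 1, 1), datetime(2024, 1, 9)))  # True: 8 days without data
```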

Troubleshooting steps

Most common causes                      | Fixes
Data source connection issues           | Check the connection status
Incorrect query syntax or configuration | Review the query
Network connectivity issues             | Check the status page of your data source
Data source configuration issues        | Examine query parameters (query interval and timeout settings)
Errors in data source-specific fields   | Verify that source-specific settings are correct; pay attention to authentication details and URLs

Constant and no burn

An SLO's error budget should burn and recover in a way that reflects the actual performance patterns of your service. A burn rate that is either consistently high or completely flat often signals a problem: the SLO may be misconfigured, or it may misinterpret the underlying data that feeds the SLI calculations, so it no longer provides a faithful representation of your system's reliability.

Nobl9 automatically detects these patterns and creates a data anomaly annotation with the details of such an issue.

  • Constant burn is detected when an SLO continuously consumes its error budget without any periods of recovery. This can be a sign that the SLO is not providing a complete picture using the data it receives.
  • No burn is detected when an SLO shows perfect or near-perfect reliability over an extended period. While this may seem ideal, it often means the SLO is not sensitive enough to be a meaningful indicator of service quality.

Both scenarios render an SLO ineffective as a reliability measurement tool. A properly configured SLO should burn error budget during actual service degradation and recover during normal operations, providing meaningful insights into your system's health and user experience.
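
The sketch below shows, in deliberately simplified form, the kind of pattern check described above: it classifies a series of error budget readings as constant burn, no burn, or healthy. The function and its logic are illustrative assumptions and do not reproduce Nobl9's internal detection algorithm or its default waiting times.

```python
from typing import Sequence

def classify_burn_pattern(budget_remaining: Sequence[float]) -> str:
    """Classify error budget readings (% remaining, oldest first); illustrative only."""
    deltas = [b - a for a, b in zip(budget_remaining, budget_remaining[1:])]
    if all(d < 0 for d in deltas):
        return "constant burn"   # the budget only goes down and never recovers
    if all(d == 0 for d in deltas):
        return "no burn"         # the budget never moves; the SLO may be insensitive
    return "healthy"             # the budget burns and recovers over time

print(classify_burn_pattern([100, 98, 95, 91, 86]))  # constant burn
print(classify_burn_pattern([100, 100, 100, 100]))   # no burn
print(classify_burn_pattern([100, 97, 97, 99, 95]))  # healthy
```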

Nobl9's default settings for the Constant burn and No burn anomalies are specified in the table below. For the Constant burn data anomaly, Nobl9 also applies a cooldown: during the cooldown, it ignores further signs of constant burn for the same objective and waits for the burn rate to stabilize. If the burn rate doesn't stabilize within the cooldown period, Nobl9 starts the countdown for another Constant burn data anomaly.

Data anomaly  | Default waiting time | Cooldown
Constant burn | 1 week               | 30 minutes
No burn       | 8 weeks              | No cooldown
Interpreting data anomalies

While we do our best to help you identify misconfigurations, the range of potential causes is too extensive to guarantee that all data anomalies are solely due to misconfigurations.

That is why a Constant burn data anomaly may be triggered by an actual incident in your system rather than by a misconfiguration.

Similarly, for No burn, most cases should point to misconfigurations; however, some data anomalies may be produced for a normal, healthy SLO that simply has not burned its error budget.

Treat data anomalies as hints rather than definitive signals that demand an immediate response.

Troubleshooting steps

Most common causes:

  • Threshold SLOs
    • The target of at least one SLO objective is too strict, causing a constant error budget burn
    • An SLO objective's target is too lenient, which prevents the error budget from burning even when the system is not performing well
  • Ratio SLOs (illustrated in the sketch after this list)
    • The numerator is too narrow, and the denominator is too broad, leading to a constant error budget burn
    • The numerator and denominator count almost the same set of events, resulting in reliability of nearly 100% and no error budget burn
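
The sketch referenced in the list above makes both ratio SLO causes concrete. The event counts are made up for demonstration; reliability is simply good events divided by total events.

```python
# Made-up event counts for illustration; reliability = good events / total events.

# Cause 1: the numerator (good events) is too narrow and the denominator too broad,
# so reliability stays far below any realistic target and the budget burns constantly.
good, total = 40, 1_000
print(f"reliability: {good / total:.1%}")   # 4.0%

# Cause 2: the numerator and denominator count almost the same events,
# so reliability sits near 100% and the error budget never burns.
good, total = 999, 1_000
print(f"reliability: {good / total:.1%}")   # 99.9%
```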

Fixes:

  • Use SLI Analyzer to test different target settings
  • Adjust targets based on historical performance data
Read more about advanced troubleshooting

Incremental mismatch

This data anomaly can be detected only in ratio SLOs with the data count method set to Incremental.

For every ratio SLO, the data count method must be specified. It depends on the incoming data stream and can be incremental or non-incremental.

  • Non-incremental metric values can increase and decrease arbitrarily over the time window. Non-incremental SLI charts appear as a sawtooth or step function
    Example: you're monitoring a CPU load, which can increase and decrease arbitrarily
  • Incremental metrics are cumulative. They are characterized by constantly increasing values. Incremental SLI charts are represented by an increasing line that typically drops at the beginning of a new time window
    Example: you're monitoring the total number of requests in a web application

When your incremental ratio SLO receives a non-incremental data point (i.e., a value lower than the previous one), Nobl9 identifies this as an incremental mismatch data anomaly. It then creates an annotation on the SLI chart of the affected objective. To prevent clutter from excessive annotations, a cooldown period of one day is initiated: each objective in your SLOs can have only one incremental mismatch data anomaly per day.
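
A minimal sketch of this check, assuming a stream of cumulative counter values; the function, its signature, and the cooldown bookkeeping are illustrative assumptions, not Nobl9's implementation.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

COOLDOWN = timedelta(days=1)  # at most one incremental mismatch anomaly per objective per day

def check_incremental(prev: Optional[float], current: float,
                      last_anomaly: Optional[datetime], now: datetime) -> Tuple[bool, Optional[datetime]]:
    """Return (is_anomaly, updated last anomaly time) for one incoming data point."""
    if prev is not None and current < prev:                       # non-incremental data point
        if last_anomaly is None or now - last_anomaly >= COOLDOWN:
            return True, now                                      # annotate the anomaly
    return False, last_anomaly

# Example: a cumulative request counter that drops twice within one day.
points = [100, 180, 260, 120, 110]
prev, last_anomaly = None, None
for i, value in enumerate(points):
    now = datetime(2024, 1, 1) + timedelta(minutes=i)
    is_anomaly, last_anomaly = check_incremental(prev, value, last_anomaly, now)
    if is_anomaly:
        print(f"incremental mismatch detected at value {value}")  # fires for 120, not again for 110
    prev = value
```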

Troubleshooting steps

The cause:

  • The data count method is set incorrectly

Fixes:

  • Create another SLO, specifying the proper data count method
    This setting can't be edited, so creating a new SLO is the only option
  • Carefully explore the incoming data stream to determine the correct data count method
  • Use SLI Analyzer to try different data count methods

Data anomalies vs. alerts

While both data anomalies and alerts can create annotations on an SLO objective, they are distinct concepts with different purposes and configurations. The following table summarizes their key differences.

Parameter     | Data anomaly                                                                          | Alert
Definition    | An automatically detected deviation from expected data patterns, common to all SLOs  | A user-defined notification triggered when specific, configured conditions are met
Focus         | The integrity of incoming data                                                        | SLO's error budget or error budget burn rate
Trigger       | Built-in system logic that analyses data patterns                                     | A specific, user-configured alert policy
Rules         | System-defined and cannot be changed by the user                                      | Customizable by the user
Notifications | Doesn't send notifications using alert methods (except for manual No data anomalies)  | Sends notifications using configured alert methods
Silencing     | Cannot be silenced                                                                    | Can be silenced based on user configuration
Manually configured "No data" detection rules

A manually configured No data detection rule follows the same logic as the auto-detection rule, but it must be enabled explicitly for each SLO. It also allows you to customize the waiting time and choose a specific alert method for notifications.

Check out these related guides and references: