
Data anomaly troubleshooting


Data anomalies detected by Nobl9 can be caused by factors outside of Nobl9. This includes issues with a data source or with the data stream itself (for example, slight changes in metric behavior).

However, to ensure that SLO data anomalies actually point to an external issue, we recommend checking the correctness of your SLO and SLO objective settings.

This troubleshooting guide walks you through the issues you can fix in your SLO and data source configurations.

No data

The reasons for your SLO not receiving any data, receiving partial data, or not passing the query check verification can be grouped into two broader categories:

  • Issues with the data source your SLO is connected to, or with network connectivity
  • Issues with the query configured in your SLO
| Reason | How to address |
| --- | --- |
| Connection issues for data sources that use the agent method | • Confirm the connection status on your data source's details page.<br>• Review your agent metrics. |
| Connection issues for data sources that use the direct method | Examine event logs. |
| Inappropriate query parameters | The query interval and timeout must fit the data density:<br>• If the timeout is too short and data is sparse, requests can fail before a data point is emitted.<br>• Check the query interval: requests must be sent frequently enough to capture new data points. |
| Incorrect source-specific settings | Review source-specific fields:<br>• Ensure the authentication credentials, URL, and other provided values are correct.<br>• Verify the validity of any tokens or API keys. |
| Rate limits hit | Nobl9 stops collecting data when the data source's API rate limit is reached. Collection resumes once the rate limit resets. |
| Incorrect query | Ensure the query syntax is correct for your data source. |
| Network issues | Check for network errors between Nobl9 and your data source. |
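As a quick sanity check for the query-parameter row above, the relationship between data density, query interval, and timeout can be sketched in a few lines. This is an illustration with hypothetical names and values, not a Nobl9 API:

```python
# Hypothetical sanity check: do the query timeout and interval fit the
# density of the metric stream? All names and thresholds are illustrative.

def check_query_params(emission_interval_s: float,
                       query_interval_s: float,
                       query_timeout_s: float) -> list:
    """Return a list of likely misconfigurations."""
    problems = []
    # Sparse data + short timeout: the request may give up before a point arrives.
    if query_timeout_s < emission_interval_s:
        problems.append("timeout shorter than the gap between data points")
    # Infrequent queries: new points can be missed between requests.
    if query_interval_s > emission_interval_s:
        problems.append("query interval too long to capture every new point")
    return problems

# A metric emitted every 60 s, queried every 15 s with a 10 s timeout:
print(check_query_params(60, 15, 10))
# -> ['timeout shorter than the gap between data points']
```

The same check with a timeout longer than the emission gap returns an empty list, which is the configuration you want.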
Timestamp persistence feature

Nobl9 integrations with data sources (regardless of the connection method) are resilient to temporary network failures. When your data source becomes available again, Nobl9 catches up on the data missed during the brief outage.

If the data source remains unavailable for an extended period and doesn't recover, Nobl9 cannot collect data from it to resume calculations. In such cases, we recommend checking your data source's status page.


Tools you can use

Resources you can refer to for troubleshooting:

Specific Prometheus queries can impact SLO calculations

Nobl9 queries to Prometheus must not contain functions that extrapolate missing data. Extrapolation over missing timestamps can lead to inconsistent data received by Nobl9.

Any function operating on a range vector (such as rate, increase, or irate) can introduce another issue. In PromQL, a range vector query carries its own interval (written as [x], where x is a duration such as 5m), while Nobl9 requests data at a fixed granularity (for example, 15 seconds for Prometheus). These two intervals are hard to align and may not overlap consistently, so the resulting data can be unpredictable.
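To see why these intervals are hard to align, here is a small simulation in plain Python (not Nobl9 or Prometheus code, and the sample spacing is an assumption): queries issued every 15 seconds over a 5-minute window cover shifting subsets of a sparse series, so any value derived from the window fluctuates even when the underlying series is steady.

```python
# Illustration of range-vector / intake-granularity misalignment.

def points_in_window(timestamps, query_time, window_s=300):
    """Data points covered by a [window_s] range vector ending at query_time."""
    return [t for t in timestamps if query_time - window_s < t <= query_time]

# A counter scraped every 70 s -- sparse relative to 15 s query granularity:
samples = list(range(0, 1200, 70))

# Issue a query every 15 s, each looking back over a 5-minute window:
counts = [len(points_in_window(samples, q)) for q in range(300, 600, 15)]
print(counts)
# The number of samples per window drifts between 4 and 5, so a rate-style
# value computed from each window would fluctuate for no real reason.
```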

Constant and no burn

These data anomaly types are caused by either too strict or too lenient SLO objective settings. The following settings have an impact:

  • Target
  • Numerator (good or bad) query
  • Denominator (total) query
| Burn type | Threshold SLOs | Ratio SLOs |
| --- | --- | --- |
| Constant burn | The threshold target is too high | • The ratio target is too high<br>• The numerator (good or bad) query is too restrictive<br>• The denominator (total) query is too broad, or queries irrelevant data |
| No burn | The threshold target is too low | • The ratio target is too low<br>• The numerator and denominator queries are nearly identical |

In either case, the SLO fails to reflect the actual state of your system.
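The table above can be made concrete with a small sketch of ratio-SLO burn math. This is an illustration with assumed values, not Nobl9's internal implementation: burn rate here compares the observed error rate with the error budget implied by the target.

```python
# Illustrative burn-rate sketch for a ratio SLO (assumed formula and values):
# burn rate = (1 - good/total) / (1 - target)

def burn_rate(good: float, total: float, target: float) -> float:
    observed_error = 1 - good / total   # fraction of "bad" events
    error_budget = 1 - target           # error fraction the target allows
    return observed_error / error_budget

# Target set higher than the system can deliver -> budget burns constantly:
print(round(burn_rate(good=990, total=1000, target=0.999), 3))  # 10.0

# Numerator and denominator queries nearly identical -> good ~= total -> no burn:
print(burn_rate(good=1000, total=1000, target=0.999))           # 0.0
```

A burn rate pinned well above 1 regardless of system health suggests the target (or the queries) is the problem, while a burn rate pinned at 0 suggests the SLI cannot observe failures at all.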

Tools you can use:

  • SLI Analyzer to experiment with different settings and find a more suitable target
  • Query checker for the Datadog, Dynatrace, and New Relic SLOs

Reimport your historical SLI data

Once the issue is resolved, we recommend replaying your SLO to refill it with historical SLI data for the period when the data anomaly was detected.

Check out these related guides and references: