# Data anomaly troubleshooting
Any of the data anomalies Nobl9 detects can be caused by factors outside of Nobl9, including issues with the data source or with the data stream itself (such as slight changes in metric behavior).
However, before concluding that an SLO data anomaly points to an external issue, we recommend verifying that your SLO and SLO objective settings are correct.
This guide walks you through the issues you can fix in your SLO and data source configurations.
## No data
The reasons for your SLO not receiving any data, receiving partial data, or not passing the query check verification can be grouped into two broader categories:
- Issues with the data source your SLO is connected to, or with network connectivity
- Issues with the query configured in your SLO
| Reason | How to address |
| --- | --- |
| Connection issues for data sources that use the agent method | • Confirm the connection status on your data source's details page. • Review your agent metrics. |
| Connection issues for data sources that use the direct method | Examine event logs. |
| Inappropriate query parameters for the data source | The query interval and timeout must fit the data density: • If the timeout is short and data is sparse, requests can fail before the data is emitted. • Check the query interval: requests must be sent frequently enough to capture new data points. |
| Incorrect source-specific settings | Check the source-specific fields: • Ensure the authentication credentials, URL, and other provided values are correct. • Verify the validity of any tokens or API keys. |
| Rate limits hit | Nobl9 stops collecting data when the data source's API rate limit is reached. Collection resumes once the rate limit resets. |
| Incorrect query | Ensure the query syntax is valid for your data source. |
| Network issues | Check for network errors between Nobl9 and your data source. |
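The interval/timeout mismatch can be illustrated with a short sketch. This is not Nobl9 code; it is a hypothetical model of how a query interval that is shorter than the metric's emission interval leaves most collection windows empty:

```python
# Hypothetical sketch (not Nobl9's implementation): count how many query
# windows actually contain a data point when data is emitted sparsely.
def windows_with_data(emission_interval_s, query_interval_s, horizon_s):
    """Return (windows with at least one point, total windows)."""
    points = set(range(0, horizon_s, emission_interval_s))
    hits = 0
    total = 0
    for start in range(0, horizon_s, query_interval_s):
        total += 1
        # A window is "useful" only if a point falls inside it.
        if any(start <= t < start + query_interval_s for t in points):
            hits += 1
    return hits, total

# Metric emitted every 60 s, queried every 15 s, over a 5-minute horizon:
hits, total = windows_with_data(60, 15, 300)
# Only 5 of 20 windows contain a point; the other 15 return no data.
```

The same reasoning applies to timeouts: if a request times out before the next point is emitted, the window is lost even though the data eventually arrives at the source.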
Nobl9 integrations with data sources (regardless of the connection method) are resilient to temporary network failures. When your data source becomes available again, Nobl9 catches up on the data lost during the brief outage.
If the data source remains unavailable for an extended period and doesn't recover, Nobl9 cannot collect data from it and resume calculations. In such cases, we recommend checking your data source's status page.
Tools you can use:
- Checking the data source connection
  - For data sources connected using the agent method
  - For data sources connected using the direct method
- Checking a query and targets
  - Query checker for the Datadog, Dynatrace, and New Relic SLOs
  - SLI Analyzer
## Specific Prometheus queries can impact SLO calculations
Nobl9 queries to Prometheus cannot contain functions that extrapolate missing data: missing timestamps can lead to inconsistent data received by Nobl9.
Any function using a range vector (like `rate`, `increase`, or `irate`) can introduce another issue: Nobl9 requests data at a specific granularity (e.g., 15 seconds for Prometheus), while range vector queries operate over their own interval (in PromQL, it's represented by `[x]`, where `x` is a duration such as `5m`). It's hard to match the data intake interval with the aggregation function's interval: the two might not overlap, so attempts to align them can produce unpredictable data.
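For example, in a query like the following (a hypothetical query; `http_requests_total` and its labels are placeholders, not metrics from your environment), the `[5m]` range is fixed by the query while the intake granularity is fixed by Nobl9, and the two windows drift against each other:

```promql
# Hypothetical query; http_requests_total is a placeholder metric.
# The [5m] range vector is evaluated independently of Nobl9's ~15 s
# intake granularity, so consecutive samples may overlap or skip
# raw data points, yielding hard-to-predict values.
sum(rate(http_requests_total{status=~"5.."}[5m]))
```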
## Constant and no burn
These data anomaly types are caused by SLO objective settings that are either too strict or too lenient. The following settings have an impact:
- Target
- Numerator (`good` or `bad`) query
- Denominator (`total`) query
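For a ratio SLO, the numerator and denominator might look like the following pair (hypothetical Prometheus queries; the metric name and labels are placeholders, and the exact query shape depends on your setup):

```promql
# good (numerator): count only successful requests
sum(http_requests_total{code=~"2.."})
# total (denominator): count all requests
sum(http_requests_total)
```

If the numerator's filter is too narrow (or the denominator matches data unrelated to the objective), the resulting ratio no longer tracks the behavior you intend to measure.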
| Burn type | Threshold SLOs | Ratio SLOs |
| --- | --- | --- |
| Constant burn | The threshold target is too high | • The ratio target is too high • The numerator is too restrictive • The denominator is too broad, or queries irrelevant data |
| No burn | The threshold target is too low | • The ratio target is too low • The numerator and denominator queries are nearly identical |
In either case, the SLO fails to reflect the actual state of your system.
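The ratio cases in the table can be sketched numerically. This is a hypothetical simplification, not Nobl9's burn-rate calculation, but it shows how the target and the good/total counts interact:

```python
# Hypothetical sketch (not Nobl9's implementation): classify one SLI
# window against a ratio target.
def burn_state(good, total, target):
    """Return whether this window consumes error budget."""
    sli = good / total
    if sli >= target:
        return "no burn"   # error budget untouched
    return "burning"       # error budget being consumed

# Numerator and denominator queries nearly identical -> SLI is ~1.0,
# so even a strict target never burns:
assert burn_state(1000, 1000, 0.999) == "no burn"

# Target set unrealistically high for the service -> every window burns:
assert burn_state(990, 1000, 0.9999) == "burning"
```

In the first case the SLO reports perfect reliability regardless of user experience; in the second it burns budget constantly even when the service behaves normally.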
Tools you can use:
- SLI Analyzer to experiment with different settings and determine a better target
- Query checker for the Datadog, Dynatrace, and New Relic SLOs
## Reimport your historical SLI data
Once the issue is resolved, we recommend replaying your SLO to backfill it with historical SLI data for the period when the data anomaly was detected.