No data—troubleshooting
The reasons for your SLO not receiving any data, receiving partial data, or failing the query check verification fall into two broad categories:
- Issues with the data source your SLO is connected to.
- Issues with the query configured in your SLO.
Issues with data sources
Before more in-depth troubleshooting, we recommend the following:
- Verify the connection status of your data source in the Integrations tab.
- Check the configuration of your data source: ensure that your authentication credentials (for example, API keys), URL, and other source-specific values are correct.
- Check your data source's logs to identify any returned errors.
If you’ve completed the above steps and you’re still facing an issue with your data source, refer to the sections below.
Network issues
Nobl9 SLO calculations can be inaccurate when the Nobl9 agent can’t gather all the necessary data from data sources. This can happen when, for example, there are network issues between Nobl9 and the respective data source. Refer to Agent troubleshooting for more details.
Nobl9 agents are resistant to temporary network failures while trying to receive data from external sources. Whenever the source becomes available again, the agent catches up on the data lost during the brief outage. If the source becomes unavailable for an extended period and doesn't recover, Nobl9 can't collect data from it or resume calculations. In such cases, we recommend checking your source's status page (see below).
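The catch-up behavior described above can be sketched as a polling loop that remembers the last successfully fetched timestamp and, after an outage, requests the whole missed range. This is a simplified illustration, not the agent's actual implementation; `fetch` is a hypothetical function standing in for a data-source query:

```python
import time

def poll_with_catch_up(fetch, start, now_fn=time.time, interval=15):
    """Simplified sketch of an agent loop that tolerates outages.

    `fetch(since, until)` is a hypothetical callable returning data points;
    it may raise ConnectionError on network failure. After recovery, the
    loop re-requests the entire window since the last successful fetch,
    so a brief outage loses no data.
    """
    last_ok = start
    while True:
        now = now_fn()
        try:
            points = fetch(last_ok, now)   # ask for everything missed so far
            last_ok = now                  # advance only on success
            yield points
        except ConnectionError:
            pass                           # transient failure: retry next tick
        time.sleep(interval)
```

The key design point is that `last_ok` advances only after a successful fetch, so the requested window grows during an outage and shrinks back to one tick once the source recovers.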
Status pages:
| Service Name | Link |
| --- | --- |
| Amazon CloudWatch | Click to go to the status page |
| Amazon Prometheus | Click to go to the status page |
| Amazon Redshift | Click to go to the status page |
| AppDynamics | Click to go to the status page |
| BigQuery | Click to go to the status page |
| Datadog | Click to go to the status page |
| Dynatrace | Click to go to the status page |
| Elasticsearch | Click to go to the status page |
| Google Cloud Monitoring | Click to go to the status page |
| Grafana Loki | Click to go to the status page |
| Graphite | Click to go to the status page |
| InfluxDB | Click to go to the status page |
| Instana | Click to go to the status page |
| ServiceNow Cloud Observability | Click to go to the status page |
| New Relic | Click to go to the status page |
| OpenTSDB | Click to go to the status page |
| Pingdom | Click to go to the status page |
| Splunk | Click to go to the status page |
| Splunk Observability | Click to go to the status page |
| Sumo Logic | Click to go to the status page |
| ThousandEyes | Click to go to the status page |
Rate limiting
When integrating with data sources, the Nobl9 agent must comply with the rate limits set by their APIs. Strict rate limits can cause the Nobl9 agent to stop collecting data.
For more details on the API rate limits for each data source, refer to the API Rate Limits section in each data source's documentation.
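To stay within an API's rate limit, a client typically throttles itself before sending requests. A common technique is a token bucket, sketched below; this is an illustrative snippet, not part of the Nobl9 agent:

```python
import time

class TokenBucket:
    """Minimal client-side rate limiter (illustrative only).

    Allows up to `rate` requests per second on average, with bursts
    of up to `capacity` requests.
    """
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller checks `allow()` before each request and waits (or drops the request) when it returns `False`, keeping the request rate under the API's documented limit.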
Issues with queries
An incorrect query in Nobl9 is likely to produce incorrect burn rate calculations. Nobl9 offers simple validations of metric input that ensure you provide all the values required to process your SLI data, but it doesn’t validate more complex queries. Incorrect queries can affect both threshold and ratio metrics. In general, burn rate calculations can be incorrect if:
- Queries return unexplainable or unpredictable data, or no data at all.
- `good` (or `bad`) and `total` queries are swapped. This has happened in the past and is easy to spot: `good` will always be greater than `total` in this case.
Incorrect SLO configuration
Most often, queries return incorrect data when the SLO’s data collection method is set to incremental while its SLI data is, in fact, non-incremental. Check the SLO calculations guide for more details about incremental metrics.
If your SLI data is non-incremental, remember to set the Data Count Method to non-incremental on the Nobl9 Web (screenshot below), or set `incremental: false` in your YAML definition in `sloctl`. Here's a YAML definition for an incremental metric in Prometheus:
```yaml
[...]
objectives:
  - target: 0.75
    countMetrics:
      good:
        amazonPrometheus:
          promql: sum(production_http_response_time_seconds_hist_bucket{method=~"GET|POST",status=~"2..|3..",le="1"})
      incremental: true
      total:
        amazonPrometheus:
          promql: sum(production_http_response_time_seconds_hist_bucket{method=~"GET|POST",le="+Inf"})
    displayName: available1
    timeSliceTarget: 0.75
    value: 1
```
Here's how you can set a ratio metric to the incremental/non-incremental type on the Nobl9 Web:
The `incremental` flag determines how SLO calculations are processed. Set it to true for SLOs whose queries provide Nobl9 with incremental data. By incremental data, we mean a value `v` that, for each point in time `t`, is greater than or equal to the previous value (a monotonically non-decreasing function):
v(t) ≤ v(t+1)
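As a quick sanity check before setting the flag, you can verify that a sample of your raw SLI series is non-decreasing. This is an illustrative snippet, not part of Nobl9 tooling:

```python
def is_incremental(series):
    """Return True if the series is monotonically non-decreasing,
    i.e. v(t) <= v(t+1) for every consecutive pair of samples."""
    return all(a <= b for a, b in zip(series, series[1:]))
```

A counter such as total requests served passes this check; a gauge such as current response time typically does not, and its SLO should use the non-incremental data count method.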
Specific Prometheus queries can impact SLO calculations
Nobl9 queries to Prometheus can’t contain the following functions:
- `rate` (see Query functions | Prometheus)
- `increase` (see Query functions | Prometheus)
- `irate` (see Query functions | Prometheus)
These three functions extrapolate missing data, so if different timestamps are missing each time the Nobl9 agent queries the data source, the returned data can be inconsistent.
Any function that takes a range vector as a parameter (such as `rate`, `increase`, `irate`, and others) is potentially vulnerable to another type of issue, since Nobl9 asks for data with a given granularity (15 seconds for Prometheus).
Range vector queries introduce an interval (in PromQL, it’s represented by `[x]`, where `x` is a duration such as `5m`), and it's hard to match the data intake interval with the aggregation function interval: the two might not overlap, so the data will be unpredictable.
For query issues related to other sources, check specific Sources documentation.
Troubleshooting complete: reimport your historical SLI data
Once you’ve resolved your issues, we advise running Replay for your SLO to refill it with historical SLI data for the period when your SLO wasn’t collecting data, or was collecting it only partially.