Query customization
By default, Nobl9 queries a data source every minute for the last minute of data.
However, data sources can have limitations related to data consistency, rate limits, resource constraints on the platform, or network overhead.
For smooth operations and consistent data flow to Nobl9, you can configure querying behavior with the following parameters:
- Query interval
- Query delay
- Jitter
- Timeout
This article explains how these parameters work and how they affect other Nobl9 features. It also describes how they depend on one another, so you can configure them correctly and avoid simultaneous queries and other unexpected effects.
Currently, only the query delay can be configured on the Nobl9 Web, sloctl, and Terraform.
To modify the rest of the parameters, contact Nobl9 support.
Query interval
The query interval sets how often Nobl9 queries a data source. It also defines the length of the data retrieval period.
Suppose it's 15:00 now.
The query interval = 2min.
Nobl9 sends a query every 2 minutes for the data collected by the data source over the latest 2-minute interval.
Query sending time | Data retrieval period |
---|---|
15:00 | 14:58-15:00 |
15:02 | 15:00-15:02 |
15:04 | 15:02-15:04 |
- To retrieve fresher data, set a smaller query interval value.
For example, when you need to get alerts as soon as possible.
However, more frequent queries can hit the data source's rate limits.
- To reduce the number of requests while keeping the same data retrieval period, increase the query interval.
For example, with a 1-minute query interval, Nobl9 sends five queries to retrieve 5 minutes of data. The same data is retrieved with one query when the interval is 5 minutes.
The higher the value, the more outdated the retrieved data is.
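The trade-off above comes down to simple arithmetic. Here is a minimal sketch, assuming queries are evenly spaced and each one covers exactly one query interval. `queries_needed` is a hypothetical helper for illustration, not part of any Nobl9 tool:

```python
import math
from datetime import timedelta


def queries_needed(period: timedelta, interval: timedelta) -> int:
    """Number of queries required to cover `period` with the given query interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # timedelta / timedelta yields a float ratio; round up to cover a partial interval.
    return math.ceil(period / interval)


five_minutes = timedelta(minutes=5)
print(queries_needed(five_minutes, timedelta(minutes=1)))  # 5
print(queries_needed(five_minutes, timedelta(minutes=5)))  # 1
```

Comparing the result for a longer period (for example, one hour) against a data source's documented rate limit can help pick an interval that stays within it.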
Interaction with other parameters
- Jitter must be less than or equal to the query interval
Otherwise, data can arrive out of order, which can lead to data being rejected.
- Timeout must be less than or equal to the query interval
A timeout longer than the query interval can cause multiple simultaneous queries and unexpected effects.
Impact on Nobl9 features
- Alerting: high values can delay important alerts
For example, the query interval is 30 minutes and the query delay is 0. An incident occurred 20 minutes ago, and Nobl9 sent the last query 25 minutes ago. The agent will query data for this period within the next 5 minutes, so the notification will lag by at least 25 minutes.
- Service Health dashboards: services can fall into the No data category
When the query interval is greater than or equal to the dashboard's time window, Nobl9 lacks time to accumulate data for categorization.
Nobl9 doesn't send alerts on incidents that took place 1 hour ago or earlier.
Query delay
Query delay is the amount of time that Nobl9 waits before sending a query. The delay is counted from the time set by the query interval. Although the query delay postpones sending a query, it doesn't affect the data retrieval period. Regardless of the query delay value, Nobl9 will always query for data as it is set by the query interval.
In short, the query delay helps when data in the data source is inconsistent or not yet available.
Suppose it's 15:00 now.
The query interval = 2min.
Query sent at, query delay = 0 min | Query sent at, query delay = 1 min | Data retrieval period |
---|---|---|
15:00 | 15:01 | 14:58-15:00 |
15:02 | 15:03 | 15:00-15:02 |
15:04 | 15:05 | 15:02-15:04 |
15:06 | 15:07 | 15:04-15:06 |
The above example shows that the data retrieval period is the same, regardless of whether the query is sent with a delay or without it.
The delay gives the data source time to index data properly across all its partitions.
This helps ensure that the retrieved data is fresh and consistent.
The particular value of the query delay required to ensure data consistency depends on the data source.
Read more about data consistency models.
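The table above can be reproduced with a short sketch. It assumes the schedule is anchored at a known tick (15:00 here, on an arbitrary date) and ignores jitter. `query_schedule` is a hypothetical helper used for illustration only:

```python
from datetime import datetime, timedelta


def query_schedule(first_tick, interval, delay, count):
    """Return (send_time, window_start, window_end) for `count` consecutive queries."""
    rows = []
    for i in range(count):
        tick = first_tick + i * interval
        # The delay shifts only the send time; the retrieval window stays
        # aligned with the query interval.
        rows.append((tick + delay, tick - interval, tick))
    return rows


rows = query_schedule(
    first_tick=datetime(2024, 1, 1, 15, 0),  # arbitrary date, 15:00
    interval=timedelta(minutes=2),
    delay=timedelta(minutes=1),
    count=4,
)
for sent, start, end in rows:
    print(f"{sent:%H:%M} | {start:%H:%M}-{end:%H:%M}")
# 15:01 | 14:58-15:00
# 15:03 | 15:00-15:02
# 15:05 | 15:02-15:04
# 15:07 | 15:04-15:06
```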
- To obtain insights and alerts faster, reduce the query delay.
- To ensure the retrieved data is already aggregated and consolidated, increase the query delay.
When you see a discrepancy between the data in a data source and in Nobl9, keep in mind that querying for the last minute can return different results than querying for the same period 10 minutes later.
Impact on Nobl9 features
- Replay: can affect results
For instance, if the query delay is set to 15 minutes or more and the Replay time window is narrow (less than 3 hours, depending on the data source), Replay can import data faster than the query delay duration and overwrite a small portion of already collected data. This can cause inconsistencies and short periods of incorrect results.
- Alerting: can delay important alerts
For example, the query interval is set to 1 minute and the query delay is 10 minutes. An incident occurred 10 minutes ago, and the last query was sent 1 minute ago. The notification can lag by 10-11 minutes, until the agent requests the relevant time range.
- Service Health dashboards: services can fall into the No data category
When the query delay is greater than or equal to the dashboard's time window, Nobl9 lacks time to accumulate data for categorization.
Read more about query delay.
Jitter
Jitter is a time window within which the data source is queried at a random moment.
The agent starts counting the jitter immediately after the query interval or query delay (if any) and then sends the query randomly within the defined range.
Suppose it's 15:00.
The query interval = 2 min.
Jitter value | Query sending time | Data retrieval period |
---|---|---|
0 | 15:00 | 14:58-15:00 |
30 sec | At a random point between 15:00:00 and 15:00:30 | 14:58-15:00 |
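The second row of the table can be sketched as follows. The uniform distribution is an assumption (the text only says the query is sent at a random point within the range), and `jittered_send_time` is a hypothetical helper:

```python
import random
from datetime import datetime, timedelta


def jittered_send_time(scheduled: datetime, jitter: timedelta,
                       rng: random.Random) -> datetime:
    """Pick a send time at a random point within [scheduled, scheduled + jitter]."""
    # Assumption: the offset is drawn uniformly from the jitter window.
    offset_seconds = rng.uniform(0, jitter.total_seconds())
    return scheduled + timedelta(seconds=offset_seconds)


scheduled = datetime(2024, 1, 1, 15, 0)  # arbitrary date, 15:00
rng = random.Random(42)                  # seeded only to make the sketch repeatable
no_jitter = jittered_send_time(scheduled, timedelta(0), rng)
with_jitter = jittered_send_time(scheduled, timedelta(seconds=30), rng)
print(no_jitter == scheduled)                                         # True
print(scheduled <= with_jitter <= scheduled + timedelta(seconds=30))  # True
```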
- The smaller the jitter value, the higher the query density for SLOs with more than one SLI.
When your SLO covers a large number of SLIs from a single data source, small jitter values can overload the data source.
- Increasing collection jitter can help when a data source experiences significant load spikes.
A higher collection jitter spreads the load over the query interval.
The query density is lowest when the collection jitter value is equal to the query interval.
Interaction with other parameters
- Query interval + query delay: must be greater than or equal to collection jitter
Otherwise, data can arrive out of order, which can lead to some data being rejected.
For SLOs based on data sources connected with the direct method, we recommend jitter values < 1 min.
Impact on Nobl9 features
- SLOs: when your SLO covers several SLIs based on the same data source, low collection jitter values can exhaust the data source rate limits.
Timeout
Timeout refers to the duration Nobl9 will wait for a response after sending a request to a data source.
If the data source responds at any point within the timeout range, the query is considered successful.
If the data source fails to respond by the end of the timeout period, the query is considered a failure.
In either case, the subsequent queries run as set by other parameters, with no regard to the time when the previous query is finished.
Suppose it's 15:00.
The query interval = 2 min.
The timeout = 15 sec.
Query sending time | Waiting till | Next query sending time |
---|---|---|
15:00:00 | 15:00:15 | 15:02:00 |
15:02:00 | 15:02:15 | 15:04:00 |
- Increase the timeout when the agent reports timeouts, especially for long-running complex queries.
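Because queries are sent on schedule regardless of whether the previous one has finished, a slow data source can leave several queries pending at once. Here is a rough worst-case estimate, assuming every query runs until it times out and queries are spaced exactly one query interval apart (jitter ignored). `max_queries_in_flight` is a hypothetical helper:

```python
import math
from datetime import timedelta


def max_queries_in_flight(interval: timedelta, timeout: timedelta) -> int:
    """Worst-case number of queries pending at once if every query runs until timeout."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # A query started at t stays pending until t + timeout, while a new one
    # starts every `interval`; at least one query is always counted.
    return max(1, math.ceil(timeout / interval))


print(max_queries_in_flight(timedelta(minutes=2), timedelta(seconds=15)))  # 1
print(max_queries_in_flight(timedelta(minutes=2), timedelta(minutes=5)))   # 3
```

With the values from the table (2-minute interval, 15-second timeout), at most one query is pending at any time.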
Interaction with other parameters
- Query interval + query delay: must be greater than or equal to the timeout.
While timeout doesn't affect the querying behavior (the subsequent request starts when the interval passes, disregarding when the previous query finished), timeout longer than the query interval + delay can lead to multiple queries running simultaneously, which can have unexpected effects.
Impact on Nobl9 features
- Replay: timeout values that are too low cause Replay to fail, because the data source cannot return data in time.
Default query parameters
Data source | Query interval | Query delay | Jitter | Timeout |
---|---|---|---|---|
Generic | 1min | 0 | 15s | 30s |
Amazon CloudWatch | 1min | 1min | 15s | 30s |
Amazon Prometheus | 1min | 0 | 15s | 30s |
Amazon Redshift | 1min | 30s | 15s | 30s |
AppDynamics | 1min | 1min | 15s | 30s |
Azure Monitor beta | 1min | 5min | 15s | 60s |
Azure Monitor managed service for Prometheus beta | 1min | 0 | 15s | 30s |
BigQuery | 1min | 0 | 15s | 30s |
Datadog | 2min | 1min | 15s | 30s |
Dynatrace | 1min | 2min | 15s | 30s |
Elasticsearch | 1min | 1min | 15s | 30s |
Google Cloud Monitoring | 1min | 2min | 15s | 50s |
Grafana Loki | 1min | 1min | 15s | 30s |
Graphite | 1min | 1min | 15s | 30s |
InfluxDB | 1min | 1min | 15s | 60s |
Instana | 1min | 1min | 15s | 30s |
LogicMonitor beta | 1min | 2min | 15s | 30s |
New Relic | 1min | 1min | 15s | 30s |
OpenTSDB | 1min | 1min | 15s | 30s |
Pingdom | 1min | 1min | 15s | 30s |
Prometheus | 1min | 0 | 15s | 30s |
ServiceNow Cloud Observability | 1min | 2min | 15s | 30s |
Splunk | 1min | 5min | 20s | 30s |
Splunk Observability | 1min | 5min | 15s | 30s |
Sumo Logic | 2min | 4min | 30s | 30s |
ThousandEyes | 1min | 1min | 15s | 60s |
Query parameters retrieval
You can retrieve the jitter, timeout, and query interval values for agents and directs with the sloctl get [directs | agents] command, for sources that support these parameters. See the table below for minimum agent versions.
Currently, these values are read-only. Changing them in the YAML definition of an agent or direct and applying it has no effect on the source.
apiVersion: n9/v1alpha
kind: Agent
metadata:
  name: my-source
  project: default
spec:
  ...
  sumoLogic:
    ...
  releaseChannel: beta
  interval:
    unit: Minute
    value: 2
  jitter:
    unit: Second
    value: 30
  queryDelay:
    minimumAgentVersion: 0.65.0-beta09
    unit: Minute
    value: 5
  timeout:
    unit: Second
    value: 30
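When processing this output in a script, the unit/value pairs can be converted into durations for comparison. The sketch below is illustrative only: it uses a plain dictionary that mirrors the fields shown above rather than parsing real sloctl output, it handles only the two unit names that appear in the example, and `to_timedelta` is a hypothetical helper:

```python
from datetime import timedelta

# Only the units that appear in the example above; other unit names are not assumed.
_UNIT_SECONDS = {"Second": 1, "Minute": 60}


def to_timedelta(param: dict) -> timedelta:
    """Convert a {"unit": ..., "value": ...} pair into a timedelta."""
    unit, value = param["unit"], param["value"]
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return timedelta(seconds=value * _UNIT_SECONDS[unit])


# Hand-written copy of the query parameters from the example definition.
spec = {
    "interval": {"unit": "Minute", "value": 2},
    "jitter": {"unit": "Second", "value": 30},
    "queryDelay": {"unit": "Minute", "value": 5},
    "timeout": {"unit": "Second", "value": 30},
}
durations = {name: to_timedelta(param) for name, param in spec.items()}
print(durations["interval"])    # 0:02:00
print(durations["queryDelay"])  # 0:05:00
```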
Scope of support
The following table shows minimum versions for agents supporting retrieval of query parameters.
To change values for the query interval, jitter, and timeout, contact Nobl9 support.
Data source | Minimum agent version |
---|---|
Amazon CloudWatch | Beta: 0.71.0-beta, Stable: 0.73.2 |
Amazon Prometheus | Beta: 0.71.0-beta, Stable: 0.73.2 |
Amazon Redshift | Not supported |
AppDynamics | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Azure Monitor | Beta: 0.71.0-beta |
Azure Monitor managed service for Prometheus | Beta: 0.78.0-beta |
BigQuery | Beta: 0.71.0-beta, Stable: 0.73.2 |
Datadog | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Dynatrace | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Elasticsearch | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Google Cloud Monitoring | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Grafana Loki | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Graphite | Not supported |
InfluxDB | Not supported |
Instana | Not supported |
LogicMonitor | Beta: 0.76.0-beta |
New Relic | Beta: 0.70.0-beta04, Stable: 0.73.2 |
OpenTSDB | Not supported |
Pingdom | Beta: 0.71.0-beta, Stable: 0.73.2 |
Prometheus | Beta: 0.70.0-beta04, Stable: 0.73.2 |
ServiceNow Cloud Observability | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Splunk | Beta: 0.70.0-beta04, Stable: 0.73.2 |
Splunk Observability | Not supported |
Sumo Logic | Beta: 0.73.0-beta, Stable: 0.73.2 |
ThousandEyes | Beta: 0.71.0-beta Stable: 0.73.2 |
Key takeaways
Parameter | When to modify? | Other parameters dependency | Impact on Nobl9 features |
---|---|---|---|
Query interval: frequency | Retrieve fresh data; reduce the number of requests to the data source | Jitter, timeout ≤ query interval | Alerting: long query intervals can delay important alerts |
Query delay: lag | Get insights and alerts faster; get aggregated and consolidated data | Jitter, timeout ≤ query interval + query delay | Replay: long query delay can affect results; Alerting: long query delay can postpone important alerts |
Jitter: randomness | Spread the load over the query interval duration | Query interval + query delay ≥ jitter | SLOs: short jitter can exhaust rate limits for SLOs with many SLIs |
Timeout: wait time | The agent reports timeouts | Query interval + query delay ≥ timeout | Replay, SLI analyzer, SLOs: a short timeout makes it impossible to receive a response |
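The dependency column of the table can be turned into a quick consistency check. This is a minimal sketch, not a Nobl9 tool: `check_query_params` is a hypothetical helper, and it simply evaluates each inequality listed in the table against a set of values:

```python
from datetime import timedelta


def check_query_params(interval, delay, jitter, timeout):
    """Return the rules from the table that the given values violate."""
    rules = [
        ("jitter <= query interval", jitter <= interval),
        ("timeout <= query interval", timeout <= interval),
        ("jitter <= query interval + query delay", jitter <= interval + delay),
        ("timeout <= query interval + query delay", timeout <= interval + delay),
    ]
    return [name for name, satisfied in rules if not satisfied]


# Datadog defaults from the table of default query parameters.
print(check_query_params(
    interval=timedelta(minutes=2), delay=timedelta(minutes=1),
    jitter=timedelta(seconds=15), timeout=timedelta(seconds=30),
))  # []

# A made-up configuration where the timeout is longer than the interval.
print(check_query_params(
    interval=timedelta(minutes=1), delay=timedelta(0),
    jitter=timedelta(seconds=15), timeout=timedelta(seconds=90),
))  # ['timeout <= query interval', 'timeout <= query interval + query delay']
```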