SLI aggregations
Metrics from your data source often contain large data volumes. To present this effectively in charts, Nobl9 downsamples the data whenever the density within a selected time window is too high to display all data points (raw data).
By default, raw data is displayed when up to 400 data points can be rendered within the time window. For example, narrowing a time window to one hour—where the SLI chart is broken into 15-second intervals (four data points per minute)—generally enables raw data display.
However, in certain scenarios, raw data display requirements differ:
- Dense data may require even narrower time windows
- Sparse data allows for wider time windows without losing granularity
While raw data offers a straightforward view, downsampling methods are more nuanced. Understanding the applied downsampling methods provides deeper insights into SLI charts.
Nobl9 employs aggregation to downsample data in SLI charts. Data is aggregated over intervals proportional to the selected time window, meaning wider windows aggregate more data points.
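To illustrate the proportionality, here is a minimal sketch assuming the 400-point display budget and a 15-second raw resolution (four points per minute) mentioned above; how Nobl9 sizes aggregation intervals internally is not documented here.

```python
# A sketch of the raw-vs-aggregated decision, under the assumptions above.
import math

RAW_RESOLUTION_S = 15  # assumption: 4 data points per minute
MAX_POINTS = 400       # display budget from above

def display_mode(window_minutes: int) -> str:
    raw_points = window_minutes * 60 // RAW_RESOLUTION_S
    if raw_points <= MAX_POINTS:
        return f"{raw_points} points -> raw data"
    # Wider windows need proportionally wider aggregation intervals.
    interval_s = math.ceil(raw_points / MAX_POINTS) * RAW_RESOLUTION_S
    return f"{raw_points} points -> aggregated over ~{interval_s} s intervals"

for window in (60, 120, 24 * 60):  # 1 h, 2 h, 24 h
    print(f"{window} min window: {display_mode(window)}")
```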
Aggregation varies depending on the metric type and data count method (for ratio metrics).
- Threshold metrics use percentile aggregation:
  - `min`: the minimum value within each interval
  - `max`: the maximum value within each interval
  - `p1`–`p99`: percentiles of data point values within each interval
- Ratio metrics with the incremental data count method:
  - The `last` data point value is shown for each interval
- Ratio metrics with the non-incremental data count method:
  - The `sum` of data point values within each interval. Applies to the Events aggregation mode
  - The `rate` of data point values per second. Applies to the Events/sec aggregation mode
Threshold metrics aggregation
When visualizing threshold metrics, handling large volumes of data requires efficient aggregation techniques to make sense of trends and anomalies.
Nobl9 uses `min`, `max`, and the `p1`–`p99` percentiles to aggregate this data.
The `max` and `min` functions identify the highest and lowest observed values within a given interval.
Percentiles, on the other hand, provide insight into the distribution of values between these extremes, showing how data points are spread across the range.
For example, suppose your service occasionally experiences latency spikes (a few hours per month).
- You find the maximum acceptable latency of 1–2 seconds to be reasonable.
- You want to evaluate the experience your service provides for at least 95% of users.
- The 95th percentile (`p95`) shows the value of 50 ms, indicating that 95% of your users experience latency of no more than 50 ms.
This data suggests that most users have a far better experience than your acceptable latency threshold, with only occasional spikes exceeding it.
The choice of percentile depends on the objective operator you select when creating your SLO:
- `min`, `p1`, `p5`, `p10`, `p50` for the `>` and `>=` operators
- `max`, `p99`, `p95`, `p90`, `p50` for the `<` and `<=` operators
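As a rough sketch, this mapping can be written as a simple lookup table (illustrative only; it directly restates the list above):

```python
# Illustrative lookup of which aggregated series matter for each
# objective operator, restating the list above.
PERCENTILES_FOR_OPERATOR = {
    ">":  ("min", "p1", "p5", "p10", "p50"),
    ">=": ("min", "p1", "p5", "p10", "p50"),
    "<":  ("max", "p99", "p95", "p90", "p50"),
    "<=": ("max", "p99", "p95", "p90", "p50"),
}
```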
To illustrate the calculations, let's take raw SLI data received over six aggregation intervals of four data points each and calculate percentiles for every dataset:
Dataset | min | p1 | p5 | p10 | p50 | p90 | p95 | p99 | max |
---|---|---|---|---|---|---|---|---|---|
1, 2, 7, 2 | 1 | 1 | 1 | 1 | 2 | 7 | 7 | 7 | 7 |
8, 2, 2, 1 | 1 | 1 | 1 | 1 | 2 | 8 | 8 | 8 | 8 |
1, 6, 1, 2 | 1 | 1 | 1 | 1 | 1.5 | 6 | 6 | 6 | 6 |
2, 1, 1, 6 | 1 | 1 | 1 | 1 | 1.5 | 6 | 6 | 6 | 6 |
1, 8, 1, 2 | 1 | 1 | 1 | 1 | 1.5 | 8 | 8 | 8 | 8 |
3, 3, 7, 1 | 1 | 1 | 1 | 1 | 3 | 7 | 7 | 7 | 7 |
Total | 1 | 1 | 1 | 1 | 2 | 7.5 | 8 | 8 | 8 |
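You can reproduce these numbers with a few lines of Python. The exact interpolation method Nobl9 uses is not documented here; NumPy's `weibull` method happens to match the values in the table.

```python
# A sketch reproducing the percentile table above with NumPy.
import numpy as np

datasets = [
    [1, 2, 7, 2],
    [8, 2, 2, 1],
    [1, 6, 1, 2],
    [2, 1, 1, 6],
    [1, 8, 1, 2],
    [3, 3, 7, 1],
]
percentiles = [0, 1, 5, 10, 50, 90, 95, 99, 100]  # min = p0, max = p100

for data in datasets + [sum(datasets, [])]:  # each interval, then the total
    row = np.percentile(data, percentiles, method="weibull")
    print(np.round(row, 1))
```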
With this data, the SLI chart offers two views: raw data and percentile-aggregated data.
Let's evaluate the dataset against various objective threshold values and operators, using percentile analysis to understand their impact on error budgets.
Analysis of greater-than thresholds:

Threshold | Impact (values burning budget) | Percentile context | Conclusion |
---|---|---|---|
`>3` | ≤3 | 50% of the values (`p50`) are 2 or less | Burns excessive budget (most values fall at or below the threshold) |
`>=1` | <1 | All values are ≥1 | Too lenient (no values burn budget) |

Analysis of less-than thresholds:

Threshold | Impact (values burning budget) | Percentile context | Conclusion |
---|---|---|---|
`<=2` | >2 | 50% of the values (`p50`) are 2 or less | Excessive burn (a third or more of the values would exceed it) |
`<=7.5` | >7.5 | 90% of the values (`p90`) are 7.5 or less | Good sensitivity to spikes with reasonable tolerance |
`<10` | ≥10 | All values are <10 | Too lenient |
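A short script makes these conclusions concrete by counting how many of the 24 raw values above would burn budget under each objective from the tables (a sketch):

```python
# Count what share of the raw SLI values burns budget per objective.
values = [1, 2, 7, 2, 8, 2, 2, 1, 1, 6, 1, 2,
          2, 1, 1, 6, 1, 8, 1, 2, 3, 3, 7, 1]

objectives = {
    "> 3":    lambda v: v > 3,
    ">= 1":   lambda v: v >= 1,
    "<= 2":   lambda v: v <= 2,
    "<= 7.5": lambda v: v <= 7.5,
    "< 10":   lambda v: v < 10,
}

for name, is_good in objectives.items():
    burning = sum(not is_good(v) for v in values) / len(values)
    print(f"{name}: {burning:.0%} of values burn budget")
```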
Ratio metric aggregation
In ratio metrics, downsampled data is also aggregated in intervals, according to the selected time window. Aggregation methods in ratio metrics depend on the data count method you selected when creating your SLO.
Data count method | Aggregation | Description |
---|---|---|
Incremental | last | The last collected data point in an interval |
Non-incremental | sum | The total of all data point values over the interval |
Non-incremental | rate | The per-second rate of data point values collected over the interval. Most accurate when the number of data points per interval is constant |
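In code, the three aggregations amount to the following minimal sketch; the interval length is an assumption here, since Nobl9 derives it from the selected time window:

```python
# Minimal sketch of the three ratio-metric aggregations from the table.
def aggregate_last(points):
    return points[-1]                       # incremental: last collected value

def aggregate_sum(points):
    return sum(points)                      # non-incremental, Events mode

def aggregate_rate(points, interval_seconds=60.0):
    return sum(points) / interval_seconds   # non-incremental, Events/sec mode
```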
Now, suppose the raw SLI data (for a good-over-total ratio metric) received over six aggregation intervals of four data points each is as follows:
Good, incremental | Total, incremental | Good, non-incremental | Total, non-incremental |
---|---|---|---|
3, 3, 7, 9 | 4, 5, 10, 13 | 1, 2, 7, 2 | 4, 5, 10, 5 |
15, 16, 17, 23 | 16, 17, 24, 29 | 8, 2, 2, 1 | 11, 5, 5, 4 |
23, 27, 35, 38 | 33, 37, 38, 39 | 1, 6, 1, 2 | 4, 9, 4, 5 |
41, 44, 60, 68 | 41, 44, 60, 68 | 2, 1, 1, 6 | 5, 4, 4, 9 |
70, 73, 74, 77 | 70, 73, 75, 79 | 1, 8, 1, 2 | 4, 11, 4, 5 |
78, 86, 94, 102 | 80, 86, 94, 102 | 3, 3, 7, 1 | 6, 6, 10, 4 |
Ratio metric SLI charts also display a delta data stream: the difference between the total and good data point values.
The incremental data count method fits best when each data point is greater than or equal to the previous one. The raw data values on the SLI chart are displayed as follows:
For downsampled data, the last data point is displayed in the SLI chart:
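For instance, applying the `last` aggregation to the incremental "good" counts from the table above leaves one point per interval (a sketch):

```python
# Apply `last` to the incremental good counts from the example table.
good_intervals = [[3, 3, 7, 9], [15, 16, 17, 23], [23, 27, 35, 38],
                  [41, 44, 60, 68], [70, 73, 74, 77], [78, 86, 94, 102]]

downsampled = [interval[-1] for interval in good_intervals]
print(downsampled)  # [9, 23, 38, 68, 77, 102]
```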
Analyzing the delta trend line in an incremental ratio metric provides insights into error budget changes.
Delta line trend | Good over total ratio | Bad over total ratio |
---|---|---|
Horizontal line | No new bad events. The burn rate is 0 | All new events are bad. The burn rate is above zero |
Ascending line | New bad events. The burn rate is above zero | Some new bad events. The burn rate is above zero |
Descending line | Potential query misconfiguration, if counters in the data source are not reset | Potential query misconfiguration, if counters in the data source are not reset |
Zero | Same as a horizontal line. No bad events observed since the last counter reset in the data source | Same as a horizontal line. No good events observed since the last counter reset in the data source |
Below zero | A potential query misconfiguration | A potential query misconfiguration |
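Here is a minimal sketch, with hypothetical counters, showing how the delta line is derived and read for a good-over-total metric:

```python
# Derive the delta line (total minus good) from hypothetical counters.
good  = [10, 12, 15, 15, 18]  # cumulative good events
total = [10, 13, 16, 17, 20]  # cumulative total events

delta = [t - g for g, t in zip(good, total)]
print(delta)  # [0, 1, 1, 2, 2] -> ascending spans mean new bad events
```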
The non-incremental data count method is suitable for varying data. The raw data values on the SLI chart are displayed as follows:
Non-incremental ratio metric data is downsampled using either the `sum` or `rate` method, depending on the selected mode.
For the Events mode, the `sum` aggregation is applied, and all data points received over an interval are summed:
The Events/sec mode uses the `rate` aggregation and shows the per-second rate of data point values within an interval:
The Events/sec mode assumes that the data stream is continuous and does not account for possible data gaps. As a result, this mode is most informative and accurate when the point resolution remains constant and there are no interruptions or gaps in the data stream.
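To make the two modes concrete, here is a sketch applying `sum` and `rate` to the non-incremental "good" counts from the table above, assuming 60-second aggregation intervals (four points at 15-second resolution):

```python
# `sum` (Events) vs. `rate` (Events/sec) on the example data.
good_intervals = [[1, 2, 7, 2], [8, 2, 2, 1], [1, 6, 1, 2],
                  [2, 1, 1, 6], [1, 8, 1, 2], [3, 3, 7, 1]]
INTERVAL_SECONDS = 60  # assumption, see above

sums = [sum(i) for i in good_intervals]
rates = [round(sum(i) / INTERVAL_SECONDS, 3) for i in good_intervals]
print(sums)   # [12, 13, 10, 10, 12, 14]
print(rates)  # [0.2, 0.217, 0.167, 0.167, 0.2, 0.233]
```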
Threshold and ratio functional symmetry and combination
You can monitor the same metric, for example, latency, with both threshold and ratio SLOs for different operational goals.
SLO type | Operational goal | SLI | Budget burn trigger | Usage |
---|---|---|---|---|
Threshold | Ensure the average latency per minute stays below a defined threshold (50 ms) | Average latency for all requests in 1 minute | Average latency exceeds 50 ms | Quickly spotting systemic issues, such as server-side processing delays or network slowdowns |
Ratio | Track how many requests meet latency expectations (<50 ms) compared to the total number of requests | Ratio of requests with latency <50 ms to the total number of requests | Ratio drops below a fixed limit (e.g., <95%) | Identifying user-experience degradation, as it focuses on the proportion of "good" experiences instead of averages, which could mask outlier effects |
This example suggests the following practical applications:
- Use a threshold SLO to diagnose internal system performance: threshold breaches help reveal widespread or systemic slowness.
- Use a ratio SLO to prioritize user experience or SLIs that directly reflect meeting user satisfaction goals.
- Monitor the same metric with both SLOs to get a comprehensive view of system operation and how issues impact end users.
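The following sketch, with hypothetical latencies, shows how the two views complement each other: a single slow outlier pushes the average past the threshold, while the good-request ratio still meets a 95% target.

```python
# Hypothetical sketch: one minute of request latencies evaluated both ways.
latencies_ms = [10] * 19 + [2000]  # 19 fast requests, one slow outlier

threshold_sli = sum(latencies_ms) / len(latencies_ms)              # average latency
ratio_sli = sum(l < 50 for l in latencies_ms) / len(latencies_ms)  # share of good requests

print(f"average latency: {threshold_sli:.1f} ms")  # 109.5 ms -> threshold SLO burns (>50 ms)
print(f"good-request ratio: {ratio_sli:.0%}")      # 95% -> ratio SLO holds a 95% target
```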
Using percentiles as SLIs
You can use percentiles as SLIs to monitor metrics at specific performance thresholds. To do so, define the percentile value that matches the level of satisfaction, performance, or behavior you're aiming for. For example:
- 95th percentile latency: 95% of requests complete within a target time
- 99th percentile availability: a service is available for 99% of the time
When using percentiles as SLIs, we recommend the time slices budgeting method. This method divides the SLO time window into short slices and evaluates them independently against the target, marking each slice good or bad. If any time slice fails to meet the percentile SLI requirement, the error budget burns.
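A minimal sketch of this idea, with hypothetical one-minute latency slices and a hypothetical 200 ms `p95` target:

```python
# Each one-minute slice is judged independently against the target.
import numpy as np

def slice_is_good(latencies_ms, target_ms=200.0):
    """A slice is good when its p95 latency stays within the target."""
    return np.percentile(latencies_ms, 95) <= target_ms

slices = [[20, 35, 50, 180], [15, 25, 40, 300], [10, 20, 30, 45]]
good = sum(slice_is_good(s) for s in slices)
print(f"{good}/{len(slices)} slices good")  # each bad slice burns budget
```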
Any stream of data with a timestamp and value can function as an SLI. For example:
- A stream of points representing individual occurrences of an event a customer experienced while using your service, such as logging into the system
- HTTP 500, 403, or other errors that occurred over a specified time frame
- The 95th percentile of service latency
The streams in the example provide different kinds of information that can refer to the same service.
SLI differences between Nobl9 and a data source
Nobl9 attempts to gather four data points every minute from a data source. This is essentially the target or desired rate of data collection.
The actual number of data points coming from a data source might be different—sometimes it can be less than four or more than four. This variability depends on:
- The way the query is written—for example, if it requests data at particular intervals.
- The data source configuration—how frequently the data source is updated or how it structures metrics.
Because an exact four data points per minute might not always be available, the calculated SLI in Nobl9 becomes an approximation of the true metric. In other words, the calculations are based on whatever data points Nobl9 manages to gather, even if the count differs from four.
The term raw in Nobl9 denotes unmodified data points received from the Nobl9 agent. No downsampling is applied to these points when visualized in the platform. As a result, you see the exact data points your data source sent to Nobl9.
Copying and pasting the same query in an SLO and in your data source might yield differing outputs. These discrepancies can stem from factors such as:
- Time alignment: When exactly the metrics are captured and how they are timestamped.
- Data sampling intervals: The interval at which the data source records or returns metrics.
- Data source internals: Any processing, buffering, or caching that the data source does before sending data.
- Eventual consistency: Some data may not be immediately available in the data source when the Nobl9 agent requests it. The agent does not automatically backfill missing data (you can run Replay for this).
Examples:
- Suppose your data source updates a counter metric (for example, the total number of HTTP requests) twice per minute. In this scenario, Nobl9 still attempts to collect four readings. As a result, it might retrieve the first pair of points representing the counter value after the first update and the second pair representing the counter value after the second update. Consequently, the SLI will have four points per minute, while the same counter in the data source will show only two points per minute.
- Consider a highly frequent data stream coming from a service that logs data every few seconds. In this case, you might have eight or ten data points per minute. Every minute, the Nobl9 agent attempts to fetch four points (for SLO calculations), effectively reducing the point density compared to the data source. SLO calculations are therefore performed using four points per minute instead of the full number of points available in the data source.
- You can copy the same query Nobl9 uses and run it directly on your data source's monitoring dashboard. If that dashboard fetches data at slightly different time boundaries, you may notice a change in the results, even for the same time frame. This is because the data source might return data after additional processing, downsampling, or caching, which was not included in the data retrieved by the Nobl9 agent previously.