SLI aggregations


Metrics from your data source often contain large volumes of data. To present them effectively in charts, Nobl9 downsamples the data whenever the density within a selected time window is too high to display all data points (raw data).

By default, raw data is displayed when up to 400 data points can be rendered within the time window. For example, narrowing a time window to one hour—where the SLI chart is broken into 15-second intervals (four data points per minute)—generally enables raw data display.

However, in certain scenarios, raw data display requirements differ:

Dense data may require even narrower time windows
Sparse data allows for wider time windows without losing granularity
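
As a rough sketch of the rule above, the following estimates how many points fall into a time window and compares the count with the 400-point limit. The function and constant names are hypothetical. Only the limit of 400 points and the 15-second interval come from the text, and the way Nobl9 actually counts points is not described here.

```python
from datetime import timedelta

# Limit taken from the text: raw data is shown for up to 400 points.
RAW_POINT_LIMIT = 400


def expected_points(window: timedelta, resolution: timedelta) -> int:
    """Estimate how many data points fall into a window at a given resolution."""
    return int(window / resolution)


def shows_raw_data(window: timedelta, resolution: timedelta) -> bool:
    """Return True when the estimated point count fits within the raw-data limit."""
    return expected_points(window, resolution) <= RAW_POINT_LIMIT


# One hour at 15-second resolution: 240 points, raw data fits.
one_hour = shows_raw_data(timedelta(hours=1), timedelta(seconds=15))
# One day at 15-second resolution: 5760 points, downsampling is needed.
one_day = shows_raw_data(timedelta(days=1), timedelta(seconds=15))
# Sparse data (one point every 5 minutes) over one day: 288 points, raw data fits.
sparse_day = shows_raw_data(timedelta(days=1), timedelta(minutes=5))
```

The last case mirrors the sparse-data scenario: a much wider window still stays under the limit because fewer points arrive per minute.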

While raw data offers a straightforward view, downsampling methods are more nuanced. Understanding the applied downsampling methods provides deeper insights into SLI charts.

Nobl9 employs aggregation to downsample data in SLI charts. Data is aggregated over the intervals proportional to the selected time window, meaning wider windows aggregate more data points.

Aggregation varies depending on the metric type and data count method (for ratio metrics).

Threshold metrics use percentile aggregation:
- min—a minimum value within each interval
- max—a maximum value within each interval
- p1–p99—percentiles of data point values within each interval
Ratio metrics with the incremental data count method:
- The last data point value is shown for each interval
Ratio metrics with the non-incremental data count method:
- The sum of data point values within each interval. Applies to the Events aggregation mode
- The rate between data point values per second. Applies to the Events/sec aggregation mode

Threshold metrics aggregation

When visualizing threshold metrics, handling large volumes of data requires efficient aggregation techniques to make sense of trends and anomalies. Nobl9 uses min, max, and p1–p99 percentiles to aggregate this data.

The max and min functions identify the highest and lowest observed values within a given interval. Percentiles, on the other hand, provide insight into the distribution of values between these extremes, showing how data points are spread across the range.

For example:

Your service occasionally experiences latency spikes (a few hours per month).

You find the maximum acceptable latency of 1–2 seconds to be reasonable.
You want to evaluate the experience your service provides for at least 95% of users.
The 95th percentile (p95) shows the value of 50ms, indicating that 95% of your users experience latency of no more than 50ms.

This data suggests that most users have a far better experience than your acceptable latency threshold, with only occasional spikes exceeding it.

The choice of percentile depends on the operator you select for the objective when creating your SLO:

min, p1, p5, p10, p50 for operators > and >=
max, p99, p95, p90, p50 for operators < and <=
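
The mapping above can be written as a small lookup. This is only an illustration: the dictionary and function names are hypothetical and are not part of any Nobl9 API. The grouping follows from the operators themselves: with `>` and `>=`, low values violate the objective, so the lower end of the distribution is shown, and the reverse holds for `<` and `<=`.

```python
# Mapping copied from the two lines above; the names are hypothetical.
LOWER_END = ["min", "p1", "p5", "p10", "p50"]
UPPER_END = ["max", "p99", "p95", "p90", "p50"]

AGGREGATIONS_BY_OPERATOR = {
    ">": LOWER_END,
    ">=": LOWER_END,
    "<": UPPER_END,
    "<=": UPPER_END,
}


def aggregations_for(operator: str) -> list:
    """Return the aggregations listed for the given objective operator."""
    return AGGREGATIONS_BY_OPERATOR[operator]


print(aggregations_for("<="))
```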

To illustrate the calculations, take raw SLI data received over six aggregation intervals of four points each and calculate the percentiles for every dataset:

Dataset	`min`	`p1`	`p5`	`p10`	`p50`	`p90`	`p95`	`p99`	`max`
`1, 2, 7, 2`	1	1	1	1	2	7	7	7	7
`8, 2, 2, 1`	1	1	1	1	2	8	8	8	8
`1, 6, 1, 2`	1	1	1	1	1.5	6	6	6	6
`2, 1, 1, 6`	1	1	1	1	1.5	6	6	6	6
`1, 8, 1, 2`	1	1	1	1	1.5	8	8	8	8
`3, 3, 7, 1`	1	1	1	1	3	7	7	7	7
Total	1	1	1	1	2	7.5	8	8	8
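
The `min`, `p50`, and `max` columns of the table can be reproduced with a short sketch. It uses Python's standard `statistics.median` for `p50`. The interpolation method behind the other percentiles is not stated in this article, so they are left out here rather than guessed. The `summarize` helper is a hypothetical name.

```python
from statistics import median

# The six datasets from the table above, one list per aggregation interval.
intervals = [
    [1, 2, 7, 2],
    [8, 2, 2, 1],
    [1, 6, 1, 2],
    [2, 1, 1, 6],
    [1, 8, 1, 2],
    [3, 3, 7, 1],
]


def summarize(values):
    """Return the min, median (p50), and max of one interval."""
    return {"min": min(values), "p50": median(values), "max": max(values)}


per_interval = [summarize(values) for values in intervals]
# The "Total" row treats all 24 points as a single dataset.
total = summarize([value for values in intervals for value in values])

for row in per_interval:
    print(row)
print("Total", total)
```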

With this data, the SLI chart for raw and aggregated data is as follows:

[Charts: "Raw data" and "Percentiles" tabs. Raw data chart caption: Raw SLI data received over six 4-point-per-aggregation intervals]

Let's evaluate the dataset against various objective threshold values and operators, using percentile analysis to understand their impact on error budgets.

Analysis of greater than thresholds

Threshold	Impact (values burning budget)	Percentile context	Conclusion
`>3`	`≤3`	50% of the values (`p50`) are 2 or less	Burns excessive budget (most values are below)
`>=1`	`<1`	All values `≥ 1`	Too lenient (no values burn budget)

Analysis of less than thresholds

Threshold	Impact (values burning budget)	Percentile context	Conclusion
`<=2`	`>2`	50% of the values (`p50`) are 2 or less	Excessive burn (a third or more would exceed)
`<=7.5`	`>7.5`	90% of the values (`p90`) are 7.5 or less	Good sensitivity to spikes with reasonable tolerance
`<10`	`≥10`	All values `<10`	Too lenient (no values burn budget)
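
To make the analysis concrete, the sketch below counts which share of the 24 raw values would fail each objective. It treats every raw point as one evaluation, which is a simplification: the actual budget burn depends on the budgeting method of the SLO. The `burning_share` helper is a hypothetical name.

```python
import operator

# All 24 raw values from the six datasets above.
values = [1, 2, 7, 2, 8, 2, 2, 1, 1, 6, 1, 2,
          2, 1, 1, 6, 1, 8, 1, 2, 3, 3, 7, 1]

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def burning_share(values, op, threshold):
    """Return the fraction of values that do not satisfy `value op threshold`."""
    meets_objective = OPERATORS[op]
    burning = [v for v in values if not meets_objective(v, threshold)]
    return len(burning) / len(values)


for op, threshold in [(">", 3), (">=", 1), ("<=", 2), ("<=", 7.5), ("<", 10)]:
    share = burning_share(values, op, threshold)
    print(f"{op}{threshold}: {share:.0%} of values burn budget")
```

For this dataset the shares are 75% for `>3`, 0% for `>=1`, about 33% for `<=2`, about 8% for `<=7.5`, and 0% for `<10`, which matches the conclusions in the tables.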

Ratio metric aggregation

In ratio metrics, downsampled data is also aggregated in intervals, according to the selected time window. Aggregation methods in ratio metrics depend on the data count method you selected when creating your SLO.

Data count method	Aggregation	Description
Incremental	`last`	The last collected data point in an interval
Non-incremental	`sum`	The total of all data point values over the interval
Non-incremental	`rate`	The per-second rate for data point values collected over the interval. Most accurate when the total number of data points is constant

Suppose the raw SLI data (in a good-over-total ratio metric) received over six 4-point-per-aggregation intervals is as follows:

Good, incremental	Total, incremental	Good, non-incremental	Total, non-incremental
`3`, `3`, `7`, `9`	`4`, `5`, `10`, `13`	`1`, `2`, `7`, `2`	`4`, `5`, `10`, `5`
`15`, `16`, `17`, `23`	`16`, `17`, `24`, `29`	`8`, `2`, `2`, `1`	`11`, `5`, `5`, `4`
`23`, `27`, `35`, `38`	`33`, `37`, `38`, `39`	`1`, `6`, `1`, `2`	`4`, `9`, `4`, `5`
`41`, `44`, `60`, `68`	`41`, `44`, `60`, `68`	`2`, `1`, `1`, `6`	`5`, `4`, `4`, `9`
`70`, `73`, `74`, `77`	`70`, `73`, `75`, `79`	`1`, `8`, `1`, `2`	`4`, `11`, `4`, `5`
`78`, `86`, `94`, `102`	`80`, `86`, `94`, `102`	`3`, `3`, `7`, `1`	`6`, `6`, `10`, `4`

Ratio metric SLI charts also display a delta data stream: the difference between the total and good point values.
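
A minimal sketch of the delta computation, using the first interval of the incremental columns from the table above. The variable names are illustrative only.

```python
# First interval of the incremental good and total streams from the table.
good = [3, 3, 7, 9]
total = [4, 5, 10, 13]

# Delta is the point-by-point difference between total and good.
delta = [t - g for g, t in zip(good, total)]

print(delta)  # [1, 2, 3, 4]
```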

Incremental data count method
Non-incremental data count method

The incremental data count method fits best when each data point is greater than or equal to the previous one. The raw data values on the SLI chart are displayed as follows:

last aggregation — Raw SLI data received over six 4-point-per-aggregation intervals

For downsampled data, the last data point is displayed in the SLI chart:
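
The `last` aggregation can be sketched as follows, using the incremental columns from the table above. The `last` helper is a hypothetical name for illustration.

```python
# Incremental good and total streams, one list per aggregation interval.
good_intervals = [
    [3, 3, 7, 9], [15, 16, 17, 23], [23, 27, 35, 38],
    [41, 44, 60, 68], [70, 73, 74, 77], [78, 86, 94, 102],
]
total_intervals = [
    [4, 5, 10, 13], [16, 17, 24, 29], [33, 37, 38, 39],
    [41, 44, 60, 68], [70, 73, 75, 79], [80, 86, 94, 102],
]


def last(points):
    """Keep only the most recent data point of an interval."""
    return points[-1]


good_last = [last(points) for points in good_intervals]
total_last = [last(points) for points in total_intervals]

print(good_last)   # [9, 23, 38, 68, 77, 102]
print(total_last)  # [13, 29, 39, 68, 79, 102]
```

Because an incremental counter already carries the running total, the last point of each interval summarizes everything that happened before it.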

Analyzing the delta trend line in an incremental ratio metric provides insights into error budget changes.

Delta line trend	Good over total ratio	Bad over total ratio
Horizontal line	No new bad events. The burn rate is 0	All new events are bad. The burn rate is above zero
Ascending line	New bad events. The burn rate is above zero	Some new bad events. The burn rate is above zero
Descending line	Potential query misconfiguration, if counters in the data source are not reset	Potential query misconfiguration, if counters in the data source are not reset
Zero	Same as a horizontal line. No bad events observed since the last counter reset in the data source	Same as a horizontal line. No good events observed since the last counter reset in the data source
Below zero	A potential query misconfiguration	A potential query misconfiguration
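
The good-over-total column of the table can be expressed as a simple classifier over two consecutive delta values. This is an illustrative sketch only: the function name and the sample delta series are hypothetical, and the labels simply repeat the table rows.

```python
def classify_delta(previous, current):
    """Label the delta trend between two consecutive points (good-over-total)."""
    if current < 0:
        return "below zero: potential query misconfiguration"
    if current < previous:
        return "descending: potential misconfiguration if counters were not reset"
    if current > previous:
        return "ascending: new bad events, burn rate above zero"
    if current == 0:
        return "zero: no bad events since the last counter reset"
    return "horizontal: no new bad events, burn rate is 0"


# Hypothetical delta series (total minus good) for an incremental metric.
deltas = [0, 0, 2, 2, 5, 3]
labels = [classify_delta(a, b) for a, b in zip(deltas, deltas[1:])]

for label in labels:
    print(label)
```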

The non-incremental data count method is suitable for varying data. The raw data values on the SLI chart are displayed as follows:

Non-incremental ratio metric data is downsampled using either the sum or rate methods, depending on the selected mode.

For the Events mode, the sum aggregation is applied—all data points received over an interval are summed:
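
A minimal sketch of the `sum` aggregation, applied to the non-incremental columns from the table above:

```python
# Non-incremental good and total streams, one list per aggregation interval.
good_intervals = [
    [1, 2, 7, 2], [8, 2, 2, 1], [1, 6, 1, 2],
    [2, 1, 1, 6], [1, 8, 1, 2], [3, 3, 7, 1],
]
total_intervals = [
    [4, 5, 10, 5], [11, 5, 5, 4], [4, 9, 4, 5],
    [5, 4, 4, 9], [4, 11, 4, 5], [6, 6, 10, 4],
]

# Each interval collapses into the sum of its points.
good_sum = [sum(points) for points in good_intervals]
total_sum = [sum(points) for points in total_intervals]

print(good_sum)   # [12, 13, 10, 10, 12, 14]
print(total_sum)  # [24, 25, 22, 22, 24, 26]
```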

The Events/sec mode uses the rate aggregation and shows the proportion of data points per second in an interval:

The Events/sec mode assumes that the data stream is continuous and does not account for possible data gaps. As a result, this mode is most informative and accurate when the point resolution remains constant and there are no interruptions or gaps in the data stream.
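
The effect of gaps can be illustrated with a sketch that assumes the rate is the sum of an interval's points divided by the interval length in seconds. That formula is one plausible reading of the description above, not a confirmed Nobl9 implementation. The 60-second interval corresponds to four points at 15-second spacing.

```python
# Assumption: one aggregation interval spans 60 seconds (four 15-second points).
INTERVAL_SECONDS = 60


def rate(points, interval_seconds=INTERVAL_SECONDS):
    """Per-second rate, assuming the points cover the whole interval."""
    return sum(points) / interval_seconds


# Complete interval: all four points arrived.
full = rate([1, 2, 7, 2])

# Interval with a gap: only two points arrived, yet the divisor is still the
# full 60 seconds, so the missing half of the interval counts as zero events.
with_gap = rate([1, 2])

print(f"{full:.2f} events/sec")      # 0.20 events/sec
print(f"{with_gap:.2f} events/sec")  # 0.05 events/sec
```

Under this assumption, the interval with the gap reports a lower rate even if the service behaved the same way during the missing 30 seconds.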

Additionally, under a non-incremental ratio metric SLI chart, you can see how many good (or bad) and total events your SLO received over the selected time window.

Threshold and ratio functional symmetry and combination

You can monitor the same metric, for example, latency, with both threshold and ratio SLOs for different operational goals.

SLO type	Operational goal	SLI	Budget burn trigger	Usage
Threshold	Ensure the average latency per minute stays below a defined threshold (`50 ms`)	Average latency for all requests in 1 minute	Average latency exceeds 50 ms	Quickly spotting systemic issues, such as server-side processing delays or network slowdowns
Ratio	Track how many requests meet latency expectations (`<50 ms`) compared to the total number of requests	Ratio of requests with latency `<50 ms` to the total number of requests	Ratio drops below a fixed limit (e.g., `<95%`)	Identifying user-experience degradation, as it focuses on the proportion of "good" experiences instead of averages, which could mask outlier effects

The example reveals the following practical implementations:

Use a threshold SLO to diagnose internal system performance: threshold breaches help reveal widespread or systemic slowness.
Use a ratio SLO to prioritize user experience or SLIs that directly reflect meeting user satisfaction goals.
Monitor the same metric with both SLOs to get a comprehensive view of system operation and how issues impact end users.
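
The difference between the two views can be shown with a small sketch. The latency values are hypothetical, and the 50 ms and 95% limits are taken from the table above. It illustrates how an average can mask an outlier that the ratio exposes.

```python
from statistics import mean

# Hypothetical latencies (ms) for ten requests within one minute.
latencies_ms = [10, 10, 10, 10, 10, 10, 10, 10, 10, 300]

# Threshold SLI: average latency for all requests in the minute.
average_latency = mean(latencies_ms)
threshold_met = average_latency < 50

# Ratio SLI: share of requests faster than 50 ms.
good_requests = sum(1 for latency in latencies_ms if latency < 50)
good_ratio = good_requests / len(latencies_ms)
ratio_met = good_ratio >= 0.95

print(f"average={average_latency} ms, threshold met: {threshold_met}")
print(f"good ratio={good_ratio:.0%}, ratio met: {ratio_met}")
```

Here the average is 39 ms, so the threshold objective holds, while only 90% of requests are fast enough, so the ratio objective fails.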

Using percentiles as SLIs

You can use percentiles as SLIs to monitor metrics at specific thresholds of performance. To do this, define the percentile value that matches the level of satisfaction, performance, or behavior you're aiming for. For example:

95th percentile latency: 95% of requests complete within a target time
99th percentile availability: a service is available for 99% of the time

When using percentiles as SLIs, we recommend the time slices budgeting method. This method divides the SLO time window into short slices and evaluates them independently against the target, marking minutes good or bad. If any time slice fails to meet the percentile SLI requirements, the error budget burns.
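
The slice-by-slice evaluation can be sketched as below. The latency values, the 50 ms target, and the nearest-rank percentile helper are all assumptions for illustration. The percentile method your data source uses may differ, and in practice the percentile is usually computed by the data source query itself.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of points at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies (ms), one list per one-minute slice.
slices = [
    [20, 25, 30, 45],
    [20, 22, 35, 48],
    [25, 30, 40, 120],
    [18, 21, 33, 44],
]
TARGET_MS = 50

# A slice is bad when its p95 latency exceeds the target.
bad_minutes = [
    minute
    for minute, values in enumerate(slices, start=1)
    if percentile_nearest_rank(values, 95) > TARGET_MS
]

print(bad_minutes)  # [3]
```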

SLI examples

Any stream of data with a timestamp and value can function as an SLI. For example:

A stream of points representing individual occurrences of an event a customer faced using your service, like logging into the system
HTTP 500, 403, or other errors that occurred over a specified time frame
A 95th percentile of service latency

The streams in the example provide different kinds of information that can refer to the same service.

SLI differences between Nobl9 and a data source

Nobl9 attempts to gather four data points every minute from a data source. This is essentially the target or desired rate of data collection.

The actual number of data points coming from a data source might be different—sometimes it can be less than four or more than four. This variability depends on:

The way the query is written—for example, if it requests data at particular intervals.
The data source configuration—how frequently the data source is updated or how it structures metrics.

Because an exact four data points per minute might not always be available, the calculated SLI in Nobl9 becomes an approximation of the true metric. In other words, the calculations are based on whatever data points Nobl9 manages to gather, even if the count differs from four.

The term raw in Nobl9 denotes unmodified data points received from the Nobl9 agent. No downsampling is applied to these points when visualized in the platform. As a result, you see the exact data points your data source sent to Nobl9.

Copying and pasting the same query in an SLO and in your data source might yield differing outputs. These discrepancies can stem from factors such as:

Time alignment: When exactly the metrics are captured and how they are timestamped.
Data sampling intervals: The interval at which the data source records or returns metrics.
Data source internals: Any processing, buffering, or caching that the data source does before sending data.
Eventual consistency: Some data may not be immediately available in the data source when the Nobl9 agent requests it. The agent does not automatically backfill missing data (you can run Replay for this).

Examples:

Suppose your data source updates a counter metric (for example, the total number of HTTP requests) twice per minute. In this scenario, Nobl9 still attempts to collect four readings. As a result, it might retrieve the first pair of points representing the counter value after the first update and the second pair of points representing the counter value after the second update. Consequently, the SLI will have four points per minute, while the same counter in the data source will show only two points per minute.
Consider a highly frequent data stream coming from a service that logs data every few seconds. In this case, you might have eight or ten data points per minute. Every minute, the Nobl9 agent attempts to fetch four points (for SLO calculations), effectively reducing the point density compared to the data source. SLO calculations are therefore performed using four points per minute instead of the full number of points available in the data source.
You can copy the same query Nobl9 uses and run it directly on your data source’s monitoring dashboard. If that dashboard fetches data at slightly different time boundaries, you may notice a change in the results, even for the same time frame. This is because the data source might return data after additional processing, downsampling, or caching, which was not included in the data retrieved by the Nobl9 agent previously.
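
The first two examples can be simulated with a short sketch. It assumes the agent reads the latest available value at evenly spaced 15-second marks. The real collection timing depends on the query and the data source, and the `sample` helper and all values are hypothetical.

```python
from bisect import bisect_right


def sample(update_times, update_values, sample_times):
    """Return, for each sample time, the latest value written at or before it."""
    samples = []
    for t in sample_times:
        index = bisect_right(update_times, t) - 1
        samples.append(update_values[index] if index >= 0 else None)
    return samples


# Assumed collection marks: four reads per minute, in seconds.
SAMPLE_TIMES = [0, 15, 30, 45]

# Counter updated twice per minute (at seconds 0 and 30).
sparse = sample([0, 30], [100, 140], SAMPLE_TIMES)

# Service logging every 6 seconds: ten points per minute, numbered 0 to 9.
dense_times = list(range(0, 60, 6))
dense = sample(dense_times, list(range(10)), SAMPLE_TIMES)

print(sparse)  # [100, 100, 140, 140]
print(dense)   # [0, 2, 5, 7]
```

In the sparse case each source value appears twice in the SLI. In the dense case only four of the ten source points are used.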

Useful links

For a more in-depth look, consult additional resources:

Service level objectives: Overview
SLI Analyzer: SLO creation aid
Replay: SLO creation and backfill aid
SLO troubleshooting: Troubleshooting
