SLI aggregations

Metrics from your data source often contain large data volumes. To present this effectively in charts, Nobl9 downsamples the data whenever the density within a selected time window is too high to display all data points (raw data).

By default, raw data is displayed when up to 400 data points can be rendered within the time window. For example, narrowing a time window to one hour—where the SLI chart is broken into 15-second intervals (four data points per minute)—generally enables raw data display.
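
As a rough illustration of this rule, the sketch below estimates whether a window can be shown as raw data. It assumes the four-data-points-per-minute collection rate and the 400-point limit described above; the actual rendering logic may be more involved.

```python
# Rough sketch of the raw-vs-downsampled decision, assuming the ~4 data points
# per minute collection rate and the 400-point raw display limit described above.
# Actual Nobl9 rendering logic may differ.

POINTS_PER_MINUTE = 4      # Nobl9's target collection rate
RAW_DISPLAY_LIMIT = 400    # approximate number of points a chart renders as raw

def can_display_raw(window_minutes: float) -> bool:
    """Return True when the expected point count fits within the raw display limit."""
    expected_points = window_minutes * POINTS_PER_MINUTE
    return expected_points <= RAW_DISPLAY_LIMIT

print(can_display_raw(60))       # 1-hour window: 240 expected points -> True (raw)
print(can_display_raw(24 * 60))  # 1-day window: 5760 points -> False (downsampled)
```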

However, in certain scenarios, raw data display requirements differ:

  • Dense data may require even narrower time windows
  • Sparse data allows for wider time windows without losing granularity

While raw data offers a straightforward view, downsampled data is more nuanced: understanding the applied downsampling methods gives you deeper insight into SLI charts.

Nobl9 employs aggregation to downsample data in SLI charts. Data is aggregated over intervals proportional to the selected time window, meaning wider windows aggregate more data points.

Aggregation varies depending on the metric type and data count method (for ratio metrics).

  • Threshold metrics use percentile aggregation:
    • min—a minimum value within each interval
    • max—a maximum value within each interval
    • p1–p99—percentiles of data point values within each interval
  • Ratio metrics with the incremental data count method:
    • The last data point value is shown for each interval
  • Ratio metrics with the non-incremental data count method:
    • The sum of data point values within each interval. Applies to the Events aggregation mode
    • The per-second rate of data point values within each interval. Applies to the Events/sec aggregation mode
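
The rules above could be summarized in code as a minimal sketch. The function and its arguments below are hypothetical helpers for illustration, not part of any Nobl9 API.

```python
# Illustrative mapping of metric type, data count method, and aggregation mode
# to the downsampling aggregation described above.

def aggregation_for(metric_type: str, count_method: str = "", mode: str = "") -> str:
    if metric_type == "threshold":
        return "min, max, and p1-p99 percentiles per interval"
    if metric_type == "ratio" and count_method == "incremental":
        return "last data point value per interval"
    if metric_type == "ratio" and count_method == "non-incremental":
        # The chart's aggregation mode decides between sum and per-second rate
        return "sum per interval" if mode == "Events" else "per-second rate per interval"
    raise ValueError("unknown metric type / data count method combination")

print(aggregation_for("threshold"))
print(aggregation_for("ratio", "incremental"))
print(aggregation_for("ratio", "non-incremental", "Events/sec"))
```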

Threshold metrics aggregation

When visualizing threshold metrics, handling large volumes of data requires efficient aggregation techniques to make sense of trends and anomalies. Nobl9 uses the min, max, and p1–p99 percentiles to aggregate this data.

The max and min functions identify the highest and lowest observed values within a given interval. Percentiles, on the other hand, provide insight into the distribution of values between these extremes, showing how data points are spread across the range.

For example:

Your service occasionally experiences latency spikes (a few hours per month).

  • You consider a maximum acceptable latency of 1–2 seconds to be reasonable.
  • You want to evaluate the experience your service provides for at least 95% of users.
  • The 95th percentile (p95) shows the value of 50ms, indicating that 95% of your users experience latency of no more than 50ms.

This data suggests that most users have a far better experience than your acceptable latency threshold, with only occasional spikes exceeding it.

The choice of percentile depends on the objective operator you select when creating your SLO:

  • min, p1, p5, p10, p50 for operators > and >=
  • max, p99, p95, p90, p50 for operators < and <=

To illustrate calculations, let's take the raw SLI data received over six 4-point-per-aggregation intervals and calculate percentiles for every dataset:

| Dataset | min | p1 | p5 | p10 | p50 | p90 | p95 | p99 | max |
|---|---|---|---|---|---|---|---|---|---|
| 1, 2, 7, 2 | 1 | 1 | 1 | 1 | 2 | 7 | 7 | 7 | 7 |
| 8, 2, 2, 1 | 1 | 1 | 1 | 1 | 2 | 8 | 8 | 8 | 8 |
| 1, 6, 1, 2 | 1 | 1 | 1 | 1 | 1.5 | 6 | 6 | 6 | 6 |
| 2, 1, 1, 6 | 1 | 1 | 1 | 1 | 1.5 | 6 | 6 | 6 | 6 |
| 1, 8, 1, 2 | 1 | 1 | 1 | 1 | 1.5 | 8 | 8 | 8 | 8 |
| 3, 3, 7, 1 | 1 | 1 | 1 | 1 | 3 | 7 | 7 | 7 | 7 |
| Total | 1 | 1 | 1 | 1 | 2 | 7.5 | 8 | 8 | 8 |
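
A minimal sketch of this per-interval percentile calculation is shown below. It uses NumPy's default interpolation; the exact method Nobl9 applies is an assumption here, so some computed values (for example, p50 or p90) may differ slightly from the table.

```python
# Per-interval percentile aggregation over the six 4-point datasets above.
# NumPy's default interpolation is used; Nobl9's exact method may differ.
import numpy as np

intervals = [
    [1, 2, 7, 2],
    [8, 2, 2, 1],
    [1, 6, 1, 2],
    [2, 1, 1, 6],
    [1, 8, 1, 2],
    [3, 3, 7, 1],
]

for data in intervals:
    stats = {"min": min(data), "max": max(data)}
    for p in (1, 5, 10, 50, 90, 95, 99):
        stats[f"p{p}"] = np.percentile(data, p)
    print(data, stats)
```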

With this data, the SLI chart for raw and aggregated data is as follows:

Chart: raw SLI data received over six 4-point-per-aggregation intervals

Let's evaluate the dataset against various objective threshold values and operators, using percentile analysis to understand their impact on error budgets.

| Threshold | Impact (values burning budget) | Percentile context | Conclusion |
|---|---|---|---|
| >3 | Values ≤3 | 50% of the values (p50) are 2 or less | Burns excessive budget (most values fall below the threshold) |
| >=1 | Values <1 | All values are ≥1 | Too lenient (no values burn budget) |
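
The sketch below evaluates the 24 raw values from the datasets above against the same two objectives. A value burns budget when it does not satisfy the objective's operator and threshold; the helper function is illustrative only.

```python
# Count how many raw SLI values would burn error budget for a given
# threshold objective (a value burns budget when it fails the condition).
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

raw_values = [1, 2, 7, 2, 8, 2, 2, 1, 1, 6, 1, 2,
              2, 1, 1, 6, 1, 8, 1, 2, 3, 3, 7, 1]

def burning_share(values, op, threshold):
    good = sum(1 for v in values if OPS[op](v, threshold))
    return 1 - good / len(values)

print(burning_share(raw_values, ">", 3))    # 0.75 -> most values burn budget
print(burning_share(raw_values, ">=", 1))   # 0.0  -> no values burn budget
```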

Ratio metric aggregation

In ratio metrics, downsampled data is also aggregated in intervals, according to the selected time window. Aggregation methods in ratio metrics depend on the data count method you selected when creating your SLO.

| Data count method | Aggregation | Description |
|---|---|---|
| Incremental | last | The last collected data point in an interval |
| Non-incremental | sum | The total of all data point values over the interval |
| Non-incremental | rate | The per-second rate for data point values collected over the interval. Most accurate when the total number of data points is constant |

The raw SLI data (for a good-over-total ratio metric) received over six 4-point-per-aggregation intervals is as follows:

| Good, incremental | Total, incremental | Good, non-incremental | Total, non-incremental |
|---|---|---|---|
| 3, 3, 7, 9 | 4, 5, 10, 13 | 1, 2, 7, 2 | 4, 5, 10, 5 |
| 15, 16, 17, 23 | 16, 17, 24, 29 | 8, 2, 2, 1 | 11, 5, 5, 4 |
| 23, 27, 35, 38 | 33, 37, 38, 39 | 1, 6, 1, 2 | 4, 9, 4, 5 |
| 41, 44, 60, 68 | 41, 44, 60, 68 | 2, 1, 1, 6 | 5, 4, 4, 9 |
| 70, 73, 74, 77 | 70, 73, 75, 79 | 1, 8, 1, 2 | 4, 11, 4, 5 |
| 78, 86, 94, 102 | 80, 86, 94, 102 | 3, 3, 7, 1 | 6, 6, 10, 4 |
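
Applying the three aggregations to the good-event values above could look like the sketch below. It assumes each 4-point interval spans roughly 60 seconds (following the four-points-per-minute target); the interval length matters only for the per-second rate.

```python
# Ratio-metric downsampling aggregations applied to the good-event data above.
# Assumes each 4-point interval spans ~60 seconds, which only affects the rate.

good_incremental = [
    [3, 3, 7, 9], [15, 16, 17, 23], [23, 27, 35, 38],
    [41, 44, 60, 68], [70, 73, 74, 77], [78, 86, 94, 102],
]
good_non_incremental = [
    [1, 2, 7, 2], [8, 2, 2, 1], [1, 6, 1, 2],
    [2, 1, 1, 6], [1, 8, 1, 2], [3, 3, 7, 1],
]

INTERVAL_SECONDS = 60

for incr, non_incr in zip(good_incremental, good_non_incremental):
    last = incr[-1]                   # incremental: last value in the interval
    total = sum(non_incr)             # non-incremental, Events mode
    rate = total / INTERVAL_SECONDS   # non-incremental, Events/sec mode
    print(f"last={last:4d}  sum={total:3d}  rate={rate:.3f}/s")
```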

Ratio metric SLI charts also display a delta data stream: the difference between the total and good data point values.

The incremental data count method fits best when each data point is greater than or equal to the previous one. The raw data values are displayed on the SLI chart as follows:

Chart: raw SLI data received over six 4-point-per-aggregation intervals

For downsampled data, the last data point in each interval is displayed in the SLI chart:

Chart: downsampled SLI data with the "last" aggregation method

Analyzing the delta trend line in an incremental ratio metric provides insights into error budget changes.

| Delta line trend | Good over total ratio | Bad over total ratio |
|---|---|---|
| Horizontal line | No new bad events. The burn rate is 0 | All new events are bad. The burn rate is above zero |
| Ascending line | New bad events. The burn rate is above zero | Some new bad events. The burn rate is above zero |
| Descending line | Potential query misconfiguration, if counters in the data source are not reset | Potential query misconfiguration, if counters in the data source are not reset |
| Zero | Same as a horizontal line. No bad events observed since the last counter reset in the data source | Same as a horizontal line. No good events observed since the last counter reset in the data source |
| Below zero | A potential query misconfiguration | A potential query misconfiguration |
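
For a good-over-total metric, the delta stream and its interpretation could be sketched as below. The delta is computed as total minus good, following the description above; the data is the first incremental row from the table, and the trend labels are simplified restatements of the table.

```python
# Delta stream for an incremental good-over-total ratio metric and a naive
# trend label mirroring the table above (delta = total - good).

good  = [3, 3, 7, 9]
total = [4, 5, 10, 13]

delta = [t - g for g, t in zip(good, total)]   # [1, 2, 3, 4]

def trend(series):
    if any(v < 0 for v in series):
        return "below zero: potential query misconfiguration"
    if all(v == 0 for v in series):
        return "zero: no bad events since the last counter reset"
    if series[-1] > series[0]:
        return "ascending: new bad events, burn rate above zero"
    if series[-1] == series[0]:
        return "horizontal: no new bad events, burn rate is 0"
    return "descending: potential query misconfiguration (if counters were not reset)"

print(delta, "->", trend(delta))
```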

Threshold and ratio functional symmetry and combination

You can monitor the same metric, for example, latency, with both threshold and ratio SLOs for different operational goals.

| SLO type | Operational goal | SLI | Budget burn trigger | Usage |
|---|---|---|---|---|
| Threshold | Ensure the average latency per minute stays below a defined threshold (50 ms) | Average latency for all requests in 1 minute | Average latency exceeds 50 ms | Quickly spotting systemic issues, such as server-side processing delays or network slowdowns |
| Ratio | Track how many requests meet latency expectations (<50 ms) compared to the total number of requests | Ratio of requests with latency <50 ms to the total number of requests | Ratio drops below a fixed limit (e.g., <95%) | Identifying user-experience degradation, as it focuses on the proportion of "good" experiences instead of averages, which could mask outlier effects |

This example suggests the following practical applications:

  • Use a threshold SLO to diagnose internal system performance—threshold breaches help reveal widespread or systemic slowness.
  • Use a ratio SLO to prioritize user experience or SLIs that directly reflect meeting user satisfaction goals.
  • Monitor the same metric with both SLOs to get a comprehensive view of system operation and how issues impact end users.
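
A sketch of both evaluations over the same minute of latency data is shown below. The 50 ms threshold comes from the example above; the sample latencies and the 95% ratio target are made-up illustrations. Here the average stays under the threshold while the share of good requests drops, which is the "averages can mask outliers" effect described in the table.

```python
# The same minute of request latencies evaluated as a threshold SLI (average
# latency) and as a ratio SLI (share of requests under 50 ms).
# Sample latencies and the 95% ratio target are illustrative assumptions.

latencies_ms = [10] * 15 + [60] * 5   # 20 requests collected within one minute

# Threshold-style SLI: average latency vs. the 50 ms objective
avg_latency = sum(latencies_ms) / len(latencies_ms)
threshold_burns = avg_latency > 50     # 22.5 ms -> no budget burn

# Ratio-style SLI: good (< 50 ms) over total vs. a 95% objective
good = sum(1 for v in latencies_ms if v < 50)
ratio = good / len(latencies_ms)
ratio_burns = ratio < 0.95             # 75% good -> budget burns

print(f"average latency {avg_latency:.1f} ms -> threshold SLO burns budget: {threshold_burns}")
print(f"good/total {ratio:.0%} -> ratio SLO burns budget: {ratio_burns}")
```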

Using percentiles as SLIs

You can use percentiles as SLIs to monitor metrics at specific performance thresholds. To do so, define the percentile value that matches the level of satisfaction, performance, or behavior you're aiming for. For example:

  • 95th percentile latency: 95% of requests complete within a target time
  • 99th percentile availability: a service is available for 99% of the time

When using percentiles as SLIs, we recommend the time slices budgeting method. This method divides the SLO time window into short slices and evaluates each slice independently against the target, marking it as good or bad. Whenever a time slice fails to meet the percentile SLI requirement, the error budget burns.
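
A minimal sketch of this evaluation is shown below, assuming one-minute slices, a 200 ms p95 latency target, and made-up per-minute samples (all three are illustrative assumptions, not Nobl9 defaults).

```python
# Time slices budgeting with a p95-latency SLI: each one-minute slice is
# evaluated independently; a slice is "bad" when its p95 exceeds the target.
import numpy as np

TARGET_P95_MS = 200

# Per-minute lists of request latencies (made-up sample data)
slices = [
    [120, 90, 150, 110, 95],    # p95 well under 200 ms -> good minute
    [130, 100, 140, 105, 380],  # one slow request pushes p95 over -> bad minute
    [80, 85, 90, 100, 95],      # good minute
]

bad = sum(1 for s in slices if np.percentile(s, 95) > TARGET_P95_MS)
print(f"bad minutes: {bad}/{len(slices)} (each bad slice burns error budget)")
```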

SLI examples

Any stream of data with a timestamp and value can function as an SLI. For example:

  • A stream of points representing individual occurrences of an event a customer faced using your service, like logging into the system
  • HTTP 500, 403, or other errors that occurred over a specified time frame
  • A 95th percentile of service latency

The streams in this example provide different kinds of information, yet all of them can describe the same service.

SLI differences between Nobl9 and a data source

Nobl9 attempts to gather four data points every minute from a data source. This is essentially the target or desired rate of data collection.

The actual number of data points coming from a data source might differ: sometimes fewer than four, sometimes more. This variability depends on:

  • The way the query is written—for example, if it requests data at particular intervals.
  • The data source configuration—how frequently the data source is updated or how it structures metrics.

Because an exact four data points per minute might not always be available, the calculated SLI in Nobl9 becomes an approximation of the true metric. In other words, the calculations are based on whatever data points Nobl9 manages to gather, even if the count differs from four.
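
The effect could be sketched as below. The sampling pattern (picking roughly four of ten points) and the averaging are illustrative assumptions; they only show how a Nobl9-side value can diverge from a value computed over the full source stream.

```python
# Why the Nobl9-side SLI is an approximation: the source emits many points
# per minute, but calculations use only the ~4 points Nobl9 manages to gather.
# The sampling positions and the averaging below are illustrative assumptions.

source_stream = [48, 51, 47, 250, 49, 52, 50, 46, 53, 49]  # 10 points in a minute
collected = source_stream[::3][:4]                          # ~4 of them reach Nobl9

true_avg = sum(source_stream) / len(source_stream)
sli_avg = sum(collected) / len(collected)

print(f"data source average: {true_avg:.1f}")  # 69.5 (includes the 250 ms spike)
print(f"Nobl9-side average:  {sli_avg:.1f}")   # 99.2 (overweights the spike here)
```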

The term raw in Nobl9 denotes unmodified data points received from the Nobl9 agent. No downsampling is applied to these points when visualized in the platform. As a result, you see the exact data points your data source sent to Nobl9.

Copying and pasting the same query in an SLO and in your data source might yield differing outputs. These discrepancies can stem from factors such as:

  • Time alignment: When exactly the metrics are captured and how they are timestamped.
  • Data sampling intervals: The interval at which the data source records or returns metrics.
  • Data source internals: Any processing, buffering, or caching that the data source does before sending data.
  • Eventual consistency: Some data may not be immediately available in the data source when the Nobl9 agent requests it. The agent does not automatically backfill missing data (you can run Replay for this).

Examples:
  • Suppose your data source updates a counter metric (for example, the total number of HTTP requests) twice per minute. In this scenario, Nobl9 still attempts to collect four readings. As a result, it might retrieve the first pair of points representing the counter value after the first update and the second pair of points representing the counter value after the second update. Consequently, the SLI will have four points per minute, while the same counter in the data source will show only two points per minute.

  • Consider a highly frequent data stream coming from a service that logs data every few seconds. In this case, you might have eight or ten data points per minute. Every minute, the Nobl9 agent attempts to fetch four points (for SLO calculations), effectively reducing the point density compared to the data source. SLO calculations are therefore performed using four points per minute instead of the full number of points available in the data source.

  • You can copy the same query Nobl9 uses and run it directly on your data source’s monitoring dashboard. If that dashboard fetches data at slightly different time boundaries, you may notice a change in the results, even for the same time frame. This is because the data source might return data after additional processing, downsampling, or caching, which was not included in the data retrieved by the Nobl9 agent previously.

For a more in-depth look, consult the additional resources.