SLI aggregations in Nobl9

Reasons to aggregate data

The metrics of your resources can contain vast amounts of raw data. Because the Nobl9 agent collects at most 4 data points per minute for most integrations, it's natural to aggregate this data to match the expected SLI point density.

Use case

There are many aggregation functions that you can take advantage of when configuring a query for your SLO. To illustrate their benefits, we compare the maximum, average, and median functions. Each one of them provides different insights into the data and has both strengths and weaknesses. The choice depends on the specific insights you want to gain from the data and the specifications of the user journey your SLI is designed to reflect.

Let’s look at the following example.

The diagram below shows an example SLI split into six aggregation windows of 4 points each:

  • 1, 2, 7, 2
  • 8, 2, 2, 1
  • 1, 6, 1, 2
  • 2, 1, 1, 6
  • 1, 8, 1, 2
  • 3, 3, 7, 1

[Figure: raw data chart. SLI with the 4-point-per-aggregation windows]

Apply the aggregation functions to each window to calculate three values: median, max, and average. This produces three time series:

  • median: 2, 2, 1.5, 1.5, 1.5, 3
  • max: 7, 8, 6, 6, 8, 7
  • average: 3, 3.25, 2.5, 2.5, 3, 3.5

The characteristics of all these time series are quite different. If we apply the same threshold value to them (for example, 3), error budget exhaustion will also differ.
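
As a rough illustration, the following Python sketch recomputes the three series from the windows listed above and counts how many aggregated points breach the threshold of 3. The helper names (`aggregate`, `points_above`) are hypothetical, and treating a point as a breach only when it is strictly greater than the threshold is an assumption made for this example, not a statement about how Nobl9 evaluates thresholds.

```python
from statistics import mean, median

# The six 4-point windows from the example above.
WINDOWS = [
    [1, 2, 7, 2],
    [8, 2, 2, 1],
    [1, 6, 1, 2],
    [2, 1, 1, 6],
    [1, 8, 1, 2],
    [3, 3, 7, 1],
]

THRESHOLD = 3  # the same threshold is applied to every aggregated series

AGGREGATIONS = {"median": median, "max": max, "average": mean}


def aggregate(windows, func):
    """Collapse each window into a single SLI point using func."""
    return [func(window) for window in windows]


def points_above(series, threshold):
    """Count points strictly greater than the threshold (assumed breach rule)."""
    return sum(1 for value in series if value > threshold)


if __name__ == "__main__":
    for name, func in AGGREGATIONS.items():
        series = aggregate(WINDOWS, func)
        breaches = points_above(series, THRESHOLD)
        print(f"{name}: {series} -> {breaches} of {len(series)} points above {THRESHOLD}")
```

Running the sketch makes the contrast visible: the same raw data and the same threshold lead to very different numbers of breaching points depending on the aggregation.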

Aggregation functions pros and cons

Max

☺ Pros:

  • Highlights the worst-case scenario that can be critical
  • Helps detect spikes or outliers in latency

☹ Cons:

  • Depends on occasional extreme values that can differ from a typical user experience

As shown, there is no right or wrong aggregation. The following are key points to consider when selecting between max, median, and average aggregation:

  • Define the focus of your SLO and design your SLI to represent it.
  • Is one aggregation not enough for your purposes? You can set up two different SLIs, for example, maximum latency and average latency. Set up SLOs for both of these SLIs and monitor them separately.
  • Don’t limit yourself to p50 (median). If you focus on the typical user experience, consider higher percentiles, like p95 or p99.
    Example

    An example query that uses avg aggregation in the Datadog integration looks like the following:

    avg:trace.http.request.duration{service:web,env:production,!http.url_details.path:/}

    Similarly, p90 (90th percentile) aggregation on the same metric would look like the following:

    p90:trace.http.request.duration{service:web,env:production,!http.url_details.path:/}

    Aggregation is done at the query level, so the available functions and their syntax depend on the data source.

    Learn more about data sources.

    Count metrics alternative

    Aggregating data is essential for defining raw metric SLOs over many data points. Alternatively, you can specify the same SLO as a count metrics SLO with two data streams. For example, consider a raw metric SLI that is a stream of data points representing users' latency when accessing an application:

    SLI: Average (latency) for 1 minute
    Threshold: 50 ms

    So, the budget starts burning once the average latency (within 1 minute) exceeds 50 ms.

    A similar SLO based on the count metrics is as follows:

    SLI:
    Good: The number of requests with latency < 50 ms within 1 minute
    Total: The total of all requests

    This SLO burns the budget whenever the number of requests with latency < 50 ms is lower than the total number of requests, that is, whenever at least one request takes 50 ms or longer.

    While both SLOs track users' latency, the count metrics SLO represents user experience more accurately and can be a better fit if you care about every individual event. However, tracking only average or max (or both) latencies can be sufficient for most cases.
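
To make the difference concrete, here is a minimal Python sketch that evaluates the same latencies both ways. The sample latencies and the helper names (`raw_metric_breached`, `count_metric`) are hypothetical and not part of Nobl9; only the 50 ms threshold comes from the example above.

```python
from statistics import mean

THRESHOLD_MS = 50

# Hypothetical request latencies (ms) observed in three 1-minute windows.
MINUTES = {
    "minute 1": [20, 25, 30, 120],  # one slow request hidden by the average
    "minute 2": [45, 48, 49, 47],   # every request is fast
    "minute 3": [60, 70, 40, 55],   # most requests are slow
}


def raw_metric_breached(latencies, threshold):
    """Raw metric view: does the 1-minute average exceed the threshold?"""
    return mean(latencies) > threshold


def count_metric(latencies, threshold):
    """Count metrics view: (good, total), where good means latency < threshold."""
    good = sum(1 for latency in latencies if latency < threshold)
    return good, len(latencies)


if __name__ == "__main__":
    for name, latencies in MINUTES.items():
        breached = raw_metric_breached(latencies, THRESHOLD_MS)
        good, total = count_metric(latencies, THRESHOLD_MS)
        print(f"{name}: average breached={breached}, good/total={good}/{total}")
```

In the first minute of this made-up data, the average stays below 50 ms even though one request out of four is slow, so only the count metrics view registers the bad event.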

    Using percentiles as SLIs

    You can use percentiles as an SLI. To do so, define a specific percentile value that matches the level of satisfaction, performance, or behavior you’re aiming for.

    We recommend the Time slices budgeting method when using percentiles as an SLI, because this method works well with an incoming stream of aggregated data.
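
For intuition about what a percentile stream looks like, the sketch below computes a per-minute p95 from hypothetical latency samples using the nearest-rank method. In practice the percentile is usually computed by the data source at the query level; the `percentile` helper, the sample data, and the choice of nearest-rank are assumptions made purely for illustration.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), which avoids float rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latency samples (ms), 20 requests per minute.
MINUTES = [
    list(range(100, 120)),               # steady traffic
    list(range(100, 119)) + [900],       # a single outlier
    list(range(100, 118)) + [800, 900],  # two outliers
]

# One aggregated point per minute: this is the SLI stream.
p95_stream = [percentile(minute, 95) for minute in MINUTES]
max_stream = [max(minute) for minute in MINUTES]

if __name__ == "__main__":
    print("p95:", p95_stream)
    print("max:", max_stream)
```

With 20 samples per minute, p95 ignores a single outlier but reacts once two requests are slow, whereas max reacts to the first one. Which behavior is preferable depends on the user journey the SLI is meant to reflect.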

    Practical guidelines

    Any stream of data can function as an SLI, provided that the stream carries timestamps and values.

    Data stream examples
    • A stream of points representing individual occurrences of an event a customer faced using your service, like logging into the system
    • HTTP 500, 403, or other errors that occurred over a specified time frame
    • A 95th percentile of service latency

    The streams in the example provide different kinds of information that can refer to the same service.

    ❝ How do I take advantage of that to create meaningful SLOs? And more specifically, can I use percentiles as SLIs? ❞

    While an SLI is a metric that quantitatively measures the performance or behavior of a service, an SLO represents a reliability target: a specific level of performance that is required or expected. This target clearly distinguishes what performance is good and what is bad.

    In other words, if you can define a threshold that distinguishes good events from bad events based on the provided SLI, you can create a meaningful SLO.

    Error budget calculation methods

    Nobl9 supports the Occurrences and Time slices error budget calculation methods. Using aggregations as an input stream is allowed for both budgeting methods.

    However, for aggregated SLIs, Time slices provides the most clarity: the incoming data is already grouped into a series of events, and the overall calculation is time-based. The key is then to determine whether each minute is good or bad based on the incoming aggregated value.
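
The idea of judging each minute as good or bad can be sketched in a few lines of Python. This is a simplified illustration, not Nobl9's actual implementation: the per-minute values, the 50 ms threshold, the 90% target, and the helper names are all hypothetical, and a minute is assumed to be good when its aggregated value does not exceed the threshold.

```python
def classify_minutes(values, threshold):
    """Mark each minute as good (True) when its aggregated value is within the threshold."""
    return [value <= threshold for value in values]


def time_slices_summary(values, threshold, target):
    """Return (good, total, reliability, budget_used) for a series of minutes.

    target is the required fraction of good minutes and must be below 1.
    budget_used is the share of the error budget consumed (1.0 means 100%).
    """
    flags = classify_minutes(values, threshold)
    good = sum(flags)
    total = len(flags)
    reliability = good / total
    budget_used = (1 - reliability) / (1 - target)
    return good, total, reliability, budget_used


# Hypothetical per-minute p95 latency values (ms), one point per minute.
P95_PER_MINUTE = [40, 45, 60, 42, 41, 39, 44, 70, 43, 40]

if __name__ == "__main__":
    good, total, reliability, budget_used = time_slices_summary(
        P95_PER_MINUTE, threshold=50, target=0.9
    )
    print(f"good minutes: {good}/{total}")
    print(f"reliability: {reliability:.0%}, budget used: {budget_used:.0%}")
```

Because every input point already represents one minute, the budget calculation reduces to counting good and bad minutes, which is what makes Time slices easy to reason about for aggregated SLIs.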