SLI aggregations in Nobl9
Reasons to aggregate data
The metrics of your resources can contain vast amounts of raw data. Because the Nobl9 Agent collects at most 4 data points per minute for most integrations, it makes sense to aggregate this data to match the expected SLI point density.
Use case
There are many aggregation functions that you can take advantage of when configuring a query for your SLO. To illustrate their benefits, we compare the maximum, average, and median functions. Each of them provides different insights into the data and has both strengths and weaknesses. The choice depends on the specific insights you want to gain from the data and the specifications of the User Journey your SLI is designed to reflect.
Let's look at the following example.
The example SLI below is split into aggregation windows of 4 points each:
1, 2, 7, 2
8, 2, 2, 1
1, 6, 1, 2
2, 1, 1, 6
1, 8, 1, 2
3, 3, 7, 1
Apply aggregation functions to each window to calculate three values: median, max, and average. This produces three time series:
- median: 2, 2, 1.5, 1.5, 1.5, 3
- max: 7, 8, 6, 6, 8, 7
- average: 3, 3.25, 2.5, 2.5, 3, 3.5
The characteristics of these time series are quite different. If we apply the same threshold value to all of them (for example, 3), their error budget exhaustion will also differ.
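The comparison can be reproduced with a short Python sketch. It uses the example windows above and the standard `statistics` module. The helper names and the strict "greater than" comparison are assumptions made for illustration, not Nobl9 behavior.

```python
from statistics import mean, median

# Aggregation windows copied from the example above (4 points each).
WINDOWS = [
    [1, 2, 7, 2],
    [8, 2, 2, 1],
    [1, 6, 1, 2],
    [2, 1, 1, 6],
    [1, 8, 1, 2],
    [3, 3, 7, 1],
]

THRESHOLD = 3  # the same threshold is applied to every series


def aggregate(windows, func):
    """Collapse each window into a single SLI point using func."""
    return [func(window) for window in windows]


def windows_over_threshold(series, threshold):
    """Count aggregated points that breach the threshold.

    Assumption: a point is bad only when it is strictly greater than the
    threshold; use >= if the threshold value itself should count as bad.
    """
    return sum(1 for value in series if value > threshold)


series_by_function = {
    "median": aggregate(WINDOWS, median),
    "max": aggregate(WINDOWS, max),
    "average": aggregate(WINDOWS, mean),
}

for name, series in series_by_function.items():
    bad = windows_over_threshold(series, THRESHOLD)
    print(f"{name}: {series} -> {bad} of {len(series)} windows over {THRESHOLD}")
```

With this strict comparison, the max series breaches the threshold in every window, while the median series never does.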
- Max
- Median / Percentile
- Average
As you can see, there is no right or wrong aggregation. Consider the following key points when selecting between max, median, and average aggregation:
- Define the focus of your SLO and design your SLI to represent it.
- Is one aggregation insufficient for your purposes? You can set up two different SLIs, for example, Max Latency and Average Latency. Set up SLOs for both of these SLIs and monitor them separately.
- Don't stick to p50 (median). If you focus on the typical user experience, consider higher percentiles, like p95 or p99.
An example query that uses avg aggregation in the Datadog integration looks like the following:
avg:trace.http.request.duration{service:web,env:production,!http.url_details.path:/}
Similarly, p90 (90th percentile) aggregation on the same metric would look like the following:
p90:trace.http.request.duration{service:web,env:production,!http.url_details.path:/}
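The two queries differ only in the aggregation prefix. The sketch below rebuilds both of them from the same metric and tag filters. `datadog_query` is a hypothetical helper written for this illustration; it is not part of any Datadog or Nobl9 library.

```python
def datadog_query(aggregation: str, metric: str, tags: list[str]) -> str:
    """Hypothetical helper: build '<aggregation>:<metric>{<tag>,<tag>,...}'."""
    return f"{aggregation}:{metric}{{{','.join(tags)}}}"


METRIC = "trace.http.request.duration"
TAGS = ["service:web", "env:production", "!http.url_details.path:/"]

# Same metric and filters, different aggregation function.
avg_query = datadog_query("avg", METRIC, TAGS)
p90_query = datadog_query("p90", METRIC, TAGS)

print(avg_query)
print(p90_query)
```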
Aggregations are done at the query level, so how you apply them depends on the integration.
Read more about integration details in Data sources.
Count metrics alternative
Aggregating data is essential when you define Raw Metric SLOs with many data points. Alternatively, you can specify the same SLO as a count metrics SLO with two data streams. Consider a Raw Metric SLI that is a stream of data points representing users' latency in accessing an application:
- SLI: Average (latency) for 1 minute
- Threshold: 50 ms
So, the budget starts burning once the average latency (within 1 minute) exceeds 50 ms. A similar SLO based on the count metrics is as follows:
- SLI:
  - Good: The number of requests with latency < 50 ms within 1 minute
  - Total: The total number of requests
This SLO will burn the budget when the number of requests with a latency < 50 ms is less than the total number of requests.
While both SLOs track users' latency, the count metrics SLO represents user experience more accurately and can be a better fit if you focus on the accuracy of all events. However, tracking only average or max (or both) latencies can be sufficient for most cases.
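The difference between the two approaches can be sketched in Python. The latency values are hypothetical, and the code is a simplified illustration, not Nobl9's calculation. In this minute the average stays under the threshold even though one request was slow, whereas the good and total counts expose that request.

```python
from statistics import mean

THRESHOLD_MS = 50

# Hypothetical request latencies (ms) observed within one minute.
latencies_ms = [10, 20, 30, 120]

# Raw metric SLO: one aggregated point per minute is compared to the threshold.
# The budget burns only when the average exceeds the threshold.
average_ms = mean(latencies_ms)
raw_minute_is_good = average_ms <= THRESHOLD_MS

# Count metrics SLO: two streams, good and total.
good = sum(1 for latency in latencies_ms if latency < THRESHOLD_MS)
total = len(latencies_ms)

print(f"average = {average_ms} ms, minute is good: {raw_minute_is_good}")
print(f"good/total = {good}/{total} ({good / total:.0%} of requests were fast)")
```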
Using percentiles as SLIs
You can use percentiles as an SLI. For this, define a specific percentile value that matches the level of satisfaction, performance, or behavior you're aiming for.
We recommend using the Time Slices calculation method when using percentiles as an SLI. This method is well suited to an incoming aggregated data stream.
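As a rough illustration of how a percentile turns raw points into an aggregated stream, the sketch below computes one p95 value per minute. It uses the nearest-rank definition, which is only one of several common percentile definitions; your data source may interpolate differently. The latency values and names are hypothetical.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of points at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so that whole-number ranks stay exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies (ms) grouped by minute.
latencies_by_minute = {
    "12:00": [12, 15, 11, 14, 13, 90, 12, 16, 14, 13],
    "12:01": [12, 13, 11, 14, 13, 15, 12, 16, 14, 13],
}

# One aggregated SLI point per minute.
p95_stream = {
    minute: percentile(values, 95)
    for minute, values in latencies_by_minute.items()
}
print(p95_stream)
```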
Practical guidelines
Any stream of data can function as an SLI, provided it carries timestamps and values. For example:
- A stream of points representing individual occurrences of an event a customer faced using your service, like logging into the system
- The count of HTTP 500, 403, or other errors that occurred over a specified time frame
- The 95th percentile of service latency
These example streams provide different kinds of information about the same service.
How do I take advantage of that to create meaningful SLOs? And more specifically, can I use percentiles as an SLI?
While an SLI is a metric that quantitatively measures the performance or behavior of a service, an SLO is a reliability target: a specific level of performance that is required or expected. This target clearly distinguishes what performance is good and what is bad.
In other words, if you can define a threshold that distinguishes good events from bad events based on the provided SLI, you can create a meaningful SLO.
SLO calculation methods
Nobl9 supports the Occurrences and Time Slices SLO calculation methods. Using aggregations as an input stream is allowed for both calculation methods.
However, the Time Slices method with an aggregated SLI provides the most clarity because the incoming data is already grouped into a series of events. At the same time, the overall calculations are time-based. The key is then to determine whether a minute is good or bad based on the incoming aggregation.
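The idea of classifying each minute as good or bad can be sketched as follows. This is a simplified illustration under stated assumptions, not Nobl9's implementation: the p95 values, the 50 ms threshold, the 90% target, and the function name are all hypothetical.

```python
def time_slices_reliability(points, threshold):
    """Return the fraction of time slices (minutes) classified as good.

    Assumption: a minute is good when its aggregated point is at or
    below the threshold.
    """
    if not points:
        raise ValueError("at least one time slice is required")
    good = sum(1 for point in points if point <= threshold)
    return good / len(points)


# Hypothetical aggregated SLI: one p95 latency point (ms) per minute.
p95_by_minute = [42, 48, 61, 45, 39, 75, 44, 47, 50, 43]

THRESHOLD_MS = 50  # separates good minutes from bad minutes
TARGET = 0.9       # hypothetical target: 90% of minutes must be good

reliability = time_slices_reliability(p95_by_minute, THRESHOLD_MS)
slo_met = reliability >= TARGET

print(f"good minutes: {reliability:.0%}, target met: {slo_met}")
```

Each minute contributes exactly one good or bad verdict, regardless of how many raw events the aggregation summarized.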