This guide presents a high-level overview of how SLOs are calculated in the Nobl9 platform. It is relevant to anyone who might want to dive deeper into Nobl9 SLOs and learn about:
Assumptions underlying SLI metrics (best practices/do’s and don’ts)
Assumptions underlying Threshold and Ratio metrics (aka Raw and Count metrics)
The ins and outs of error budget calculations
SLI Metrics - Assumptions
SLI metrics are two-dimensional data sets where value changes are distributed over time (see Image 1 below). This is a broad category, but there's a crucial caveat: SLI metrics can't be constructed from just any type of data.
Consider the following example. Suppose you choose the number of requests logged to your server per hour as your SLI metric. This might be a legitimate metric, but it’s not one that will tell you anything meaningful about the health of your service. It is just a piece of raw data about the traffic on your server per hour. You would not be able to measure the reliability of your service based on this type of input.
So, the most important thing about SLI metrics is that they must be meaningful. Beyond that, there are some other important rules and considerations to keep in mind; the following sections provide an overview.
It is crucial to remember that SLI metrics in Nobl9 are composed of real numbers. There are specific standards that these numbers must adhere to (e.g., the top limit of the range).
Nobl9 accepts metrics with three data types:
Whenever you send a Boolean metric to Nobl9, it will be treated as a 1 (if the value is true) or a 0 (if the value is false). You can leverage this knowledge when configuring SLOs of Threshold Metric type.
You can use whatever units are most appropriate, depending on your system/service and what you want to measure as an SLI metric. However, remember that the SLO thresholds must be defined in the same units as your metrics. For instance, if your SLI metric is defined as a decimal value, you must use the same unit when defining your SLO threshold (e.g., 0.05 instead of 5%).
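The Boolean mapping and the same-unit rule can be illustrated with a minimal Python sketch. The helper names `to_number` and `is_below` are hypothetical and are not part of any Nobl9 API; the sample values are made up for illustration.

```python
def to_number(value):
    """Map a raw sample to a real number; booleans become 1.0 or 0.0."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    return float(value)


def is_below(value, threshold):
    """Compare a sample with a threshold expressed in the same unit."""
    return to_number(value) < threshold


# Boolean samples, e.g. from a hypothetical "request failed" signal.
samples = [True, False, False, True]
numeric = [to_number(s) for s in samples]  # [1.0, 0.0, 0.0, 1.0]

# The metric is a decimal fraction, so the threshold must be 0.05, not 5.
error_rate = 0.08
same_unit = is_below(error_rate, 0.05)     # False: 8% is above 5%
mismatched_unit = is_below(error_rate, 5)  # True, but only because the units differ
```

The last line shows why mismatched units are dangerous: the comparison still runs, but every realistic sample passes and the threshold never fires.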
In Nobl9, SLI metrics are visualized as a continuous line that fluctuates over time.
However, this is just a simplified convention in Nobl9 to enhance the user experience. In reality, these metrics are discrete and finite collections of data points.
Consequently, metric visualizations in Nobl9 are only approximations of the original metric: they are not ideal reflections of this metric. When calculating SLOs, we assume these points are a good enough approximation of the actual metric.
Data Point Density
Another consequence of this approach is that it is vital to set an appropriate resolution (density) for the data points that aggregate into an SLI metric.
If the density of the data points is too low, Nobl9 will not be able to calculate the metric correctly. The Nobl9 server collects a maximum of 4 data points per minute for most integrations. This resolution is dense enough for correct approximations of SLI metrics. It is not necessary for the accurate approximation of the metric that the data points be distributed evenly, with the caveat described in the following section.
The following table summarizes the expected times for data point collection for different calculation methods of SLI metrics in Nobl9:
| Error budget calculation method | Rolling Time Window | Calendar-Aligned Time Window |
|---|---|---|
| Occurrences | ≤ 1 min | Does not matter |
| Time Slices | ≤ 1 min | ≤ 1 min |
Sparse Metrics and Metric Accuracy
If there is more than one “empty” minute between two received data points, the accuracy with which Nobl9 approximates your metric will be affected. To ensure that a sparse SLI metric is as accurate as possible, its constituent data points should be distributed as evenly as possible.
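As a rough way to check your own data for sparseness, the sketch below flags pairs of consecutive data points separated by more than one empty minute. It assumes that an "empty" minute means a whole clock minute containing no data point; that reading, and the function names, are assumptions made for this example rather than Nobl9's internal definition.

```python
from datetime import datetime, timezone


def empty_minutes_between(earlier, later):
    """Count whole clock minutes with no data point between two samples."""
    m1 = int(earlier.timestamp() // 60)
    m2 = int(later.timestamp() // 60)
    return max(m2 - m1 - 1, 0)


def sparse_gaps(timestamps):
    """Return consecutive pairs separated by more than one empty minute."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if empty_minutes_between(a, b) > 1]


def at(hour, minute, second):
    """Build a UTC timestamp on an arbitrary example date."""
    return datetime(2021, 1, 1, hour, minute, second, tzinfo=timezone.utc)


points = [at(1, 20, 0), at(1, 21, 30), at(1, 23, 5), at(1, 27, 10)]
gaps = sparse_gaps(points)  # only the 01:23:05 -> 01:27:10 pair is flagged
```

The pair 01:21:30 and 01:23:05 has exactly one empty minute (01:22) between them, so it is not flagged; the pair 01:23:05 and 01:27:10 has three.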
In Nobl9, there are two basic SLI metric types:
Threshold metrics (aka Raw metrics) operate based on one time series.
Ratio metrics (aka Count metrics) operate based on two time series: a count of good events and total events.
Threshold (Raw) Metrics
A Threshold metric consists of a single time series where a single value changes over time. Users can define one or more thresholds for this value, using the same units as the Threshold metric.
The threshold target is the lowest acceptable good/total ratio in a given time window for which an objective would be considered "met."
Example of Threshold Calculations
Let's assume you set the following objective: 90% of requests to my platform should take less than 100 ms.
For the Occurrences error budget calculation method (see below), this will be interpreted as "the response time of 90% of requests should be below 100 ms in a given time window."
For the Time Slices error budget calculation method, it is interpreted as "out of all minutes in a given time window, 90% of them should have a request latency less than 100 ms."
For a Threshold metric with the lt (less than) operator, each point below the set threshold value is labeled good (G), and each point above the threshold value is marked bad (B) (Image 2). With such a metric, we want to know the exact periods when our metric exceeds the threshold value (the areas marked in red in the image below).
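The labeling step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Nobl9's implementation: points are hypothetical `(timestamp, value)` pairs with timestamps in seconds, and a point exactly equal to the threshold is treated as bad because it is not strictly "less than" it.

```python
def label_points(points, threshold):
    """Label each (timestamp, value) pair 'G' or 'B' using the lt operator."""
    return [(t, "G" if v < threshold else "B") for t, v in points]


def bad_periods(labeled):
    """Group consecutive 'B' points into (first_bad, last_bad) periods."""
    periods = []
    start = None
    prev = None
    for t, label in labeled:
        if label == "B" and start is None:
            start = t  # a new bad period begins
        elif label == "G" and start is not None:
            periods.append((start, prev))  # period ended at the previous point
            start = None
        prev = t
    if start is not None:
        periods.append((start, prev))  # series ended while still bad
    return periods


# Latency samples in ms, one every 15 seconds; threshold is 100 ms.
points = [(0, 80), (15, 120), (30, 130), (45, 90), (60, 110)]
labeled = label_points(points, threshold=100)
periods = bad_periods(labeled)  # [(15, 30), (60, 60)]
```

Each returned pair marks the first and last bad point of one stretch above the threshold, which corresponds to one red area in the visualization.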
Ratio (Count) Metrics
For every Ratio metric in Nobl9, two data streams are required. Along with the stream of data representing the queried metric (the count of all queried events, represented by the red line in Image 4), Nobl9 receives a second stream of data simultaneously that indicates whether each data point was good or bad. It uses this data to create the count of good events (represented by the blue line in Image 4). Nobl9 then uses the second stream to calculate the error rate over time.
While it is theoretically possible for the good and total counts to be equal (a 1:1 correlation), the good count cannot exceed the total count of occurrences. Thus, the basic necessary condition for the Ratio metric is Good ≤ Total, where Good stands for the count of good events and Total stands for the count of total events.
The only situation where the count of good events could exceed the count of total events is as a result of a query error, where, for example, data is aggregated too dynamically. To avoid such situations, remember that your query must be:
Meaningful (i.e., tell you something meaningful about your service)
Idempotent (i.e., it can be applied multiple times without changing the result).
Users can provide input to these two streams for Nobl9 to use to calculate their SLOs (time above the threshold, or good to total occurrences ratio). Keep in mind that the good and total queries are arbitrary: it's your responsibility to define them in a meaningful way such that Good ≤ Total.
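Before relying on a pair of queries, it can help to check their results against the rule that the good count never exceeds the total count. The sketch below assumes each stream is available as a dictionary from timestamp to count; the function names and sample numbers are hypothetical.

```python
def find_violations(good, total):
    """Return timestamps at which the good count exceeds the total count."""
    return sorted(t for t in good if t in total and good[t] > total[t])


def good_ratio(good, total):
    """Ratio of good to total events over timestamps present in both streams."""
    shared = [t for t in good if t in total]
    total_sum = sum(total[t] for t in shared)
    if total_sum == 0:
        return None  # no events, so the ratio is undefined
    return sum(good[t] for t in shared) / total_sum


good = {"01:20": 95, "01:21": 130, "01:22": 45}
total = {"01:20": 100, "01:21": 130, "01:22": 40}

violations = find_violations(good, total)  # ["01:22"]: 45 good out of 40 total
```

A non-empty result points to a query problem, such as the two queries aggregating over different intervals, rather than to anything about the service itself.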
A typical example of a Ratio metric is a latency metric for server response: a histogram of good and total requests. Such a histogram is a graphical representation that organizes a group of data points into user-specified ranges (good and bad) with a signal line (threshold value).
In this example, a histogram bar is positive when the metric line is above the dotted threshold line and negative when the metric line is below the threshold line. An increasing histogram indicates an increase in upward momentum of requests, while a decreasing histogram signals downward momentum of requests.
Incremental and Non-Incremental Ratio Metrics
Incremental and non-incremental are two subtypes of the Ratio metric that depend on the method of counting data:
For the incremental method, we expect the value of a metric to be the current sum of some numerator.
For the non-incremental method, we expect it to be the components of the sum.
Example: Let's assume our metric is the number of requests from some HTTP server. If our data for the count of all requests looks like this:
2021-01-01 01:20:00 = 100
2021-01-01 01:21:00 = 230
2021-01-01 01:22:00 = 270
2021-01-01 01:23:00 = 330
with the values continuously increasing, it's a good indicator that this is an incremental Ratio metric.
If we have the same data in this form:
2021-01-01 01:21:00 = 130
2021-01-01 01:22:00 = 40
2021-01-01 01:23:00 = 60
where the values represent the components of the sum and are not continuously increasing, then the metric is non-incremental.
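The relationship between the two forms can be shown by differencing the incremental series from the example above. This is only a sketch of the arithmetic: the function name is hypothetical, and the code deliberately ignores counter resets, which a real pipeline would need to handle.

```python
def to_non_incremental(samples):
    """Turn cumulative (timestamp, value) samples into per-interval deltas."""
    return [(t2, v2 - v1) for (_, v1), (t2, v2) in zip(samples, samples[1:])]


incremental = [
    ("2021-01-01 01:20:00", 100),
    ("2021-01-01 01:21:00", 230),
    ("2021-01-01 01:22:00", 270),
    ("2021-01-01 01:23:00", 330),
]

non_incremental = to_non_incremental(incremental)
# [("2021-01-01 01:21:00", 130),
#  ("2021-01-01 01:22:00", 40),
#  ("2021-01-01 01:23:00", 60)]
```

The result has one point fewer than the input because the first cumulative sample has no predecessor to subtract.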
For details on how to configure your Ratio (Count) metric for the incremental or non-incremental method, check the YAML Guide.
Error Budget Calculations
Time windows are essential components of time-based error budget calculation methods. To calculate the error budget for your SLO, you need to determine the type of window that will suit your platform best and will provide a good representation of its reliability.
Rolling Time Windows
A rolling window moves (rolls) over time. For every rolling time window, the time is calculated as:
r(t) = begin: t - duration, end: t
Let's assume you’ve set a 30-day time window, and the data resolution in your SLI is 60 seconds. Nobl9 will update your error budget every 60 seconds, and as bad event observations expire beyond that 30-day window they will fall off and will no longer be included in the error budget calculations.
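The formula r(t) and the way old events fall off can be sketched as follows. Treating both window boundaries as inclusive is an assumption of this example, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rolling_window(t, duration):
    """r(t): the window begins at t - duration and ends at t."""
    return (t - duration, t)


def bad_events_in_window(bad_events, t, duration):
    """Keep only the bad events that still fall inside the window at time t."""
    begin, end = rolling_window(t, duration)
    return [e for e in bad_events if begin <= e <= end]


now = datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)
window = timedelta(days=30)
bad_events = [
    now - timedelta(days=31),    # already outside the window
    now - timedelta(days=29),    # inside now, expires within two days
    now - timedelta(minutes=5),  # recent
]

counted_now = bad_events_in_window(bad_events, now, window)
counted_later = bad_events_in_window(bad_events, now + timedelta(days=2), window)
```

Evaluated at `now`, two events count against the budget; two days later, the 29-day-old event has expired and only the recent one remains.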
When to use a rolling time window
Rolling time windows give you precise error budgeting for a fixed period of time. They allow you to answer the question, "How did we do in the last n-x days?"
Calendar-Aligned Time Windows
Calendar-aligned windows are bound to exact time points on a calendar. Each data point automatically falls into a fixed, consecutive time window. For example, instead of a 30-day rolling window, you can calculate your error budget starting at the beginning of the week, month, quarter, or even calendar year.
When to use calendar-aligned SLOs
Calendar-aligned time windows enable easier time reporting. This is a good option for monitoring large business metrics or services that are tied to the calendar (e.g., quarterly subscription plans).
The error budget for such time windows is restarted once each calendar window is concluded (e.g., at the end of each month). If there's an outage at the end of the calendar window, such an event will be omitted from the calculations for the new calendar window.
Likewise, an outage at the start of the calendar-aligned time window may consume the entire error budget for that period, and it won't recover until the end of the time window. This method contrasts with using a rolling time window, where you can recover your error budget as the bad points fall out of the window as it moves forward.
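To make the contrast with rolling windows concrete, the sketch below computes the calendar month that contains a given timestamp. The half-open interval (start included, end excluded), the use of UTC, and the function name are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def month_window(t):
    """Return (begin, end) of the calendar month containing t; end is exclusive."""
    begin = t.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if begin.month == 12:
        end = begin.replace(year=begin.year + 1, month=1)
    else:
        end = begin.replace(month=begin.month + 1)
    return begin, end


def in_window(t, window):
    """Check whether t falls inside a half-open (begin, end) window."""
    begin, end = window
    return begin <= t < end


outage = datetime(2021, 1, 31, 23, 50, tzinfo=timezone.utc)
january = month_window(outage)
february = month_window(datetime(2021, 2, 1, 0, 10, tzinfo=timezone.utc))

counts_in_january = in_window(outage, january)    # True
counts_in_february = in_window(outage, february)  # False: the budget has restarted
```

An outage ten minutes before the month ends consumes January's budget only; February starts with a full budget twenty minutes later.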
Occurrences and Time Slices
Nobl9 supports two methods of calculating error budgets:
Occurrences: This method counts good attempts against the count of total attempts. Since total attempts are fewer during low-traffic periods, it automatically adjusts to lower traffic volumes.
Time Slices: With this method, you have a set of intervals of equal length within a defined time period that can each be labeled as good or bad (e.g., good minutes vs. bad minutes). This error budget calculation method measures how many good minutes were achieved (minutes where the system operates within the boundaries defined by the SLI metric) compared to the total minutes in the time window. A bad minute that occurs during a low-traffic period (e.g., in the middle of the night for most of your users, when they are unlikely to notice a performance issue) will have the same effect on the SLO as a bad minute during peak traffic times.
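The difference between the two methods shows up clearly on a small made-up data set. In the sketch below, each minute is a hypothetical `(good, total)` pair with non-zero traffic, and a minute counts as good when its good/total ratio reaches `slice_target`; that parameter and the rule for labeling a minute are assumptions made for this example, not a description of Nobl9's exact algorithm.

```python
def occurrences_reliability(minutes):
    """Good attempts divided by total attempts across the whole window."""
    good = sum(g for g, _ in minutes)
    total = sum(t for _, t in minutes)
    return good / total


def time_slices_reliability(minutes, slice_target):
    """Good minutes divided by all minutes; assumes every minute has traffic."""
    good_minutes = sum(1 for g, t in minutes if g / t >= slice_target)
    return good_minutes / len(minutes)


# (good, total) per minute: three peak-traffic minutes and one quiet night minute.
minutes = [(990, 1000), (995, 1000), (5, 10), (1000, 1000)]

by_occurrences = occurrences_reliability(minutes)      # 2990 / 3010, about 0.993
by_time_slices = time_slices_reliability(minutes, 0.95)  # 3 / 4 = 0.75
```

The quiet minute with 5 failures out of 10 requests barely moves the Occurrences result, yet it costs a full quarter of the Time Slices result, exactly as a bad peak-traffic minute would.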