Threshold and ratio metrics with SLI Analyzer
Choosing the correct reliability target is the greatest challenge in creating meaningful SLOs for your system. There is no one-size-fits-all solution for good SLOs: reliability targets depend on how the system performed in the past.
SLI Analyzer can retrieve up to 30 days of historical data and lets you try out different reliability targets without creating a full SLO.
Tools to help you determine the desired reliability settings include:
- Statistical data displayed at the top of your analysis tab (e.g., Min, Max, StdDev)
- The SLI values distribution chart that shows the frequency distribution of the data points in the given time window.
In this article, we will go over two use cases of adjusting SLO targets using the SLI Analyzer for Threshold and Ratio metrics. Once we’ve adjusted our SLO targets using SLI Analyzer, we can easily create a new SLO from our analysis.
Understanding the historical performance of a service
Let’s assume that we’re an SRE responsible for our application's infrastructure, and we’d like to configure several SLOs on the service to ensure it meets reasonable response time targets.
We decide to use the SLI Analyzer to better understand the historical performance of our system and pick relevant targets.
Use case 1: Analyzing latency (threshold)
Starting the analysis: configure data import
In this example, we're using Datadog, but the overarching concepts are similar for all other observability platforms. We open the SLI Analyzer tab to create a new analysis. Once the SLI Analysis tab has opened, we select a Datadog data source and choose a threshold metric query that returns the server's average response time.
Next, we enter a relevant query (avg:trace.http.request.duration{*}) and select a 14-day Graph Time Window since we'd like to calculate the Error budget for the last two weeks:
Once we’ve configured all fields in the import data step, we click the Import Data button and wait for the data to be fetched to begin our analysis:
Analyze metric data
Once the data has been imported, we can have a look at the Min, Max, Range, and percentile values in the SLI Analysis tab:
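To make these summary statistics concrete, here is a minimal Python sketch that computes comparable values from a list of latency samples. The summarize helper and the sample values are hypothetical and only illustrate the idea; SLI Analyzer may calculate its percentiles and standard deviation differently.

```python
import statistics


def summarize(values):
    """Return summary statistics comparable to those in the analysis tab."""
    ordered = sorted(values)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "mean": statistics.fmean(ordered),
        "stddev": statistics.pstdev(ordered),
        "p50": cuts[49],
        "p99": cuts[98],
    }


# Hypothetical latency samples in seconds (not real measurements).
samples = [0.17, 0.20, 0.22, 0.25, 0.30, 0.45, 0.58, 17.0]
stats = summarize(samples)
print(stats)
```

A single slow outlier pulls Max and Mean up sharply while the median barely moves, which is why the percentiles are usually a better starting point for a threshold than the mean.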
We can also check the SLI values distribution chart, where we see two buckets of values to the left of the chart:
Since the value distribution is wide, we can switch the display mode from Linear to Logarithmic scale to get more meaningful insight into the distribution of our SLI values:
Returning to the statistical data, we can see that our application’s server sometimes responds quickly, in around 0.17s (Min: 0.16795). At other times, it can take just over 17 seconds (Max: 17.18875) to send a response:
We can also note that the Mean value is around 0.25s, which is a reasonable average response time.
Next, we look at the percentile values and see that the p99 value is slightly below 0.6s.
Using the value of the 99th percentile, we set the error budget calculation method to Occurrences with a threshold value of less than 0.6s:
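The arithmetic behind an occurrences-style calculation can be sketched roughly as follows. This is a simplified illustration, assuming every data point counts as one occurrence and a point is good when it is below the threshold; the function name and the sample data are hypothetical, and the product's exact calculation may differ.

```python
def occurrences_budget(values, threshold, target):
    """Return (reliability, remaining_budget_fraction) for a threshold SLI.

    Assumes 0 < target < 1 and a non-empty list of values.
    """
    total = len(values)
    good = sum(1 for v in values if v < threshold)
    bad = total - good
    reliability = good / total
    # The error budget is the number of bad points the target tolerates.
    allowed_bad = (1 - target) * total
    remaining = 1 - bad / allowed_bad
    return reliability, remaining


# Hypothetical data: 991 fast responses and 9 slow ones, in seconds.
values = [0.25] * 991 + [1.0] * 9
reliability, remaining = occurrences_budget(values, threshold=0.6, target=0.99)
print(f"reliability={reliability:.3%}, budget remaining={remaining:.0%}")
```

A negative remaining fraction means the budget was exhausted for the window.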
We click the Analyze button and wait for the calculations to finish. Once they’ve been processed, we can have a look at the Reliability Burn Down Chart and the result of our analysis:
Adjust metric data
In the right-hand side panel of the successfully created SLI analysis, we can see that the percentage of good values is more than 99% and that we still have some error budget remaining (10%). Looking at the SLI values distribution chart, we see that this target accounts for only part of the long tail:
This makes us wonder if we could set an even stricter threshold.
We can change our threshold value to 0.5 and see how it will affect our error budget. Once the analysis has been completed, we observe that we’ve run out of our error budget, and we’re slightly below the agreed 99% target:
Given that, and recalling that the value for the 99th percentile is around 0.58, we change the threshold to less than 0.58, which leaves us with almost no error budget remaining for the selected Graph Time Window:
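The trial-and-error above amounts to comparing several candidate thresholds against the same historical data. A small sketch of that comparison, using hypothetical helper names and made-up sample data rather than the values from this analysis:

```python
def good_fraction(values, threshold):
    """Share of data points strictly below the threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


def compare_thresholds(values, thresholds, target):
    """Return (threshold, good_fraction, meets_target) for each candidate."""
    results = []
    for t in thresholds:
        fraction = good_fraction(values, t)
        results.append((t, fraction, fraction >= target))
    return results


# Hypothetical latencies in seconds: mostly fast, with a short tail.
values = [0.3] * 98 + [0.55, 0.9]
results = compare_thresholds(values, thresholds=[0.5, 0.58, 0.6], target=0.99)
for threshold, fraction, ok in results:
    print(f"< {threshold}s: {fraction:.2%} good, meets target: {ok}")
```

With this sample data, the 0.5 threshold falls short of the 99% target, while 0.58 and 0.6 just meet it, which mirrors the kind of trade-off explored in the analysis.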
Create SLO
Now that we’re content with the last adjustment, we can leverage SLI Analyzer and create a new SLO from our analysis. We can do that by clicking the Create SLO button, which will take us to the SLO wizard with predefined data based on the analysis.
Use case 2: Analyzing successful requests (ratio metric)
SLI Analyzer also allows us to try different reliability targets for ratio metrics. In the second example, we will use a Datadog data source with a ratio metric query that returns the ratio of HTTP 200 requests to our application’s server to the total number of requests:
- Good query: sum:trace.http.request.hits{http.status_code:200}.as_count()
- Total query: sum:trace.http.request.hits{*}.as_count()
This time, we will use a 30-day Graph Time Window to scrutinize the performance of our system over the last month:
Analyze ratio metric data
Once the data import has been completed, we can see the statistical values, the SLI chart, and the SLI values distribution chart for the good and total metric values we selected:
To adjust your Ratio metric, focus on the value of the target (for the Occurrences method) for your metric or the values of the Time slice allowance and the target (for the Time Slices method).
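The difference between the two methods can be illustrated with a short sketch. It assumes that the Occurrences method divides all good events by all events, and that the Time Slices method counts a slice as good when its own good-to-total ratio reaches the allowance. Both function names and the sample counts are hypothetical, and the product's actual rules may be more detailed.

```python
def occurrences_ratio(slices):
    """Good events divided by total events across the whole window."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total


def time_slices_ratio(slices, allowance):
    """Share of slices whose own good/total ratio meets the allowance."""
    evaluated = [(g, t) for g, t in slices if t > 0]  # skip empty slices
    good_slices = sum(1 for g, t in evaluated if g / t >= allowance)
    return good_slices / len(evaluated)


# Hypothetical (good, total) request counts per time slice.
slices = [(95, 100), (80, 100), (100, 100), (90, 100)]
by_occurrences = occurrences_ratio(slices)
by_time_slices = time_slices_ratio(slices, allowance=0.9)
print(by_occurrences, by_time_slices)
```

Here the same data yields 91.25% by occurrences but only 75% by time slices, so the two methods can call for quite different targets.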
Adjust target
With that in mind, we can now adjust our target. Let’s assume an ambitious goal and check how the Error budget for our SLO will look with a target of 95% good requests. We hit the Analyze button, and we see that it isn’t a suitable target for this SLO since the percentage of good values hovers around 92.8%, and with this target, we would end up with over 15h of burned budget:
Considering that, let’s set our target slightly below the percentage of good values in this time window and see what our error budget would look like. We thus set the target to 92.7%, and we click the Analyze button again:
With this target, our SLO would be slightly on the “green,” with more than an hour of Error budget remaining.
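As a rough cross-check of figures like these, the error budget can be approximated as a share of the time window. The sketch below is a back-of-the-envelope estimate under that assumption, with a hypothetical function name; the tool's own calculation may differ.

```python
def remaining_budget_hours(good_ratio, target, window_days):
    """Approximate remaining error budget in hours (negative means overrun)."""
    window_hours = window_days * 24
    budget_hours = (1 - target) * window_hours      # what the target allows
    burned_hours = (1 - good_ratio) * window_hours  # what was actually bad
    return budget_hours - burned_hours


# Roughly 92.8% good values against a 95% target over a 30-day window.
overrun = remaining_budget_hours(good_ratio=0.928, target=0.95, window_days=30)
print(f"{overrun:.2f} hours")
```

With a good ratio of about 92.8% and a 95% target over 30 days, this arithmetic gives an overrun of roughly 15.8 hours. Small changes in the good ratio shift the result noticeably, so a target set just below the observed ratio leaves only a thin margin.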
We can tighten the target further, setting it to 92.8%, with the following result:
Create SLO
With this result, we can create an SLO with the targets adjusted in this analysis.