Threshold and ratio metrics with SLI Analyzer

Choosing the correct reliability target is the greatest challenge in creating meaningful SLOs for your system. There is no one-size-fits-all solution for good SLOs: reliability targets depend on how the system performed in the past.

SLI Analyzer can retrieve up to 30 days of historical data and lets you try out different reliability targets without creating a full SLO.

Tools to help you determine the desired reliability settings include:

  1. Statistical data displayed at the top of your analysis tab (e.g., Min, Max, StdDev)

  2. The SLI values distribution chart that shows the frequency distribution of the data points in the given time window.
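As a rough illustration of the statistics listed above, the following sketch computes comparable values from a list of raw data points. The sample values, the `summarize` helper, and the nearest-rank percentile method are assumptions made for this example; SLI Analyzer may calculate its statistics differently.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (p in 0..100)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(values):
    """Return summary statistics similar to those shown in an analysis."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "mean": statistics.fmean(ordered),
        "stddev": statistics.pstdev(ordered),  # population standard deviation
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
    }


# Hypothetical response times in seconds, not data from this article.
samples = [0.17, 0.21, 0.22, 0.25, 0.26, 0.31, 0.58, 17.19]
stats = summarize(samples)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With so few points, a single outlier dominates Max, Mean, and StdDev, which is why the percentile values are usually a better starting point for a threshold.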

In this article, we will go over two use cases of adjusting SLO targets using the SLI Analyzer for Threshold and Ratio metrics. Once we’ve adjusted our SLO targets using SLI Analyzer, we can easily create a new SLO from our analysis.

Understanding the historical performance of a service

Let’s assume that we’re an SRE responsible for our application's infrastructure, and we’d like to configure several SLOs on the service to ensure it meets reasonable response time targets.

We decide to use the SLI Analyzer to better understand the historical performance of our system and pick relevant targets.

Use case 1: Analyzing latency (threshold)

Starting the analysis: configure data import

In this example, we're using Datadog, but the overarching concepts are similar for other observability platforms. We open the SLI Analyzer tab to create a new analysis. Once the SLI Analysis tab has opened, we select a Datadog data source and choose a threshold metric with a query that returns the server's average response time to client requests.

Next, we enter a relevant query (avg:trace.http.request.duration{*}), and select a 14-day Graph Time Window since we'd like to calculate the Error budget for the last two weeks:

Image 1: Configuring data import

Once we’ve configured all fields in the import data step, we click the Import Data button and wait for the data to be fetched to begin our analysis:

Image 2: Importing data

Analyze metric data

Once the data has been imported, we can have a look at the Min, Max, Range, and percentile values in the SLI Analysis tab:

Image 3: Complete data import

We can also check the SLI values distribution chart, where we see two buckets of values on the left side of the chart:

Image 4: SLI values distribution - linear scale

Since the distribution of values is wide, we can switch the display mode from Linear to Logarithmic scale to gain more meaningful insight into the distribution of our SLI values:

Image 5: SLI values distribution - logarithmic scale
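To see why a logarithmic scale helps with a long-tailed distribution, consider the sketch below. It assumes that the Logarithmic mode rescales the frequency axis of the chart, which this article does not state explicitly; the sample data, bucket width, and `histogram` helper are hypothetical.

```python
import math


def histogram(values, bucket_width):
    """Count values in equal-width buckets, keyed by bucket index."""
    counts = {}
    for v in values:
        index = int(v // bucket_width)
        counts[index] = counts.get(index, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical latencies: most are fast, with a thin long tail.
samples = [0.2] * 950 + [0.7] * 40 + [3.2] * 9 + [17.19]
hist = histogram(samples, bucket_width=0.5)

for index, count in hist.items():
    low = index * 0.5
    # On a linear axis the tail buckets are tiny next to the first one;
    # log10 compresses the difference so they remain visible.
    print(f"[{low:.1f}, {low + 0.5:.1f})  count={count:4d}  log10={math.log10(count):.2f}")
```

On the linear axis, the tallest bucket is 950 times higher than the smallest one, while on the logarithmic axis the two differ by fewer than three units.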

Returning to the statistical data, we can see that sometimes our application’s server responds quickly, in around 0.17s (Min: 0.16795). At other times, it can take just over 17 seconds (Max: 17.18875) to send a response:

Image 6: Leveraging statistical data

We can also note that the Mean value is around 0.25s, which is a reasonable average response time. Next, we look at the percentile values and see that the p99 value is slightly below 0.6s. Using the value of the 99th percentile, we set the error budget calculation method to Occurrences with a threshold value of less than 0.6s:

Image 7: Configuring analysis - Occurrences method
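The idea behind the Occurrences method can be sketched as counting good data points against all data points. The example below is a simplified illustration, not SLI Analyzer's actual implementation: the `occurrences_analysis` helper, the 99% target, and the sample data are assumptions made for this sketch.

```python
def occurrences_analysis(values, threshold, target):
    """Evaluate a 'less than threshold' objective by counting data points.

    target is a fraction, for example 0.99 for 99%.
    """
    if not 0 < target < 1:
        raise ValueError("target must be a fraction between 0 and 1")
    total = len(values)
    good = sum(1 for v in values if v < threshold)
    bad = total - good
    allowed_bad = (1 - target) * total  # error budget, in data points
    return {
        "good_ratio": good / total,
        "budget_remaining": 1 - bad / allowed_bad,
    }


# Hypothetical data: 991 fast responses and 9 slow ones.
samples = [0.25] * 991 + [1.0] * 9
result = occurrences_analysis(samples, threshold=0.6, target=0.99)
print(f"good: {result['good_ratio']:.1%}")
print(f"budget remaining: {result['budget_remaining']:.0%}")
```

In this made-up data set, 9 of the 10 allowed bad points are used up, so roughly a tenth of the error budget remains.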

We click the Analyze button and wait for the calculations to finish. Once they’ve been processed, we can have a look at the Reliability Burn Down Chart and the result of our analysis:

Image 8: Analysis complete - Reliability Burn Down chart

Adjust metric data

In the right-hand panel of the completed SLI analysis, we can see that the percentage of good values is above 99% and that part of the error budget (10%) still remains. Looking at the SLI values distribution chart, we see that with this target we account for only part of the long tail:

Image 9: Data for 0.6 target values

This makes us wonder if we could set an even stricter threshold.

We can change our target value to 0.5 and see how it affects our error budget. Once the analysis has completed, we observe that we’ve run out of error budget and that we’re slightly below the agreed 99% target:

Image 10: Adjusting metrics

Given this result, and recalling that the value of the 99th percentile is around 0.58, we change the threshold to less than 0.58, which leaves us with almost no error budget remaining for the selected Graph Time Window:

Image 11: Adjusting metrics continued
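The same trial-and-error adjustment can be sketched as a sweep over candidate thresholds. The data below is hypothetical and was chosen only to mirror the pattern described above (0.5 misses a 99% target, 0.58 barely meets it, 0.6 leaves some margin); `sweep_thresholds` is an illustrative helper, not part of any product API.

```python
def sweep_thresholds(values, thresholds, target):
    """For each threshold, report good points and whether the target is met."""
    rows = []
    for threshold in thresholds:
        good = sum(1 for v in values if v < threshold)
        good_ratio = good / len(values)
        rows.append((threshold, good, good_ratio, good_ratio >= target))
    return rows


# Hypothetical latencies in seconds: a fast majority and a short tail.
samples = [0.25] * 985 + [0.55] * 5 + [0.59] + [1.0] * 9
rows = sweep_thresholds(samples, thresholds=[0.5, 0.58, 0.6], target=0.99)

for threshold, good, good_ratio, meets_target in rows:
    status = "meets target" if meets_target else "misses target"
    print(f"< {threshold}s: {good} good points ({good_ratio:.1%}), {status}")
```

Sweeping a handful of values close to the observed p99 shows quickly where the error budget moves from exhausted to comfortable.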

Create SLO

Now that we’re content with the last adjustment, we can leverage SLI Analyzer and create a new SLO from our analysis. We can do that by clicking the Create SLO button, which will take us to the SLO wizard with predefined data based on the analysis.

Use case 2: Analyzing successful requests (ratio metric)

SLI Analyzer also allows us to try different reliability targets for ratio metrics. In the second example, we will use a Datadog data source with a ratio metric query that returns the ratio of HTTP 200 requests to our application’s server to the total number of requests:

  • Good query: sum:trace.http.request.hits{http.status_code:200}.as_count()
  • Total query: sum:trace.http.request.hits{*}.as_count()
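Conceptually, a ratio metric divides the good count by the total count over the same period. The sketch below shows that arithmetic on hypothetical per-interval counts; the numbers and the `ratio_sli` helper are assumptions for illustration and do not call the Datadog API.

```python
def ratio_sli(good_counts, total_counts):
    """Overall share of good requests across aligned time intervals."""
    if len(good_counts) != len(total_counts):
        raise ValueError("good and total series must have the same length")
    total = sum(total_counts)
    if total == 0:
        return None  # no traffic, so the ratio is undefined
    return sum(good_counts) / total


# Hypothetical counts per interval: HTTP 200 responses vs. all requests.
good = [928, 930, 926]
total = [1000, 1000, 1000]
sli = ratio_sli(good, total)
print(f"good requests: {sli:.1%}")
```

Summing both series before dividing weights each interval by its traffic, so quiet periods do not distort the result.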

This time, we will use a 30-day Graph Time Window to scrutinize the performance of our system over the last month:

Image 12: Configuring data import for a Ratio metric

Analyze ratio metric data

Once the data import has completed, we can see the statistical values, the SLI chart, and the SLI values distribution chart for the good and total metric values we selected:

Image 13: Beginning an SLI analysis for a Ratio metric

To adjust your Ratio metric, focus on the target value (for the Occurrences method) or on the Time slice allowance and target values (for the Time Slices method).
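The difference between the two methods can be illustrated with a sketch of the Time Slices idea. It assumes that a slice counts as good when its good-to-total ratio meets the Time slice allowance, and that the target is then compared against the share of good slices; this reading, the `time_slices_reliability` helper, and the sample counts are assumptions rather than a description of the product's internals.

```python
def time_slices_reliability(good_counts, total_counts, allowance):
    """Share of time slices whose good/total ratio meets the allowance."""
    good_slices = 0
    evaluated = 0
    for good, total in zip(good_counts, total_counts):
        if total == 0:
            continue  # assumption: slices without traffic are skipped
        evaluated += 1
        if good / total >= allowance:
            good_slices += 1
    return good_slices / evaluated if evaluated else None


# Hypothetical counts for four slices of equal length.
good = [95, 99, 80, 100]
total = [100, 100, 100, 100]
reliability = time_slices_reliability(good, total, allowance=0.9)
print(f"good slices: {reliability:.0%}")
```

Here one slice out of four falls below the 90% allowance, so the share of good slices is compared against the target instead of the raw request ratio.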

Adjust target

With that in mind, we can now adjust our target. Let’s assume an ambitious goal and check how the Error budget for our SLO will look with a target of 95% good requests. We hit the Analyze button and see that it isn’t a suitable target for this SLO: the percentage of good values hovers around 92.8%, and with this target, we would end up with over 15h of burned budget:

Image 14: SLI analysis for a 95% target

Considering that, let’s set our target slightly below the percentage of good values in this time window and see what our error budget would look like. We thus set the target to 92.7%, and we click the Analyze button again:

Image 15: SLI analysis for a 92.7% target

With this target, our SLO would be just in the “green,” with more than an hour of Error budget remaining.

We can tighten the target further, setting it to 92.8%, with the following result:

Image 16: SLI analysis for a 92.8% target
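The relationship between the target and the remaining budget can be approximated with simple arithmetic. The sketch below assumes the error budget is expressed in time as the window length multiplied by the allowed failure fraction, and it uses a hypothetical good-request share of 92.85% (the article only says the value is around 92.8%); the exact figures shown in SLI Analyzer may be derived differently.

```python
def remaining_budget_hours(window_days, target, good_ratio):
    """Approximate remaining error budget in hours (negative if overspent)."""
    window_hours = window_days * 24
    total_budget = (1 - target) * window_hours  # allowed "bad" time
    burned = (1 - good_ratio) * window_hours    # observed "bad" time
    return total_budget - burned


GOOD_RATIO = 0.9285  # hypothetical value, assumed for this example
for target in (0.95, 0.927, 0.928):
    hours = remaining_budget_hours(30, target, GOOD_RATIO)
    print(f"target {target:.1%}: {hours:+.2f}h of error budget remaining")
```

Under these assumptions, a 95% target overspends the budget by roughly 15 hours, while a target just below the observed share of good requests leaves about an hour of budget.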

Create SLO

With this result, we can create an SLO with the targets adjusted in this analysis.