
SLI Analyzer use cases


In this article, we will go over three use cases of adjusting SLO targets using the SLI Analyzer for threshold and ratio metrics. Once we’ve adjusted our SLO targets, we can create a new SLO from our analysis.

Choosing the right reliability target is one of the most challenging parts of establishing meaningful SLOs. There is no one-size-fits-all answer, since a good reliability target depends on how the system has performed in the past.

SLI Analyzer can retrieve up to 30 days of historical data and lets you try out different reliability targets without creating a full-fledged SLO.

Tools to help you determine the desired reliability settings include the following:

  1. Statistical data displayed at the top of your analysis tab (e.g., Min, Max, StdDev).
  2. The SLI values distribution chart that shows the frequency distribution of the data points in the given time window.
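To make it concrete what these tools summarize, here is a minimal sketch that computes the same kind of Min, Max, Mean, StdDev, and percentile values from a handful of hypothetical raw SLI data points. The data and the helper function are illustrative only; the analyzer derives all of this for you from the imported metric.

```python
import math
import statistics

# Hypothetical raw SLI data points (server response times in seconds);
# in practice, these come from the metric imported by the analyzer.
sli_values = [0.16, 0.19, 0.22, 0.25, 0.31, 0.58, 1.4, 16.9]

def percentile(values, p):
    """Nearest-rank percentile, a simplification of what the analysis tab shows."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

print("Min:   ", min(sli_values))
print("Max:   ", max(sli_values))
print("Mean:  ", round(statistics.mean(sli_values), 3))
print("StdDev:", round(statistics.stdev(sli_values), 3))
print("p99:   ", percentile(sli_values, 99))
```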

Understanding the historical performance of a service

Let’s assume that we’re an SRE responsible for our application's infrastructure, and we’d like to configure several SLOs on the service to ensure it meets reasonable response time targets.

To understand the historical performance of our system and pick relevant targets before creating SLOs, we use the SLI Analyzer.

We go to the SLI Analyzer section on the Nobl9 Web. On the SLI analysis page, we select Datadog as the data source.

In these examples, we're using Datadog, but the overarching concepts are similar for all other observability platforms. In every example, we start by configuring the data to be imported.

Use cases

In this example, we're assessing the average time the server takes to respond to the client. For this, we analyze a Datadog threshold metric query.

Configure the analysis

We set a 14-day graph time window to calculate the error budget for two weeks.

configure import
Configuring data import
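Under the Occurrences budgeting method we use later in this example, the size of the error budget follows from the reliability target and the number of data points in the window. A rough, hypothetical illustration, assuming one data point per minute over the 14 days (the actual resolution depends on the query):

```python
# Hypothetical sizing of a 14-day error budget under the Occurrences method,
# assuming one data point per minute (actual resolution depends on the query).
points_per_day = 24 * 60
total_points = 14 * points_per_day          # 20,160 data points in the window
target = 0.99                               # 99% reliability target

allowed_bad = (1 - target) * total_points   # error budget expressed in data points
print(f"Total points: {total_points}, allowed bad points: {allowed_bad:.0f}")
```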

Assess raw data

After a successful import, we can assess the statistical data and percentile values, and view them visualized in the charts:

raw import data
Complete data import

Since the SLI value distribution is wide, we switch to the logarithmic scale for more meaningful insight:

log scale
SLI values distribution—logarithmic scale

According to the statistical data, our application’s server sometimes responds quickly, at around 0.16s (the Min value). At other times, it can take almost 17 seconds to send a response (the Max value):

statistical data threshold
Leveraging statistical data

We can also note that the Mean value is around 0.25s, a reasonable average response time. Next, we look at the percentile values and see that the p99 value is slightly below 0.6s. Based on the 99th percentile, we set the error budget calculation method to Occurrences with a threshold value of less than 0.6s:

target 0.6
Configuring analysis—Occurrences
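The Occurrences method counts each collected data point as either good or bad against the threshold, and the SLI is the share of good points. A minimal sketch with hypothetical response-time samples (the analyzer does this over all imported data points):

```python
# Hypothetical response-time samples in seconds; in practice these are
# the data points imported from the Datadog query.
samples = [0.18, 0.24, 0.31, 0.55, 0.72, 0.26, 0.19, 2.3, 0.41, 0.58]

threshold = 0.6   # "less than 0.6s" counts as a good value
target = 0.99     # 99% reliability target

good = sum(1 for s in samples if s < threshold)
total = len(samples)
sli = good / total

print(f"Good values: {good}/{total}, SLI: {sli:.1%}, target: {target:.0%}")
```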

After analysis, we can see the Reliability burn down chart along with the analysis results:

burn down chart
Analysis complete—Reliability burn down

According to the analysis results, the percentage of good values is above 99% (99.103%), and we have more than 10% of the error budget remaining (10.291%).
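These two figures are consistent: with a 99% target, 1% of occurrences are allowed to be bad, and about 0.897% actually were, which leaves roughly 10% of the error budget. A quick check of that arithmetic, using the numbers from the analysis results above:

```python
good_ratio = 0.99103   # percentage of good values reported by the analysis
target = 0.99          # 99% reliability target

budget_burned = (1 - good_ratio) / (1 - target)  # share of allowed bad values already used
budget_remaining = 1 - budget_burned

print(f"Error budget remaining: {budget_remaining:.1%}")  # ~10.3%, in line with 10.291%
```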

Adjust target

The SLI values distribution chart reveals that we can take into account only a portion of the values:

0.6 threshold
Data for 0.6 target values

So, let's try a stricter threshold.

Changing the target value to 0.5 exhausts the error budget, and we also fall slightly short of the 99% target:

0.5 threshold
Adjusting metrics

In conclusion, we consider 0.58s to be an acceptable value for the 99th percentile. Changing the threshold to less than 0.58s allows us to stay within the error budget over the entire 14-day time window:

0.58 threshold
Adjusting metrics continued
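Using the same hypothetical samples as in the earlier sketch, the effect of tightening the threshold shows up directly as a change in the share of good values; this is roughly what the analyzer recalculates each time we adjust the target:

```python
# Same hypothetical response-time samples (seconds) as in the earlier sketch.
samples = [0.18, 0.24, 0.31, 0.55, 0.72, 0.26, 0.19, 2.3, 0.41, 0.58]

# Recompute the SLI for each candidate threshold we tried in the analyzer.
for threshold in (0.6, 0.5, 0.58):
    good = sum(1 for s in samples if s < threshold)
    print(f"threshold {threshold}: SLI {good / len(samples):.1%}")
```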

Happy with the outcome, we go ahead and create a new SLO from our analysis.

Query reference

In our examples, we used the following queries:

  • Latency analysis: the threshold metric: avg:trace.http.request.duration{*}
  • Successful request analysis: the ratio metric (see the sketch after this list).
    • Good query: sum:trace.http.request.hits{http.status_code:200}.as_count()
    • Total query: sum:trace.http.request.hits{*}.as_count()
  • Post-incident analysis: the threshold metric: avg:trace.http.request.duration{*}
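For the ratio metric listed above, the SLI in each window is simply the result of the good query divided by the result of the total query. A minimal sketch with hypothetical counts standing in for the two Datadog query results:

```python
# Hypothetical counts standing in for the two Datadog query results.
good_hits = 98_750    # sum:trace.http.request.hits{http.status_code:200}.as_count()
total_hits = 99_900   # sum:trace.http.request.hits{*}.as_count()

ratio_sli = good_hits / total_hits
print(f"Ratio SLI: {ratio_sli:.3%}")  # share of successful (HTTP 200) requests
```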
For a more in-depth look, consult additional resources: