Use Cases of SLI Analyzer (part 2) - Post-Incident Target Adjustment
SLI Analyzer can also help when you conduct post-mortems on your SLOs after major incidents, letting you readjust your SLO's targets and thresholds.
In this use case, we use a Datadog threshold query covering the period when our system had an incident. In our first attempt, we'll examine the values for our SLO over the last 30 days.
Attempt 1: Analyzing Incident
Looking at a 30-day Time Window
We'd like to readjust our SLO since a recent incident heavily affected its error budget. First, we set up the data import by choosing a relevant data source, entering a query, and selecting a 30-day Graph Time Window:

We click the Import Data button and wait for the results.
Now that we've imported our data, we can examine the statistical values visible in the SLI Analyzer tab:

Upon initial examination of our data, we decide that we'd like to see our error budget for a target where 99% of the values in the distribution fall below the 95th percentile:

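To make the idea of using a percentile as a threshold more concrete, here is a minimal sketch that computes the 95th percentile of a set of latency values and the share of values at or below it. The sample data and the helper names `percentile` and `share_at_or_below` are hypothetical, and the nearest-rank method is only one of several percentile definitions; SLI Analyzer performs its own calculation in the UI.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def share_at_or_below(values, threshold):
    """Fraction of samples that meet the threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Hypothetical latency samples in seconds, with two incident-like outliers
latencies = [
    0.012, 0.014, 0.013, 0.015, 0.011, 0.016, 0.014, 0.013, 0.020, 0.018,
    0.012, 0.015, 0.014, 0.013, 0.017, 0.019, 0.014, 0.012, 0.760, 1.690,
]

threshold = percentile(latencies, 95)
good_share = share_at_or_below(latencies, threshold)
print(f"p95 threshold: {threshold}s, share at or below: {good_share:.0%}")
```

With a long tail like this one, a small number of extreme values can pull the upper percentiles far away from the typical latency.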
When the Analysis is completed, we take a look at our burn down chart and error budget values:

The logarithmic scale view mode allows us to see a long tail of aggregated high latency values:

We can now estimate the impact of the incident. Even with a generous margin for values, our reliability burn is in the red. We're left with almost -20 hours of error budget, and the total time of bad values is more than 1 day.
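The arithmetic behind a negative error budget can be sketched as follows. This is an illustration only: it assumes a time-based budget, a 99% target, and a hypothetical 27 hours of bad values (the analysis above only tells us that the bad time exceeded 1 day). The function names are ours, not part of SLI Analyzer.

```python
def error_budget_hours(window_days, target):
    """Total allowed bad time, in hours, for a time window and a target (e.g. 0.99)."""
    return window_days * 24 * (1 - target)

def remaining_budget_hours(window_days, target, bad_hours):
    """Budget left after subtracting the observed bad time; negative means overspent."""
    return error_budget_hours(window_days, target) - bad_hours

# Assumed inputs: 30-day window, 99% target, 27 hours of bad values
budget = error_budget_hours(30, 0.99)
remaining = remaining_budget_hours(30, 0.99, bad_hours=27.0)
print(f"budget: {budget:.1f}h, remaining: {remaining:.1f}h")
```

Under these assumptions, the full budget is about 7.2 hours, so a little over a day of bad values leaves the budget overspent by roughly 20 hours.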
Our post-mortem investigation showed that an external server caused the incident. It was a one-time event that's unlikely to recur. We therefore decide to exclude the incident from our SLO by narrowing the Graph Time Window down to 14 days and checking whether that allows us to regain the error budget.
Attempt 2: Readjusting Targets
Narrowing Down Time Window
We need to create a new analysis to see and compare how the error budget is burned for a narrower Graph Time Window. We open a new SLI Analysis tab, select the relevant data source, and enter the same query as above, but we choose the Last 14 days in the Graph Time Window picker:

We click the Import Data button and wait until SLI Analyzer has completed the import. We can see the following raw data:

Before we adjust our target, we first want to observe the statistical data and the SLI chart.
We see that the Mean value is around 14ms, with a Max value of 68ms for the selected time window. This already differs considerably from the 30-day time window analysis, where the Max value is 1.69s. We also observe that the value for the 99th percentile is around 0.26s (while it was 0.76s in the previous example). We decide to use this value to experiment with the target for our new SLO. Upon recalculation, we observe that excluding the incident allowed us to regain the entire error budget, and we're left with a healthy margin of 28m 30s (which constitutes around 14% of our budget):

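As a quick sanity check on the remaining margin, the sketch below expresses 28m 30s as a share of the total budget. It assumes a time-based budget and a 99% target over the 14-day window; neither is stated explicitly for this analysis, and the function names are hypothetical.

```python
def error_budget_minutes(window_days, target):
    """Total allowed bad time, in minutes, for a time window and a target (e.g. 0.99)."""
    return window_days * 24 * 60 * (1 - target)

def remaining_share(remaining_minutes, window_days, target):
    """Remaining budget as a fraction of the total budget."""
    return remaining_minutes / error_budget_minutes(window_days, target)

# Assumed inputs: 14-day window, 99% target; 28m 30s remaining, as reported above
total = error_budget_minutes(14, 0.99)
share = remaining_share(28.5, 14, 0.99)
print(f"total budget: {total:.1f} min, remaining share: {share:.0%}")
```

Under these assumptions, the total budget is about 201.6 minutes, and 28.5 minutes comes to roughly 14% of it, which agrees with the figure shown in the analysis.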
Creating a New SLO
Happy with the outcome, we go ahead and create a new SLO from our analysis.