Use Cases of SLI Analyzer (part 2) - Post-Incident Target Adjustment
SLI Analyzer can also help when you conduct post-mortems on your SLOs after major incidents, letting you readjust your SLO’s targets and thresholds.
In this use case, we will use a Datadog threshold query from the time that our system had an incident. In our first attempt, we’ll examine the values for our SLO over the last 30 days.
Attempt 1: Analyzing Incident
Looking at a 30-day Time Window
We’d like to readjust our SLO since a recent incident heavily affected its error budget. First, we set up the data import process by choosing a relevant data source, entering a query, and selecting a 30-day Graph Time window:
We click the Import Data button and wait for the results.
Now that we’ve imported our data, we can examine the statistical values visible in the SLI Analyzer tab:
Upon initial examination of our data, we decide that we’d like to see our error budget when 99% of the values in the distribution are required to be below the 95th percentile:
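The relationship between a percentile-based threshold and a target can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the latency samples are hypothetical (they are not the data from this analysis), the nearest-rank percentile method and the "at or below the threshold counts as good" rule are assumptions, and SLI Analyzer may calculate both differently.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def good_ratio(values, threshold):
    """Fraction of samples at or below the threshold (assumed definition of 'good')."""
    return sum(1 for v in values if v <= threshold) / len(values)

# Hypothetical latency samples in seconds, with a long tail of slow requests.
latencies = [0.010, 0.012, 0.014, 0.011, 0.013, 0.015, 0.016, 0.012, 0.014, 0.013,
             0.011, 0.017, 0.012, 0.018, 0.013, 0.014, 0.250, 0.760, 1.200, 1.690]

threshold = nearest_rank_percentile(latencies, 95)  # threshold taken from the 95th percentile
target = 0.99                                       # 99% of values should be good
ratio = good_ratio(latencies, threshold)

print(f"threshold (p95): {threshold:.3f}s")
print(f"good ratio: {ratio:.2%}, meets {target:.0%} target: {ratio >= target}")
```

With a threshold taken from the 95th percentile, roughly 5% of the samples fall above it, so a 99% target is demanding by construction. This is one reason to expect the error budget to burn quickly in this first attempt.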
When the Analysis is completed, we take a look at our burn down chart and error budget values:
The logarithmic scale view mode allows us to see a long tail of aggregated high latency values:
We can now estimate the impact of the incident. Even with a generous margin for values, our reliability burn is in the red. We’re left with almost -20 hours of error budget, and the total time of bad values is more than 1 day.
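The arithmetic behind these figures can be reproduced with a short sketch. It assumes a time-based error budget (the window multiplied by the allowed share of bad time) and a 99% target. The `bad_time` value of 27 hours is hypothetical, chosen only to be consistent with "more than 1 day"; the exact value is not stated here.

```python
from datetime import timedelta

def error_budget(window: timedelta, target: float) -> timedelta:
    """Total bad time allowed in the window for a time-based SLO (assumed model)."""
    return window * (1 - target)

window = timedelta(days=30)
target = 0.99                   # assumed 99% target
bad_time = timedelta(hours=27)  # hypothetical: "more than 1 day" of bad values

budget = error_budget(window, target)
remaining = budget - bad_time

# Print in hours, because negative timedelta values format awkwardly.
print(f"budget:    {budget.total_seconds() / 3600:.1f} h")
print(f"remaining: {remaining.total_seconds() / 3600:.1f} h")
```

Under these assumptions, a 30-day window allows only about 7.2 hours of bad values, so a little over a day of bad time pushes the remaining budget to roughly -20 hours.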
Our post-mortem investigation showed that an external server caused the incident. It was a one-time event that’s unlikely to recur. We therefore decide to exclude the incident from our SLO by narrowing the Graph Time Window to 14 days, and to check whether this allows us to regain the error budget.
Attempt 2: Readjusting Targets
Narrowing Down Time Window
We need to create a new analysis to see and compare how the error budget is burned for a narrower Graph Time Window. We open a new SLI Analysis tab, select the relevant data source, and enter the same query as above, but we choose the Last 14 days in the Graph Time Window picker:
We click the Import Data button and wait until SLI Analyzer has completed the import. We can see the following raw data:
Before we adjust our target, we first want to observe the statistical data and the SLI chart.
We see that the Mean value is around 14ms with a Max value of 68ms for the selected Time window. This already differs considerably from the 30-day Time window analysis, where the Max value was 1.69s. We also observe that the value for the 99th percentile is around 0.26s (while it was 0.76s in the previous example). We decide to use this value to experiment with the target for our new SLO. Upon recalculation, we observe that excluding the incident allowed us to regain the entire error budget, and we’re left with a healthy margin of 28m 30s (which constitutes around 14% of our budget):
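As a sanity check, the remaining margin can be compared with the total budget for the 14-day window. The sketch below assumes a time-based error budget and a 99% target, neither of which is stated explicitly for this second analysis. The 28m 30s figure is the one reported above.

```python
from datetime import timedelta

window = timedelta(days=14)
target = 0.99                                  # assumed 99% target
remaining = timedelta(minutes=28, seconds=30)  # remaining budget reported by the analysis

budget = window * (1 - target)  # total allowed bad time in the window
share = remaining / budget      # dividing two timedeltas yields a float
consumed = budget - remaining   # bad time implied by the figures above

print(f"total budget:     {budget}")
print(f"remaining share:  {share:.1%}")
print(f"implied bad time: {consumed}")
```

Under these assumptions the total budget comes to a little over 3 hours and 20 minutes, and the remaining 28m 30s works out to roughly 14% of it, which matches the reported share.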
Creating a New SLO
Happy with the outcome, we go ahead and create a new SLO from our analysis.