Post-incident target adjustment with SLI Analyzer

SLI Analyzer is also useful for post-mortems after major incidents: it lets you review how an incident affected your SLO and readjust the SLO’s targets and thresholds.

In this use case, we use a Datadog threshold query that covers the period when our system had an incident. In our first attempt, we examine the values for our SLO over the last 30 days.

Attempt 1: Analyzing incident

Looking at a 30-day time window

We’d like to readjust our SLO since a recent incident heavily affected its error budget. First, we set up the data import by choosing a relevant data source, entering a query, and selecting a 30-day graph time window:

Image 1: Importing incident data

We click the Import Data button and wait for the results.

Now that we’ve imported our data, we can examine the statistical values visible in the SLI Analyzer tab:

Image 2: Raw incident data

Upon initial examination of our data, we decide to see what our error budget looks like when 99% of the values in the distribution are required to be below the 95th percentile value:

Image 3: Analyzing budget
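The relationship between a percentile-based threshold and a target can be sketched in a few lines of Python. This is only an illustration of the idea, not how SLI Analyzer computes its statistics: the nearest-rank percentile method, the strict "below" comparison, and the sample latencies are all assumptions made for the example.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def good_ratio(values, threshold):
    """Fraction of samples strictly below the threshold (assumed meaning of 'below')."""
    return sum(v < threshold for v in values) / len(values)

# Hypothetical latency samples in seconds; not the data from the incident.
latencies = [0.012, 0.014, 0.013, 0.015, 0.011, 0.016, 0.014, 0.013, 0.760, 1.690]

threshold = percentile(latencies, 95)   # candidate threshold taken from the distribution
ratio = good_ratio(latencies, threshold)
target = 0.99                           # 99% of values should be good

print(f"threshold={threshold}s, good={ratio:.0%}, meets target={ratio >= target}")
```

With real data, the same comparison shows whether a threshold taken from the distribution leaves any room under the chosen target.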

When the analysis is complete, we look at the burn-down chart and the error budget values:

Image 4: Analyzing burn down chart

The logarithmic scale view mode allows us to see a long tail of aggregated high latency values:

Image 5: SLI values distribution chart - logarithmic scale
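To see why a logarithmic scale exposes a long tail, consider the following sketch, which groups latencies into power-of-ten buckets. The bucketing scheme and the sample values are assumptions for illustration; SLI Analyzer's own chart may bin values differently.

```python
import math
from collections import Counter

def log_buckets(values_s):
    """Count samples per power-of-ten bucket, keyed by the exponent of the bucket's lower bound."""
    counts = Counter()
    for v in values_s:
        if v <= 0:
            raise ValueError("latencies must be positive")
        counts[math.floor(math.log10(v))] += 1
    return dict(sorted(counts.items()))

# Hypothetical latencies in seconds: mostly fast requests plus a few slow outliers.
samples = [0.012, 0.014, 0.013, 0.068, 0.26, 0.76, 1.69]

buckets = log_buckets(samples)
for exponent, count in buckets.items():
    low, high = 10 ** exponent, 10 ** (exponent + 1)
    print(f"[{low:g}s, {high:g}s): {'#' * count}")
```

On a linear axis the few values above one second would be squeezed into a barely visible bar; with logarithmic buckets each order of magnitude gets its own row.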

We can now estimate the impact of the incident. Even with this generous threshold, our reliability burn-down is in the red: the remaining error budget is almost -20 hours, and the total time of bad values is more than one day.
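These figures can be cross-checked with simple arithmetic. The sketch below assumes a time-based error budget and the 99% target used in this analysis; the function name is hypothetical, and the -20 hours is the approximate value quoted above.

```python
def error_budget_hours(window_days: float, target: float) -> float:
    """Total allowed 'bad' time for the window, in hours (time-based budget assumed)."""
    return window_days * 24 * (1 - target)

budget_h = error_budget_hours(30, 0.99)   # 1% of 30 days
remaining_h = -20.0                       # approximate remaining budget from the analysis
bad_time_h = budget_h - remaining_h       # time actually spent in a bad state

print(f"budget: {budget_h:.1f} h, bad time: {bad_time_h:.1f} h")
```

Under these assumptions the budget is about 7.2 hours and the bad time about 27.2 hours, which agrees with the statement that bad values lasted more than one day.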

Our post-mortem investigation showed that an external server caused the incident. It was a one-time event that's unlikely to recur. We therefore decide to exclude the incident from our SLO by narrowing the graph time window to 14 days, to see whether this lets us regain the error budget.

Attempt 2: Readjusting targets

Narrowing the time window

To compare how the error budget burns over a narrower graph time window, we need to create a new analysis. We open a new SLI Analysis tab, select the relevant data source, and enter the same query as above, but this time we choose Last 14 days in the graph time window picker:

Image 6: Importing data for last 14 days
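Conceptually, narrowing the window just drops every data point older than the cutoff. The following sketch shows that effect on a handful of made-up samples; the function name, timestamps, and values are hypothetical, and in SLI Analyzer the filtering happens through the time window picker rather than in code.

```python
from datetime import datetime, timedelta, timezone

def last_n_days(samples, days, now):
    """Keep (timestamp, value) pairs that fall within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [(ts, v) for ts, v in samples if cutoff <= ts <= now]

now = datetime(2024, 1, 31, tzinfo=timezone.utc)   # hypothetical "current" time
samples = [
    (now - timedelta(days=20), 1.69),    # spike from the incident, older than 14 days
    (now - timedelta(days=10), 0.014),
    (now - timedelta(days=1), 0.068),
]

recent = last_n_days(samples, 14, now)
print(f"kept {len(recent)} of {len(samples)} samples, max={max(v for _, v in recent)}s")
```

Because the incident spike falls outside the 14-day cutoff, it no longer influences the maximum or the percentiles of the remaining data.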

We click the Import Data button and wait until the SLI Analyzer has completed the import. We can see the following raw data:

Image 7: Analyzing data for a 14-day graph time window

Before adjusting our target, we first look at the statistical data and the SLI chart.

We see that the Mean value is around 14ms, with a Max value of 68ms, for the selected time window. This already differs considerably from the 30-day analysis, where the Max value is 1.69s. We also observe that the 99th percentile is around 0.26s, compared with 0.76s in the previous attempt. We decide to use this value to experiment with the target for our new SLO. Upon recalculation, we see that excluding the incident allowed us to regain the entire error budget, and we’re left with a healthy margin of 28m 30s (around 14% of our budget):

Image 8: Setting a 0.26 target and analyzing budget
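The 14% figure follows from the size of the budget for a 14-day window. The sketch below assumes a time-based error budget and the same 99% reliability target as in the first attempt; the function name is hypothetical.

```python
def remaining_budget_fraction(window_days: float, target: float, remaining_minutes: float) -> float:
    """Share of the total error budget that is still unspent (time-based budget assumed)."""
    total_minutes = window_days * 24 * 60 * (1 - target)
    return remaining_minutes / total_minutes

total_minutes = 14 * 24 * 60 * (1 - 0.99)                     # 1% of 14 days
fraction = remaining_budget_fraction(14, 0.99, 28.5)          # 28m 30s left

print(f"total budget: {total_minutes:.1f} min, remaining: {fraction:.0%}")
```

Under these assumptions the total budget is roughly 201.6 minutes, so 28m 30s corresponds to about 14% of it.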

Creating a new SLO

Happy with the outcome, we go ahead and create a new SLO from our analysis.