SLO configuration use case

This article will walk you through creating an SLO for a sample service using sloctl.

Selecting a reasonable target

With SLI Analyzer, you can test and adjust your SLO's target and view the resulting error budget and error budget burn down. Experiment safely on existing data, and once you've established the right target, create the SLO directly from the analysis.

RESTful service response latency

This example demonstrates how to monitor response latency for a RESTful API service. Our goal is to create an SLO that establishes a performance baseline for the service and gives early warning of degradation before it impacts the user experience. Consider common user tasks, such as completing a checkout or logging in to an account: if response latency for these tasks exceeds target performance, it can lead to frustrated users and potential business losses.

Strategy at a glance

Configuration point | Choice | Reasoning
SLI specification | Response latency | Directly correlates to user satisfaction and perceived "snappiness."
Query type | Threshold | Allows us to define specific millisecond boundaries for "good" vs. "bad" performance.
Time window | 28-day rolling | Captures recent trends while smoothing out low-traffic periods like weekends.
Error budget calculation method | Occurrences | Scales with traffic; prevents "false positives" during periods of zero requests.

Understanding the time window

The choice of a rolling time window is critical for operational health. Unlike calendar-aligned time windows (which act like buckets that refill at the start of the specified calendar period), a rolling time window constantly advances and provides a continuous view.

When a latency spike occurs, it burns the error budget immediately. As the time window moves forward, the spike falls out of calculations, and the budget naturally recovers. With this time window, the SLO reflects the current user experience.

Time window configuration
timeWindows:
  - unit: Day
    count: 28
    isRolling: true
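
For comparison, a calendar-aligned window is declared with a calendar section instead of isRolling: true. The snippet below is only an illustrative sketch of that alternative; the unit, count, startTime, and timeZone values are placeholder assumptions, not part of this example.

Calendar-aligned time window (for comparison only)
timeWindows:
  - unit: Month
    count: 1
    calendar:
      # Assumed anchor point; the window resets at each calendar period boundary
      startTime: 2024-01-01 00:00:00
      timeZone: UTC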

Why occurrences?

For RESTful services, traffic is rarely uniform.

Time slices count the number of good minutes in the window. However, if no one uses the service at 3:00 AM, that minute can still be counted as bad, even though no user was actually affected.

Occurrences count individual requests. If there are 1,000 requests and 999 of them are fast, your reliability is 99.9%. This method considers only the requests themselves, not when they occurred, so SLO reporting reflects actual user impact.

Error budget calculation method configuration
spec:
  # Other configuration parameters
  budgetingMethod: Occurrences
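
For contrast, the Time slices method would be declared roughly as shown below. With that method, each objective also carries a timeSliceTarget, the fraction of each one-minute slice that must be good for the slice to count as good. This is a sketch of the alternative we decided against, not part of this example's configuration, and the 0.95 value is an assumption.

Time slices budgeting (alternative, not used here)
spec:
  budgetingMethod: Timeslices
  objectives:
    - name: perfect
      # Assumed fraction of each one-minute slice that must be good
      timeSliceTarget: 0.95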

The three-tiered objectives

We have defined three performance tiers based on business impact and user experience research.

The target % refers to the minimum percentage of requests that must complete within (at or below) the threshold.

  • Perfect tier is our service performance goal:
    • Threshold—200 ms
    • Target—99.5%
    • This represents exceptional performance. Most requests should complete very quickly, with only 0.5% allowed to exceed this threshold.
The Perfect objective configuration
displayName: Perfect
value: 200
name: perfect
target: 0.995
rawMetric:
  query:
    generic:
      query: >-
        SELECT percentile(response_time_ms, 95)
        FROM http_requests
        WHERE service = 'restful-api'
        AND timestamp BETWEEN N9FROM AND N9TO
        GROUP BY time(1m)
# This objective is set as primary for better visibility—the SLO details will display it immediately on the Overview tab
primary: true
# The less than or equal to operator applies to the values of the retrieved data—data values of up to 200 milliseconds are considered good and won't burn the error budget
op: lte
  • Acceptable tier is the warning level:
    • Threshold—500 ms
    • Target—99.9%
    • When response times reach 500 ms, the experience is still acceptable but no longer optimal. We require 99.9% of requests to stay under this limit to prevent degradation from becoming the norm.
The Acceptable objective configuration
displayName: Acceptable
value: 500
name: acceptable
target: 0.999
rawMetric:
  query:
    generic:
      query: >-
        SELECT percentile(response_time_ms, 95)
        FROM http_requests
        WHERE service = 'restful-api'
        AND timestamp BETWEEN N9FROM AND N9TO
        GROUP BY time(1m)
op: lte
  • Unacceptable tier is the critical breach level:
    • Threshold—1000 ms
    • Target—99.95%
    • 1000 ms (1 second) is our absolute limit for acceptable user experience. Only 0.05% of requests can exceed this threshold before users perceive the service as broken.
The Unacceptable objective configuration
displayName: Unacceptable
value: 1000
name: unacceptable
target: 0.9995
rawMetric:
  query:
    generic:
      query: >-
        SELECT percentile(response_time_ms, 95)
        FROM http_requests
        WHERE service = 'restful-api'
        AND timestamp BETWEEN N9FROM AND N9TO
        GROUP BY time(1m)
op: lte
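
In all three objectives, the query itself is identical: the same 95th-percentile latency series is evaluated against three different thresholds, and only the value, target, and naming fields change between tiers (the op: lte operator is shared). In the query, N9FROM and N9TO mark the start and end of the evaluated time range.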

YAML configuration

When translating this to Nobl9 YAML, make sure you use the target field for your percentages and the value field for your millisecond thresholds.

Sample SLO configuration
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: restful-service-latency-slo
  displayName: RESTful service latency SLO
  project: customer-facing-applications
spec:
  description: Early warning signs of response latency degradation for the RESTful service
  indicator:
    metricSource:
      name: my-data-source
      project: telemetry
      kind: Agent
  budgetingMethod: Occurrences
  objectives:
    - displayName: Perfect
      value: 200
      name: perfect
      target: 0.995
      rawMetric:
        query:
          generic:
            query: >-
              SELECT percentile(response_time_ms, 95)
              FROM http_requests
              WHERE service = 'restful-api'
              AND timestamp BETWEEN N9FROM AND N9TO
              GROUP BY time(1m)
      primary: true
      op: lte
    - displayName: Acceptable
      value: 500
      name: acceptable
      target: 0.999
      rawMetric:
        query:
          generic:
            query: >-
              SELECT percentile(response_time_ms, 95)
              FROM http_requests
              WHERE service = 'restful-api'
              AND timestamp BETWEEN N9FROM AND N9TO
              GROUP BY time(1m)
      op: lte
    - displayName: Unacceptable
      value: 1000
      name: unacceptable
      target: 0.9995
      rawMetric:
        query:
          generic:
            query: >-
              SELECT percentile(response_time_ms, 95)
              FROM http_requests
              WHERE service = 'restful-api'
              AND timestamp BETWEEN N9FROM AND N9TO
              GROUP BY time(1m)
      op: lte
  service: restful
  timeWindows:
    - unit: Day
      count: 28
      isRolling: true
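
Once you've saved this configuration, for example as restful-service-latency-slo.yaml (an illustrative file name), you can create the SLO by running sloctl apply -f restful-service-latency-slo.yaml. Note that the project (customer-facing-applications), the metric source (my-data-source in the telemetry project), and the service (restful) used here are sample names; the objects they refer to must already exist in your organization before the SLO can be applied.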

SLO target logic

In a latency SLO, we are measuring reliability through response time targets. The target represents the percentage of requests that must be "good" (no longer than the threshold).

As performance requirements get stricter (lower millisecond values), we can afford slightly more lenient targets since we're setting stretch goals. As we move toward unacceptable performance levels (higher millisecond values), we need increasingly strict targets because we’re approaching the point where user experience becomes poor.
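
For example, with occurrences budgeting over the 28-day window and, say, 10 million requests in that window (a hypothetical volume), the Perfect objective's 99.5% target leaves an error budget of about 50,000 requests slower than 200 ms, while the Unacceptable objective's 99.95% target leaves only about 5,000 requests slower than 1,000 ms.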

Logic verification table

Use this table as a quick sanity check to verify the SLO objectives before deploying the YAML configuration:

Objective | Threshold (limit) | Meaning of threshold | Target (reliability) | Meaning of target
Perfect | 200 ms | "We want requests to be this fast..." | 99.5% | "...and we'll be satisfied if 99.5% of them hit this mark. We can tolerate 0.5% of requests taking longer."
Acceptable | 500 ms | "If it takes this long, it's still okay..." | 99.9% | "...but we need at least 99.9% to stay under this limit. We can tolerate 0.1% of requests taking longer."
Unacceptable | 1000 ms | "This is our absolute tolerance limit..." | 99.95% | "...so 99.95% must be faster than this to avoid user frustration. Only 0.05% of requests can take longer."

Visualizing the thresholds

When you look at a latency distribution, these objectives act as quality gates at different performance levels:

  • The 200 ms gate captures exceptional performance
  • The 500 ms gate catches acceptable but suboptimal performance
  • The 1000 ms gate is the final defense before user experience degrades significantly