SLO configuration use case

This article will walk you through creating an SLO for a sample service using sloctl.

Selecting a reasonable target

With SLI Analyzer, you can test and adjust your SLO's target and view the resulting error budget and error budget burn down. Experiment safely on existing data, and once you've established the right target, create the SLO directly from the analysis.

RESTful service response latency

This example demonstrates how to monitor response latency for a RESTful API service. Our goal is to create an SLO that establishes a performance baseline for the service and gives early warning of degradation before it impacts the user experience. Consider common user tasks, such as completing a checkout or logging in to an account: if response latency for these tasks exceeds target performance, it can lead to frustrated users and potential business losses.

Strategy at a glance

Configuration point | Choice | Reasoning
SLI specification | Response latency | Directly correlates to user satisfaction and perceived "snappiness."
Query type | Threshold | Allows us to define specific millisecond boundaries for "good" vs. "bad" performance.
Time window | 28-day rolling | Captures recent trends while smoothing out low-traffic periods like weekends.
Error budget calculation method | Occurrences | Scales with traffic; prevents "false positives" during periods of zero requests.

Understanding the time window

The choice of a rolling time window is critical for operational health. Unlike calendar-aligned time windows (which act like buckets that refill at the start of the specified calendar period), a rolling time window constantly advances and provides a continuous view.

When a latency spike occurs, it burns the error budget immediately. As the time window moves forward, the spike falls out of calculations, and the budget naturally recovers. With this time window, the SLO reflects the current user experience.

Time window configuration
timeWindows:
  - unit: Day
    count: 28
    isRolling: true
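
For comparison, a calendar-aligned window is declared with a calendar section instead of isRolling: true. The snippet below is only an illustrative sketch of that alternative; the unit, count, startTime, and timeZone values are placeholder assumptions, not part of this example.

Calendar-aligned time window (for comparison only)
timeWindows:
  - unit: Month
    count: 1
    calendar:
      # Assumed anchor point; the window resets at each calendar period boundary
      startTime: 2024-01-01 00:00:00
      timeZone: UTC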

Why occurrences?

For RESTful services, traffic is rarely uniform.

Time slices count the number of good minutes in the window. However, if no one uses the service at 3:00 AM, that minute can still be counted as bad, even though no user was actually affected.

Occurrences count individual requests. If there are 1,000 requests and 999 of them are fast, your reliability is 99.9%. This method considers only the requests themselves, not when they occurred, so SLO reporting reflects actual user impact.

Error budget calculation method configuration
spec:
  # Other configuration parameters
  budgetingMethod: Occurrences
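
For contrast, the Time slices method would be declared roughly as shown below. With that method, each objective also carries a timeSliceTarget, the fraction of each one-minute slice that must be good for the slice to count as good. This is a sketch of the alternative we decided against, not part of this example's configuration, and the 0.95 value is an assumption.

Time slices budgeting (alternative, not used here)
spec:
  budgetingMethod: Timeslices
  objectives:
    - name: perfect
      # Assumed fraction of each one-minute slice that must be good
      timeSliceTarget: 0.95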

The three-tiered objectives

We have defined three performance tiers based on business impact and user experience research.

The target % refers to the minimum percentage of requests that must complete within (at or below) the threshold.

  • Perfect tier is our service performance goal:
    • Threshold—200 ms
    • Target—99.5%
    • This represents exceptional performance. Most requests should complete very quickly, with only 0.5% allowed to exceed this threshold.
The Perfect objective configuration
displayName: Perfect
value: 200
name: perfect
target: 0.995
rawMetric:
  query:
    generic:
      query: >-
        SELECT percentile(response_time_ms, 95)
        FROM http_requests
        WHERE service = 'restful-api'
        AND timestamp BETWEEN N9FROM AND N9TO
        GROUP BY time(1m)
# This objective is set as primary for better visibility—the SLO details will display it immediately on the Overview tab
primary: true
# The less than or equal to operator applies to the values of the retrieved data—data values of up to 200 milliseconds are considered good and won't burn the error budget
op: lte
  • Acceptable tier is the warning level:
    • Threshold—500 ms
    • Target—99.9%
    • When response times reach 500 ms, the experience is still acceptable but no longer optimal. We require 99.9% of requests to stay under this limit to prevent degradation from becoming the norm.
The Acceptable objective configuration
displayName: Acceptable
value: 500
name: acceptable
target: 0.999
rawMetric:
  query:
    generic:
      query: >-
        SELECT percentile(response_time_ms, 95)
        FROM http_requests
        WHERE service = 'restful-api'
        AND timestamp BETWEEN N9FROM AND N9TO
        GROUP BY time(1m)
op: lte
  • Unacceptable tier is the critical breach level:
    • Threshold—1000 ms
    • Target—99.95%
    • 1000 ms (1 second) is our absolute limit for acceptable user experience. Only 0.05% of requests can exceed this threshold before users perceive the service as broken.
The Unacceptable objective configuration
displayName: Unacceptable
value: 1000
name: unacceptable
target: 0.9995
rawMetric:
  query:
    generic:
      query: >-
        SELECT percentile(response_time_ms, 95)
        FROM http_requests
        WHERE service = 'restful-api'
        AND timestamp BETWEEN N9FROM AND N9TO
        GROUP BY time(1m)
op: lte
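
In all three objectives, the query itself is identical: the same 95th-percentile latency series is evaluated against three different thresholds, and only the value, target, and naming fields change between tiers (the op: lte operator is shared). In the query, N9FROM and N9TO mark the start and end of the evaluated time range.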

YAML configuration

When translating this to Nobl9 YAML, make sure you use the target field for your percentages and the value field for your millisecond thresholds.

Sample SLO configuration
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: restful-service-latency-slo
  displayName: RESTful service latency SLO
  project: customer-facing-applications
spec:
  description: Early warning signs of response latency degradation for the RESTful service
  indicator:
    metricSource:
      name: my-data-source
      project: telemetry
      kind: Agent
  budgetingMethod: Occurrences
  objectives:
    - displayName: Perfect
      value: 200
      name: perfect
      target: 0.995
      rawMetric:
        query:
          generic:
            query: >-
              SELECT percentile(response_time_ms, 95)
              FROM http_requests
              WHERE service = 'restful-api'
              AND timestamp BETWEEN N9FROM AND N9TO
              GROUP BY time(1m)
      primary: true
      op: lte
    - displayName: Acceptable
      value: 500
      name: acceptable
      target: 0.999
      rawMetric:
        query:
          generic:
            query: >-
              SELECT percentile(response_time_ms, 95)
              FROM http_requests
              WHERE service = 'restful-api'
              AND timestamp BETWEEN N9FROM AND N9TO
              GROUP BY time(1m)
      op: lte
    - displayName: Unacceptable
      value: 1000
      name: unacceptable
      target: 0.9995
      rawMetric:
        query:
          generic:
            query: >-
              SELECT percentile(response_time_ms, 95)
              FROM http_requests
              WHERE service = 'restful-api'
              AND timestamp BETWEEN N9FROM AND N9TO
              GROUP BY time(1m)
      op: lte
  service: restful
  timeWindows:
    - unit: Day
      count: 28
      isRolling: true
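
Once you've saved this configuration, for example as restful-service-latency-slo.yaml (an illustrative file name), you can create the SLO by running sloctl apply -f restful-service-latency-slo.yaml. Note that the project (customer-facing-applications), the metric source (my-data-source in the telemetry project), and the service (restful) used here are sample names; the objects they refer to must already exist in your organization before the SLO can be applied.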

SLO target logic

In a latency SLO, we are measuring reliability through response time targets. The target represents the percentage of requests that must be "good" (no longer than the threshold).

As performance requirements get stricter (lower millisecond values), we can afford slightly more lenient targets since we're setting stretch goals. As we move toward unacceptable performance levels (higher millisecond values), we need increasingly strict targets because we’re approaching the point where user experience becomes poor.
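
For example, with occurrences budgeting over the 28-day window and, say, 10 million requests in that window (a hypothetical volume), the Perfect objective's 99.5% target leaves an error budget of about 50,000 requests slower than 200 ms, while the Unacceptable objective's 99.95% target leaves only about 5,000 requests slower than 1,000 ms.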

Logic verification table

Use this table as a quick sanity check to verify the SLO objectives before deploying the YAML configuration:

Objective | Threshold (limit) | Meaning of threshold | Target (reliability) | Meaning of target
Perfect | 200 ms | "We want requests to be this fast..." | 99.5% | "...and we'll be satisfied if 99.5% of them hit this mark. We can tolerate 0.5% of requests taking longer."
Acceptable | 500 ms | "If it takes this long, it's still okay..." | 99.9% | "...but we need at least 99.9% to stay under this limit. We can tolerate 0.1% of requests taking longer."
Unacceptable | 1000 ms | "This is our absolute tolerance limit..." | 99.95% | "...so 99.95% must be faster than this to avoid user frustration. Only 0.05% of requests can take longer."

Visualizing the thresholds

When you look at a latency distribution, these objectives act as quality gates at different performance levels:

  • The 200 ms gate captures exceptional performance
  • The 500 ms gate catches acceptable but suboptimal performance
  • The 1000 ms gate is the final defense before user experience degrades significantly