
Catch on SLO configuration

Learn the magic behind a meaningful SLO

An SLO reflects your service reliability expectations. Proper configuration is crucial for identifying issues and making informed decisions about necessary measures when your system doesn't perform as expected. In this SLO tutorial, we explore how threshold metrics and ratio metrics contribute to reliability monitoring and error budget tracking. We'll also take a look at data count methods applied to ratio metrics.

Threshold vs. ratio

The first step is choosing the right query type: threshold or ratio.

Threshold SLOs
You specify a single query returning one data stream and set a threshold for evaluation.
Example: Nobl9 queries your data source for the 95th percentile of your web server latency in milliseconds. With the threshold set at <= 200, a data point of 150 ms p95 latency doesn't indicate an error. With an SLO target of 98%, if the SLI indicates errors during only 1% of the specified time window, your SLO is considered healthy: the target allows errors for up to 2% of the window.
This example assumes the Occurrences budgeting method is applied.
With this SLO type, you can efficiently monitor website latency.
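
For illustration, here is a minimal sketch of how a threshold SLI could be evaluated under the Occurrences budgeting method. It is not Nobl9's evaluation engine; the data points and the small 8-point window are assumptions:

```python
# Minimal sketch of threshold SLI evaluation with the Occurrences
# budgeting method. Illustrative only; all values are hypothetical.

p95_latency_ms = [120, 150, 180, 240, 130, 110, 190, 160]  # queried data stream
THRESHOLD_MS = 200  # SLI condition: a point <= 200 ms counts as good
TARGET = 0.98       # SLO target: 98% of occurrences must be good

good = sum(1 for v in p95_latency_ms if v <= THRESHOLD_MS)
reliability = good / len(p95_latency_ms)
print(f"reliability={reliability:.2%}, healthy={reliability >= TARGET}")
# 7 of 8 points are good (87.50%), so this small window misses the target.
```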
Ratio SLOs
You specify two queries to generate two data streams: good and total points (or bad and total points). You then set a target for the good-to-total (or bad-to-total) ratio for evaluation.
Example 1: Nobl9 queries your data source for website requests. You target a 90% good-to-total ratio. If 95 out of 100 total requests are successful within the specified time window, your SLO meets the target.
Example 2: Alternatively, tracking the count of 404 responses with a 3% bad-to-total target means that if 2 out of 100 total requests return that error code within the specified time window, the SLO is healthy.
These examples assume the Occurrences budgeting method is applied.
With this SLO type, you can efficiently monitor website requests.
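
Analogously, here is a minimal sketch of a good-to-total ratio evaluation, again assuming the Occurrences budgeting method (the counts are hypothetical):

```python
# Minimal sketch of ratio SLI evaluation (good-to-total).
good_requests = 95    # result of the "good" query
total_requests = 100  # result of the "total" query
TARGET = 0.90         # SLO target: 90% good-to-total ratio

ratio = good_requests / total_requests
print(f"good-to-total={ratio:.2%}, healthy={ratio >= TARGET}")  # 95.00%, True
```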

Ratio metric key points

While a threshold metric requires only a query and a threshold value, a ratio metric needs one more setting to calculate the good-to-total or bad-to-total ratio, which is crucial for detailed reliability monitoring: the data count method.

Incremental vs. non-incremental data count methods

Determining how data is counted is essential for accurate metric tracking:

  • Incremental
    • Continuously increases over time, capturing cumulative counts of events. These metrics require subtracting values between two points in time to calculate the data for a specific period (see the sketch below).
    • Example: Total HTTP requests, where each new data point is the running total since the start.
  • Non-incremental
    • Provides the actual values for a specific period between consecutive points. Each data point represents a standalone value for that time interval.
    • Example: Rate of processed events in each queried interval.

Read more about incremental and non-incremental ratio metrics.
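
The difference is easiest to see side by side. A minimal sketch, assuming hypothetical counter values:

```python
# Incremental: a cumulative counter; per-interval counts are obtained
# by subtracting consecutive data points.
cumulative_requests = [100, 140, 190, 260]
per_interval = [b - a for a, b in zip(cumulative_requests, cumulative_requests[1:])]
print(per_interval)  # [40, 50, 70]

# Non-incremental: each data point already holds the value for its
# interval, so it is used as-is.
events_per_interval = [40, 50, 70]
print(events_per_interval)  # [40, 50, 70]
```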

Good-to-total and bad-to-total types

Your data source might support bad-to-total ratio metrics. The key difference between this and a good-to-total metric lies in which data stream is the primary focus: good or bad. For example, a 90% target has different implications depending on the ratio type:

  • Good-to-total: The SLO is healthy if 90% or more of the data points are good.
  • Bad-to-total: The SLO is healthy if 10% or fewer of the data points are bad.
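
A small hypothetical check makes the symmetry concrete: the same 90% target applied to the same 100 data points yields the same verdict under both framings.

```python
TARGET = 0.90
total, good = 100, 92
bad = total - good

healthy_good_to_total = good / total >= TARGET    # 92% >= 90% -> True
healthy_bad_to_total = bad / total <= 1 - TARGET  # 8% <= 10% -> True
print(healthy_good_to_total, healthy_bad_to_total)
```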

Tips on naming and structuring

If you're unsure how to proceed, here are some example scenarios that illustrate meaningful SLOs:

Website availability monitoring:
Aim: to ensure the website is highly available to users.
The count of successful HTTP responses (2XX) must exceed 98% of total requests.
Measures the total count of successful incoming requests.
Metric: good-to-total with the incremental data count method
For example, Website requests success rate (target: >98%), Homepage uptime percentage (target: >98%)
Error budget tracking for critical failures:
Aim: to take action when the percentage of critical failures increases.
HTTP 5XX server-side errors must be less than 1% of total requests.
Monitors the rate of critical failures.
Metric: bad-to-total with the non-incremental data count method
For example, Payment gateway error rate (target: <1%), Critical API failure rate (target: <1%)
Application CPU utilization:
Aim: to ensure CPU utilization remains below the acceptable threshold to maintain system performance.
Utilization should remain below 80% on average.
Measures the percentage of CPU usage on application servers.
Metric: threshold on average CPU utilization
For example, Application CPU utilization (threshold: <80%), Production service CPU limit (threshold: <80%)
Web page load total time:
Aim: to ensure web page load times remain fast for optimal user experience.
Load time for critical pages must stay below 2 seconds for 99% of traffic.
Measures the average load time of critical pages.
Metric: threshold on average page load time
For example, Page load latency (threshold: <2s), Checkout latency (threshold: <1.5s)
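
To see how such a target translates into an error budget, here is a small hypothetical calculation for the critical-failure scenario. The 99% reliability target follows from the <1% error rate above; the 28-day window and request volume are assumptions:

```python
# Hypothetical error-budget math for a "Critical API failure rate" SLO:
# a <1% bad-to-total target over a 28-day rolling window.
TARGET = 0.99                   # at most 1% of requests may fail
requests_in_window = 2_000_000  # assumed traffic over 28 days

error_budget = (1 - TARGET) * requests_in_window
print(f"error budget: {error_budget:,.0f} failed requests")  # 20,000

observed_failures = 12_500
remaining = error_budget - observed_failures
print(f"budget consumed: {observed_failures / error_budget:.1%}, "
      f"remaining: {remaining:,.0f} requests")  # 62.5%, 7,500
```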
Let's move on to understand what all of this results in.