Catch on SLO configuration


Learn the magic behind a meaningful SLO

An SLO reflects your service reliability expectations. Proper configuration is crucial for identifying issues and making informed decisions about necessary measures when your system doesn’t perform as expected. In this SLO tutorial, we explore how threshold metrics and ratio metrics contribute to reliability monitoring and error budget tracking. We'll also take a look at data count methods applied to ratio metrics.

Threshold vs. ratio

The first step is choosing the right query type: threshold or ratio.

Threshold SLOs

You specify a single query returning one data stream and set a threshold for evaluation.

Example: Nobl9 queries your data source for the 95th percentile of your web server latency in milliseconds. With a threshold set at <= 200, a data point showing 150 ms p95 latency doesn't indicate an error. With an SLO target set to 98%, if the SLI indicates an error during only 1% of the specified time window, your SLO is considered healthy.

This example assumes that the Occurrences budgeting method is applied.

With this SLO type, you can efficiently monitor website latency.
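The threshold example above can be sketched in a few lines of Python. This is a minimal illustration, not Nobl9's implementation: the `ThresholdSLO` class and its field names are hypothetical, and it assumes that every returned data point counts as one occurrence, which is good when its value is at or below the threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdSLO:
    """Hypothetical helper; the names are illustrative, not a Nobl9 API."""
    threshold_ms: float  # a data point is good when value <= threshold_ms
    target: float        # required fraction of good points, e.g. 0.98

    def reliability(self, points: Sequence[float]) -> float:
        """Fraction of data points that satisfy the threshold."""
        if not points:
            raise ValueError("no data points to evaluate")
        good = sum(1 for value in points if value <= self.threshold_ms)
        return good / len(points)

    def is_healthy(self, points: Sequence[float]) -> bool:
        """The SLO is healthy when reliability reaches the target."""
        return self.reliability(points) >= self.target


slo = ThresholdSLO(threshold_ms=200, target=0.98)

# 99 points at 150 ms (good) and 1 point at 250 ms (error): 1% errors.
p95_latency = [150.0] * 99 + [250.0]

print(slo.reliability(p95_latency))  # 0.99
print(slo.is_healthy(p95_latency))   # True
```

With errors in only 1% of the points, reliability is 99%, which is above the 98% target, so the sketch reports the SLO as healthy.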

Ratio SLOs

You specify two queries to generate two data streams—good and total points (or bad and total points). You then set a target for the good-to-total (or bad-to-total) ratio for evaluation.

Example 1: Nobl9 queries your data source for website requests. You target a 90% good-to-total ratio. If 95 out of 100 total requests are successful within the specified time window, your SLO meets the target.

Example 2: Alternatively, you track the count of 404 responses with a 3% target (bad-to-total ratio). If 2 out of 100 total requests returned that error code within the specified time window, the SLO is healthy.

These examples assume that the Occurrences budgeting method is applied.

With this SLO type, you can efficiently monitor website requests.
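The arithmetic behind Example 1 and Example 2 can be checked with a short Python sketch. The function and parameter names are hypothetical, and the sketch reads the 3% in Example 2 as an upper limit on the bad-to-total ratio, which is an assumption about how that example is worded.

```python
def good_ratio_meets(good: int, total: int, min_good_ratio: float) -> bool:
    """Good-to-total check: the share of good events must reach the minimum."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total >= min_good_ratio


def bad_ratio_within(bad: int, total: int, max_bad_ratio: float) -> bool:
    """Bad-to-total check: the share of bad events must stay within the limit."""
    if total <= 0:
        raise ValueError("total must be positive")
    return bad / total <= max_bad_ratio


# Example 1: 95 successful requests out of 100, 90% good-to-total target.
print(good_ratio_meets(good=95, total=100, min_good_ratio=0.90))  # True

# Example 2: 2 requests with a 404 code out of 100, 3% bad-to-total limit.
print(bad_ratio_within(bad=2, total=100, max_bad_ratio=0.03))     # True
```

Both checks take the same two inputs, a count for the numerator stream and a count for the total stream. Only the direction of the comparison differs.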

Ratio metric key points

While a threshold metric requires only a query and a threshold value, a ratio metric needs one more setting to calculate the good-to-total or bad-to-total ratio: the data count method. This setting is crucial for detailed reliability monitoring.

Incremental vs. non-incremental data count methods

Determining how data is counted is essential for accurate metric tracking:

Incremental
- Continuously increases over time, capturing cumulative counts of events. These metrics require subtraction of values between two points in time to calculate the data for a specific period.
- Example: Total HTTP requests, where each new data point is the running total since the start.
Non-incremental
- Provides the actual values for a specific period between consecutive points. Each data point represents a standalone value for that time interval.
- Example: Rate of processed events in each queried interval.
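The difference between the two data count methods can be sketched as follows. The sample numbers and function names are made up for illustration, and the sketch assumes the counters never reset within the window. Incremental streams are first converted to per-interval counts by subtracting consecutive values, while non-incremental streams are used as they arrive.

```python
from typing import List, Sequence


def incremental_to_deltas(cumulative: Sequence[int]) -> List[int]:
    """Convert running totals to per-interval counts.

    Subtracts each value from the one that follows it.
    Assumes the counter never resets (an illustrative simplification).
    """
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def good_to_total(good_counts: Sequence[int], total_counts: Sequence[int]) -> float:
    """Ratio of good events to total events over all intervals."""
    total = sum(total_counts)
    if total <= 0:
        raise ValueError("no events in the window")
    return sum(good_counts) / total


# Incremental: each point is the running total since the start.
total_running = [1000, 1100, 1250, 1400]
good_running = [990, 1088, 1235, 1383]

total_deltas = incremental_to_deltas(total_running)
good_deltas = incremental_to_deltas(good_running)
print(total_deltas)                              # [100, 150, 150]
print(good_to_total(good_deltas, total_deltas))  # 0.9825

# Non-incremental: each point already holds the count for its interval.
total_per_interval = [100, 150, 150]
good_per_interval = [98, 147, 148]
print(good_to_total(good_per_interval, total_per_interval))  # 0.9825
```

Both inputs describe the same traffic, so they produce the same ratio. Treating running totals as per-interval values, or the reverse, would distort the result, which is why the data count method has to match the shape of the data.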

Good-to-total and bad-to-total types

Your data source might support bad-to-total ratio metrics. The key difference between this and a good-to-total metric lies in which data stream is the primary focus: good or bad. For example, a 90% target has different implications depending on the ratio type:

Good-to-total: The SLO is healthy if 90% or more of the data points are good.
Bad-to-total: The SLO is healthy if 10% or fewer of the data points are bad.
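A small Python sketch of the two readings of a 90% target follows. The function names are hypothetical, and `Fraction` is used only to keep the boundary case (exactly 10% bad) free of floating-point rounding.

```python
from fractions import Fraction


def healthy_good_to_total(good: int, total: int, target: Fraction) -> bool:
    """Healthy when the share of good points is at least the target."""
    return Fraction(good, total) >= target


def healthy_bad_to_total(bad: int, total: int, target: Fraction) -> bool:
    """Healthy when the share of bad points is at most 1 - target."""
    return Fraction(bad, total) <= 1 - target


target = Fraction("0.90")
total = 100

for bad in (5, 10, 11):
    good = total - bad
    print(
        bad,
        healthy_good_to_total(good, total, target),
        healthy_bad_to_total(bad, total, target),
    )
# 5 True True
# 10 True True
# 11 False False
```

When the good and bad streams are complementary, both checks give the same verdict. The ratio type determines which stream you query, not how strict the target is.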

Tips on naming and structuring

If you're unsure how to proceed, here are some example scenarios that illustrate meaningful SLOs:

Website availability monitoring
- Aim: to ensure the website is highly available to users.
- Count of HTTP response codes (2XX responses) must be greater than 98% of total requests.
- Measures the total count of successful incoming requests.
- Metric: good-to-total with the incremental data count method.
- For example: Website requests success rate (target: >98%), Homepage uptime percentage (target: >98%)

Error budget tracking for critical failures
- Aim: to take measures if critical failure percentage increases.
- HTTP 5XX server-side errors must be less than 1% of total requests.
- Monitors the rate of critical failures.
- Metric: bad-to-total with the non-incremental data count method.
- For example: Payment gateway error rate (target: <1%), Critical API failure rate (target: <1%)

Application CPU utilization
- Aim: to ensure CPU utilization remains below the acceptable threshold to maintain system performance.
- Utilization should remain below 80% on average.
- Measures the percentage of CPU usage on application servers.
- Metric: average CPU utilization.
- For example: Application CPU utilization (threshold: <80%), Production service CPU limit (threshold: <80%)

Web page load total time
- Aim: to ensure web page load times remain fast for optimal user experience.
- Load time for critical pages must stay below 2 seconds for 99% of traffic.
- Measures the average load time of critical pages.
- Metric: average page load time.
- For example: Page load latency (threshold: <2s), Checkout latency (threshold: <1.5s)

Let's move on to understand what this all results in
