Catch on SLO configuration
An SLO reflects your service reliability expectations. Proper configuration is crucial for identifying issues and making informed decisions about necessary measures when your system doesn't perform as expected. In this SLO tutorial, we explore how threshold metrics and ratio metrics contribute to reliability monitoring and error budget tracking. We'll also take a look at data count methods applied to ratio metrics.
Threshold vs. ratio
The first step is choosing the right query type: threshold or ratio.
With a threshold of <= 200, if the data shows a p95 latency of 150 ms, the SLI doesn't indicate an error. With an SLO target set to 98%, if the SLI indicates an error during only 1% of the specified time window, your SLO is considered healthy.
Ratio metric key points
While a threshold metric only requires a query and a threshold value, a ratio metric requires one additional setting, the data count method, to calculate the good-to-total or bad-to-total ratio used for detailed reliability monitoring.
Incremental vs. non-incremental data count methods
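To make the contrast concrete, here is a minimal sketch that evaluates the same 98% target against a threshold metric and a ratio metric. The function names, the sample values, and the per-point evaluation are assumptions made for illustration; they don't describe how any particular platform computes its SLIs.

```python
def threshold_reliability(values, threshold):
    """Share of data points that satisfy `value <= threshold`."""
    good = sum(1 for v in values if v <= threshold)
    return good / len(values)


def ratio_reliability(good_counts, total_counts):
    """Good-to-total ratio over the whole window (per-interval counts)."""
    return sum(good_counts) / sum(total_counts)


TARGET = 0.98  # SLO target of 98%

# Hypothetical p95 latency samples in ms: 99 within the threshold, 1 above it.
latencies_ms = [150] * 99 + [250]
threshold_sli = threshold_reliability(latencies_ms, threshold=200)

# Hypothetical per-interval request counts for a ratio metric.
good_requests = [990, 985, 995]
total_requests = [1000, 1000, 1000]
ratio_sli = ratio_reliability(good_requests, total_requests)

print(f"threshold: {threshold_sli:.2%}, healthy={threshold_sli >= TARGET}")
print(f"ratio:     {ratio_sli:.2%}, healthy={ratio_sli >= TARGET}")
```

The threshold metric needs only the query result and the threshold value, while the ratio metric needs two data streams (good and total) plus a decision about how those counts are reported.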
Determining how data is counted is essential for accurate metric tracking:
- Incremental
  - Continuously increases over time, capturing cumulative counts of events. These metrics require subtraction of values between two points in time to calculate the data for a specific period.
  - Example: Total HTTP requests, where each new data point is the running total since the start.
- Non-incremental
  - Provides the actual values for a specific period between consecutive points. Each data point represents a standalone value for that time interval.
  - Example: Rate of processed events in each queried interval.
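The difference between the two methods can be sketched as follows. The series values and the helper name `deltas_from_incremental` are hypothetical, and the sketch assumes the counter never resets during the window.

```python
def deltas_from_incremental(points):
    """Turn running totals into per-interval counts by subtracting neighbours."""
    return [curr - prev for prev, curr in zip(points, points[1:])]


# Hypothetical incremental series: running totals of HTTP requests.
total_incremental = [1000, 1250, 1600, 2000]
good_incremental = [990, 1235, 1580, 1975]

total_per_interval = deltas_from_incremental(total_incremental)
good_per_interval = deltas_from_incremental(good_incremental)

# A non-incremental series already reports per-interval values,
# so it is summed directly with no subtraction.
total_non_incremental = [250, 350, 400]
good_non_incremental = [245, 345, 395]

ratio_from_incremental = sum(good_per_interval) / sum(total_per_interval)
ratio_from_non_incremental = sum(good_non_incremental) / sum(total_non_incremental)

print(total_per_interval, good_per_interval)
print(ratio_from_incremental, ratio_from_non_incremental)
```

Both series describe the same traffic here, so they yield the same ratio. Picking the wrong method for a data stream would instead treat running totals as per-interval values (or the reverse) and distort the result.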
Read more about incremental and non-incremental ratio metrics.
Good-to-total and bad-to-total types
Your data source might support bad-to-total ratio metrics. The key difference between this and a good-to-total metric lies in which data stream is the primary focus: good or bad. For example, a 90% target has different implications depending on the ratio type:
- Good-to-total: The SLO is healthy if 90% or more of the data points are good.
- Bad-to-total: The SLO is healthy if 10% or fewer of the data points are bad.
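As a small sketch of this distinction, the function below checks a 90% target for either ratio type. The labels `"good_to_total"` and `"bad_to_total"` and the sample counts are assumptions chosen for this example. The bad count is converted to a good count first, so both types are compared against the same target.

```python
def is_healthy(ratio_type, primary_count, total_count, target):
    """Check a ratio SLO against its target.

    ratio_type: "good_to_total" or "bad_to_total" (labels used in this sketch).
    primary_count: the good count or the bad count, matching ratio_type.
    target: reliability target as a fraction, for example 0.90.
    """
    if ratio_type == "good_to_total":
        good = primary_count
    elif ratio_type == "bad_to_total":
        good = total_count - primary_count
    else:
        raise ValueError(f"unknown ratio type: {ratio_type}")
    return good / total_count >= target


TARGET = 0.90

# 920 good events out of 1000: at least 90% are good.
print(is_healthy("good_to_total", 920, 1000, TARGET))
# 80 bad events out of 1000: no more than 10% are bad.
print(is_healthy("bad_to_total", 80, 1000, TARGET))
# 150 bad events out of 1000: more than 10% are bad.
print(is_healthy("bad_to_total", 150, 1000, TARGET))
```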
Tips on naming and structuring
If you're unsure how to proceed, here are some example scenarios that illustrate meaningful SLOs:
- Website availability monitoring:
  - Aim: to ensure the website is highly available to users.
  - Count of HTTP response codes (2XX responses) must be greater than 98% of total requests.
  - Measures the total count of successful incoming requests.
  - Metric: good-to-total with the incremental data count method
  - For example, Website requests success rate (target: >98%), Homepage uptime percentage (target: >98%)
- Error budget tracking for critical failures:
  - Aim: to take measures if critical failure percentage increases.
  - HTTP 5XX server-side errors must be less than 1% of total requests.
  - Monitors the rate of critical failures.
  - Metric: bad-to-total with the non-incremental data count method
  - For example, Payment gateway error rate (target: <1%), Critical API failure rate (target: <1%)
- Application CPU utilization:
  - Aim: to ensure CPU utilization remains below the acceptable threshold to maintain system performance.
  - Utilization should remain below 80% on average.
  - Measures the percentage of CPU usage on application servers.
  - Metric: average CPU utilization
  - For example, Application CPU utilization (threshold: <80%), Production service CPU limit (threshold: <80%)
- Web page load total time:
  - Aim: to ensure web page load times remain fast for optimal user experience.
  - Load time for critical pages must stay below 2 seconds for 99% of traffic.
  - Measures the average load time of critical pages.
  - Metric: average page load time
  - For example, Page load latency (threshold: <2s), Checkout latency (threshold: <1.5s)
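One way to keep names, metric types, and limits together is to write the scenarios down as plain data. The sketch below does this for one example from each scenario. The `SloSketch` class, its field names, and the observed values are all hypothetical and only illustrate the structure; they are not a real configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloSketch:
    name: str
    metric_type: str  # "good_to_total", "bad_to_total", or "threshold"
    comparison: str   # ">" or "<"
    limit: float      # target or threshold, expressed in `unit`
    unit: str

    def is_met(self, observed):
        """Compare an observed value against the limit."""
        if self.comparison == ">":
            return observed > self.limit
        return observed < self.limit


SCENARIOS = [
    SloSketch("Website requests success rate", "good_to_total", ">", 98.0, "%"),
    SloSketch("Payment gateway error rate", "bad_to_total", "<", 1.0, "%"),
    SloSketch("Application CPU utilization", "threshold", "<", 80.0, "%"),
    SloSketch("Page load latency", "threshold", "<", 2.0, "s"),
]

# Hypothetical observations, in the same units as each limit.
observed = {
    "Website requests success rate": 99.2,
    "Payment gateway error rate": 1.4,
    "Application CPU utilization": 63.0,
    "Page load latency": 1.7,
}

results = {s.name: s.is_met(observed[s.name]) for s in SCENARIOS}
for name, ok in results.items():
    print(f"{name}: {'met' if ok else 'missed'}")
```

With these sample observations, only the payment gateway error rate misses its limit, because 1.4% is above the 1% ceiling.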