Aggregation metrics
Standard SLI aggregation combines raw data points over time. Composite SLOs instead aggregate data from multiple child components into a single parent value, using one of two aggregation metrics. Understanding these metrics is crucial for interpreting your composite SLO's behavior and selecting the right approach for your use case.
The following aggregation metrics are available:
- Reliability provides continuous monitoring. With the right target and weights, it is highly sensitive to degradations in critical components.
- Error budget state is better suited to compliance reporting, where monitoring is binary. It is sensitive to budget exhaustion.
The choice between these metrics significantly impacts how your composite SLO behaves and what insights it provides.
| Aggregation metric | How it works | Calculated values | Composite result |
|---|---|---|---|
| Reliability | Aggregates the weighted per-minute reliability of all components at one-minute intervals | The weighted mean of each component's per-minute reliability | The composite per-minute reliability is the weighted mean of per-minute reliabilities of its components |
| Error budget state | Aggregates the error budget status of each component (in other words, whether each component has its error budget remaining) | Each component is assigned a binary status, 1 or 0, for each minute | The composite per-minute reliability is the weighted mean of the per-minute error budget states of all components |
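The two rows above can be sketched in Python for a single minute of data. This is a minimal illustration under stated assumptions, not the product's implementation: the helper `weighted_mean`, the sample values, and the weights are hypothetical, weights are assumed to be normalized by their sum, and a component is assumed to count as having budget remaining whenever its remaining budget is greater than zero.

```python
def weighted_mean(values, weights):
    """Weighted mean of per-minute component values (hypothetical helper)."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# One minute of made-up data for three components.
weights = [2.0, 1.0, 1.0]

# Reliability: each component contributes a continuous value in [0, 1].
reliabilities = [0.90, 1.00, 1.00]
composite_reliability = weighted_mean(reliabilities, weights)

# Error budget state: each component contributes 1 if it still has
# error budget remaining, otherwise 0 (assumed threshold: budget > 0).
budget_remaining = [0.35, 0.0, 0.60]  # fraction of error budget left
states = [1 if b > 0 else 0 for b in budget_remaining]
composite_state = weighted_mean(states, weights)

print(f"reliability-based composite: {composite_reliability:.4f}")
print(f"budget-state-based composite: {composite_state:.4f}")
```

Both metrics share the same weighted-mean step. They differ only in what each component contributes for the minute: a continuous reliability value or a binary state.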
Practical recommendations
Because the reliability metric is continuous and the error budget state is binary, they behave differently when aggregated. The recommendations below assume all components in a composite SLO have equal weights.
| Key point | Reliability | Error budget state |
|---|---|---|
| Choose for | Operational purposes, to track trends over time. How well is the system running right now? | Contractual or compliance reporting, to know whether the SLO has been breached. Are we adhering to our agreements? |
| Why this choice | Nuanced reliability monitoring. Components rarely hit the extremes of 0% or 100% reliability, so per-minute calculations capture subtle changes. | A strategic, big-picture view. A component's state is 1 until it burns its entire error budget, then it drops to 0. Often preferred by executives focused on total downtime or breaches. |
| Core logic | Declines proportionally to component health (e.g., 99% → 95% → 90%), allowing for early intervention. | A component stays at 1 until its budget is fully exhausted, at which point it drops to 0. |
| Trade-offs | Averages can hide failures: a failing component might be masked by healthy ones. Compensate with balanced weights and a reasonable target. | Highly sensitive to exhaustion, but masks gradual decline. You won't see a component failing until its budget is gone. |
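The trade-offs in the table can be illustrated with a small simulation. The sketch below assumes four equally weighted components, of which only one degrades. All numbers are made up, the names `timeline` and `rows` are hypothetical, and a component's state is assumed to flip to 0 only when its remaining budget reaches zero.

```python
def mean(values):
    """Plain mean, equivalent to a weighted mean with equal weights."""
    return sum(values) / len(values)


# Each tuple describes the degrading component at one point in time:
# (per-minute reliability, fraction of error budget left). Made-up data.
timeline = [(1.00, 0.80), (0.95, 0.50), (0.60, 0.10), (0.60, 0.00)]

rows = []
for reliability, budget_left in timeline:
    # The other three components stay fully healthy throughout.
    by_reliability = mean([1.0, 1.0, 1.0, reliability])
    state = 1 if budget_left > 0 else 0
    by_state = mean([1, 1, 1, state])
    rows.append((by_reliability, by_state))
    print(f"reliability={by_reliability:.4f}  budget_state={by_state:.2f}")
```

In this sketch the reliability-based composite starts declining as soon as the component degrades, but the healthy components keep it at about 0.90 even when the failing one is at 0.60. The budget-state composite stays at 1 through the entire decline and moves only at the last step, when the budget is exhausted.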
Useful links
Check out these related guides and references: