Aggregation metrics

While standard SLI aggregation combines raw data points over time, a composite SLO aggregates data from multiple child components into a single parent value using one of two aggregation metrics: reliability or error budget state. Understanding these metrics is crucial for interpreting your composite SLO's behavior and selecting the right approach for your use case.

The following aggregation metrics are available:

Reliability provides continuous monitoring. With the right target and weights, it is highly sensitive to degradations in critical components.
Error budget state is a better fit for compliance reporting because it tracks a binary condition: whether each component still has error budget remaining. It is sensitive to budget exhaustion.

The choice between these metrics significantly impacts how your composite SLO behaves and what insights it provides.

Aggregation metric	How it works	Calculated values	Composite result
Reliability	Aggregates the weighted per-minute reliability of all components at one-minute intervals	The weighted mean of each component's per-minute reliability	The composite per-minute reliability is the weighted mean of per-minute reliabilities of its components
Error budget state	Aggregates the error budget status of each component (in other words, whether each component has error budget remaining)	Each component is assigned a binary status for each minute: 1 while budget remains, 0 once it is exhausted	The composite per-minute reliability is the weighted mean of the per-minute error budget states of all components
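
As a minimal sketch of the two calculations in the table, the following Python computes the composite value for a single minute. The function names, weights, and sample values are hypothetical and chosen for illustration. Treating "budget remaining" as "remaining budget greater than zero" is also an assumption.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of values; weights need not sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def composite_reliability(reliabilities, weights):
    """Reliability metric: weighted mean of per-minute component reliabilities."""
    return weighted_mean(reliabilities, weights)


def composite_budget_state(budgets_remaining, weights):
    """Error budget state metric: weighted mean of binary states.

    Assumption: a component's state is 1 while its remaining budget is
    above zero and 0 otherwise.
    """
    states = [1 if remaining > 0 else 0 for remaining in budgets_remaining]
    return weighted_mean(states, weights)


# Hypothetical minute: three components with weights 2, 1, 1.
weights = [2, 1, 1]
reliability = composite_reliability([1.0, 0.5, 1.0], weights)
state = composite_budget_state([0.25, 0.0, 0.5], weights)
print(reliability)  # 0.875
print(state)        # 0.75
```

In this example the second component is at 50% reliability and has no budget left. The reliability metric reflects the size of the degradation (0.875), while the error budget state metric reflects only that one component with a quarter of the total weight is out of budget (0.75).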

Practical recommendations

Because the reliability metric is continuous and the error budget state is binary, they behave differently when aggregated. The recommendations below assume all components in a composite SLO have equal weights.

Key point	Reliability	Error budget state
Choose for	Operational monitoring that tracks trends over time. How well is the system running right now?	Contractual or compliance reporting that shows whether an SLO has been breached. Are we adhering to our agreements?
Why this choice	Nuanced reliability monitoring. Components rarely hit the extremes of 0% or 100% reliability, and per-minute calculations capture subtle changes.	A strategic big picture. A component's state is 1 until it burns its entire error budget, then it drops to 0. Often preferred by executives focused on total downtime or breaches.
Core logic	Declines proportionally to component health (e.g., 99% → 95% → 90%), allowing for early intervention.	A component stays at 1 until its budget is fully exhausted, at which point it drops to 0.
Trade-offs	Averages can hide failures: a failing component might be masked by healthy ones. Compensate with balanced weights and a reasonable target.	Highly sensitive to budget exhaustion, but masks gradual decline. You won't see a component failing until its budget is gone.
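
To illustrate the core logic and trade-offs, here is a small simulation of three equally weighted components in which one degrades over four minutes. The burn model (a component's state drops to 0 once its accumulated unreliability reaches a fixed budget), the numbers, and all names are simplifying assumptions for illustration, not a description of how any platform tracks error budgets.

```python
def component_series(reliabilities, budget):
    """Return (reliability, budget_state) pairs, one per minute.

    Hypothetical burn model: each minute burns (1 - reliability) of the
    budget; the state is 1 while the burned total is below the budget.
    """
    burned = 0.0
    series = []
    for r in reliabilities:
        burned += 1.0 - r
        state = 1 if burned < budget else 0
        series.append((r, state))
    return series


def mean(values):
    """Equal-weight mean, matching the equal-weights assumption."""
    return sum(values) / len(values)


components = [
    component_series([1.0, 1.0, 1.0, 1.0], budget=1.0),    # healthy
    component_series([1.0, 1.0, 1.0, 1.0], budget=1.0),    # healthy
    component_series([1.0, 0.75, 0.5, 0.25], budget=1.0),  # degrading
]

composite = []
for minute in zip(*components):
    rel = mean([r for r, _ in minute])
    state = mean([s for _, s in minute])
    composite.append((round(rel, 3), round(state, 3)))

for minute, (rel, state) in enumerate(composite):
    print(minute, rel, state)
# 0 1.0 1.0
# 1 0.917 1.0
# 2 0.833 1.0
# 3 0.75 0.667
```

Under these assumptions, the reliability composite starts declining at minute 1, while the error budget state composite stays at 1.0 until the degrading component exhausts its budget at minute 3. The averaging effect is also visible: at minute 3 the composite reliability is still 0.75 even though one component is at 0.25.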