Aggregation metrics

While standard SLI aggregation combines raw data points over time, composite SLOs aggregate data from multiple child components to form a single parent value using one of the two aggregation metrics. Understanding these metrics is crucial for interpreting your composite SLO's behavior and selecting the right approach for your use case.

The following aggregation metrics are available:

  • Reliability provides continuous monitoring. With the right target and weights, it is highly sensitive to degradations in critical components.
  • Error budget state provides binary monitoring, which better fits compliance reporting. It is sensitive to budget exhaustion.

The choice between these metrics significantly impacts how your composite SLO behaves and what insights it provides.

| Aggregation metric | How it works | Calculated values | Composite result |
| --- | --- | --- | --- |
| Reliability | Aggregates the weighted per-minute reliability of all components at one-minute intervals | The weighted mean of each component's per-minute reliability | The composite per-minute reliability is the weighted mean of per-minute reliabilities of its components |
| Error budget state | Aggregates the error budget status of each component (in other words, whether each component has its error budget remaining) | Each component is assigned a binary status, 1 or 0, for each minute | The composite per-minute reliability is the weighted mean of the per-minute error budget states of all components |
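As a minimal sketch of the two calculations described above, the snippet below computes the composite value for a single minute under each metric. The function name `weighted_mean`, the weights, and the sample values are hypothetical and chosen only for illustration; this is not the product's implementation.

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted mean of values; weights need not sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data for one minute and three components.
weights = [1.0, 1.0, 2.0]
reliability = [1.0, 0.9, 0.5]            # per-minute reliability, 0.0 to 1.0
budget_remaining = [True, True, False]   # does the component still have error budget?

# Reliability metric: weighted mean of the per-minute reliabilities.
composite_reliability = weighted_mean(reliability, weights)

# Error budget state metric: map each component to 1 or 0, then take the weighted mean.
budget_states = [1.0 if remaining else 0.0 for remaining in budget_remaining]
composite_budget_state = weighted_mean(budget_states, weights)

print(f"reliability metric:        {composite_reliability:.3f}")
print(f"error budget state metric: {composite_budget_state:.3f}")
```

With these sample inputs, the heavily weighted third component pulls both composite values down, but the reliability metric reflects how degraded it is while the error budget state metric only reflects that its budget is gone.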

Practical recommendations

Because the reliability metric is continuous and the error budget state is binary, they behave differently when aggregated. The hints provided below assume all components in a composite SLO have equal weights.

| Key point | Reliability | Error budget state |
| --- | --- | --- |
| Choose for | Operational purposes, to track trends over time. How well is the system running right now? | Contractual or compliance reporting, to know whether the SLO has been breached. Are we adhering to our agreements? |
| Why this choice | Nuanced reliability monitoring. Components rarely hit extreme 0% or 100% reliability. Per-minute calculations capture subtle changes. | A strategic big picture. Component state is 1 until it burns its entire error budget, then it drops to 0. Often preferred by executives focused on total downtime or breaches. |
| Core logic | Declines proportionally to component health (e.g., 99% → 95% → 90%), allowing for early intervention. | A component stays at 1 until its budget is fully exhausted, at which point it drops to 0. |
| Trade-offs | Averages can hide failures. A failing component might be masked by others. Compensate with balanced weights and a reasonable target. | Highly sensitive to exhaustion but masks gradual decline. You won't see a component failing until the budget is gone. |
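The trade-offs above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: four equally weighted components, one of which degrades minute by minute, and an error budget modeled as a fixed allowance of "bad minutes" (`BUDGET_BAD_MINUTES`, a hypothetical constant). A real SLO derives its budget from the target and time window.

```python
# Hypothetical scenario: three healthy components plus one degrading component.
HEALTHY = 1.0
degrading = [1.0, 0.8, 0.6, 0.4, 0.2]   # per-minute reliability of the failing component
BUDGET_BAD_MINUTES = 1.5                 # assumed error budget of the failing component


def composite(values):
    """Equal-weight mean, i.e. the weighted mean with identical weights."""
    return sum(values) / len(values)


consumed = 0.0
reliability_series = []
budget_state_series = []
for r in degrading:
    consumed += 1.0 - r  # budget burned during this minute
    state = 1.0 if consumed < BUDGET_BAD_MINUTES else 0.0
    reliability_series.append(composite([HEALTHY, HEALTHY, HEALTHY, r]))
    budget_state_series.append(composite([1.0, 1.0, 1.0, state]))

for minute, (rel, bud) in enumerate(zip(reliability_series, budget_state_series)):
    print(f"minute {minute}: reliability={rel:.2f}  error budget state={bud:.2f}")
```

In this scenario the reliability composite starts declining as soon as the component degrades, yet stays at 0.80 or above even when that component is at 20%, which shows how an average can mask a failure. The error budget state composite stays at 1.00 through the whole decline and only drops to 0.75 in the minute the budget is exhausted.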