
Choosing a target for a composite SLO


Balance is key

High targets result in smaller error budgets, leading to more ambitious reliability goals. A high target serves as an early warning system, alerting you sooner when service reliability needs attention. This can result in higher error budget burn rates and cause more frequent alerts.

While aiming for high reliability is an ambitious goal, perfection is often impractical and unnecessary. Smart SLO monitoring aims for "good enough" targets that represent the lowest tolerable level of reliability, so you can focus effort strategically, save resources, and maintain reliability.

While such decisions usually require discussion and stakeholder consensus, SLOs offer a common, mathematically reasoned framework to facilitate agreements on reliability goals.

This article outlines three approaches to target selection. It illustrates these approaches with example composites that consolidate component data by reliability.

Targets in standard and composite SLOs

Setting targets for standard SLOs stays close to what we want to achieve, because we observe our service's individual attributes directly. SLIs in standard SLOs usually represent tangible metrics, like API requests or database availability. That's why standard SLOs are very illustrative: we can clearly define that we need "98% of API requests to be successful in any 28-day period" or "99% of the time in any calendar month the database must be available."

Composite SLOs add a level of complexity that creates a two-fold separation from the direct service attribute: first, by calculating reliability or error budgets for components; then, by deriving a weighted mean from those values to produce the final reliability or error budget used to define composite targets.

In composite SLOs, a "99% reliability target over a 7-day time window" means that the weighted mean reliability of the components over the 7-day period must be at least 99%. Composite SLOs represent diverse metrics, such as API request success rates, response time latencies, batch job success rates, or service uptime, which may not be naturally compatible with one another.

Target for consolidated reliability or error budget

A composite SLO's target doesn't refer to any tangible measurement with units but rather to its components' reliability, budgets, and weights.

75% target example

To illustrate a significantly low reliability target, let's refer to the example of weighted components. For this example, we use the following normalized component weights:

  • 40% for component A
  • 34% for component B
  • 10% for component C
  • 16% for component D

We set the 75% target for this composite, considering that:

  • The target defines the allowable composite error budget
  • The normalized weight determines how sensitive the composite is to the component failure

These key points provide insight into how the composite error budget will respond to component errors.

For a composite with a 75% target, the total error budget is 25 percentage points (100% - 75%). To determine the impact of a single component, we calculate its reliability threshold: the point at which that component alone consumes the entire 25% budget while all other components stay at 100% reliability:

CT = 100% - CEB / CNW

where:

  • CT is the component threshold
  • CEB is the composite error budget
  • CNW is the component normalized weight

The higher a component's weight, the more sensitive the composite is to its performance. This creates a lower tolerance for failure:

| Component | Composite sensitivity | CNW | CT | Description |
|---|---|---|---|---|
| Component A | High | 40% | 37.5% | It takes a drop of 62.5 percentage points in reliability for Component A to burn the entire composite error budget. |
| Component B | Significant | 34% | 26.5% | Component B is capable of burning the entire error budget if its reliability drops by 73.5 percentage points or more. |
| Component C | Negligible | 10% | -150% | Even at 0% reliability, Component C is mathematically incapable of exhausting the error budget on its own because its calculated reliability threshold is negative. It will always stay within the 25% error budget. |
| Component D | Negligible | 16% | -56.25% | Component D's weight of 16% is also lower than the 25% error budget. Much like Component C, even a total failure (0% reliability) would not be enough to burn the entire composite error budget. |

  • Only components A and B could burn the entire error budget, individually
  • For this to happen, their reliability must drop below:
    • 37.5% for Component A
    • 26.5% for Component B
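
The threshold calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of any product API: the function name `component_threshold` and the `weights` dictionary are hypothetical, and the sketch assumes that every other component stays at 100% reliability.

```python
def component_threshold(composite_target: float, normalized_weight: float) -> float:
    """Reliability (in %) at which one component alone burns the whole composite budget.

    Implements CT = 100% - CEB / CNW, assuming all other components stay at 100%.
    """
    composite_error_budget = 100.0 - composite_target  # CEB, in percentage points
    return 100.0 - composite_error_budget / normalized_weight


# Normalized weights from the 75% target example.
weights = {"A": 0.40, "B": 0.34, "C": 0.10, "D": 0.16}

for name, weight in weights.items():
    threshold = component_threshold(75.0, weight)
    # A negative threshold cannot be reached: reliability never drops below 0%.
    can_exhaust_alone = threshold >= 0
    print(f"Component {name}: CT = {threshold:.2f}%, can exhaust budget alone: {can_exhaust_alone}")
```

A component can exhaust the composite budget on its own only when its normalized weight is at least as large as the composite error budget, which is why the thresholds for components C and D come out negative.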

Conclusion: A 75% target is often too low for meaningful monitoring. "Heavier" components (like A and B) can burn the entire composite error budget individually, while "lighter" components (even those weighted as high as 16%) fall into a "blind spot" and can fail completely without exhausting the composite error budget.

Lower-bound targets

A composite SLO whose target lies within the range of its components' targets reflects component reliability more accurately, because the composite reliability then changes similarly to that of its components. In other words, the composite target is bound to the reliability of its components. Such a target must not be lower than the composite reliability reached when every component burns its error budget completely; a lower target can make the composite insensitive to component failures.

We illustrate this statement using the following component targets and normalized weights:

| Component | Target | Normalized weight |
|---|---|---|
| Component A | 99% | 40% |
| Component B | 97.5% | 34% |
| Component C | 99.99% | 10% |
| Component D | 99.9% | 16% |

Calculated composite lower-bound target:

(99% x 0.40) + (97.5% x 0.34) + (99.99% x 0.10) + (99.9% x 0.16) = 98.73%

This means that if every component burns exactly its own error budget, the composite's reliability will drop to 98.73%.
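
As a minimal sketch, the lower-bound target is the weighted mean of the component targets. The helper name `weighted_mean` and both dictionaries are illustrative assumptions; the values come from the table above.

```python
def weighted_mean(values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-component values; weights are assumed to sum to 1."""
    return sum(values[name] * weights[name] for name in weights)


# Component targets (in %) and normalized weights from the table above.
component_targets = {"A": 99.0, "B": 97.5, "C": 99.99, "D": 99.9}
normalized_weights = {"A": 0.40, "B": 0.34, "C": 0.10, "D": 0.16}

lower_bound_target = weighted_mean(component_targets, normalized_weights)
print(f"Lower-bound target: {lower_bound_target:.2f}%")
```

The same helper works for observed reliabilities as well as targets, since both are combined using the same normalized weights.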

Key points to consider:

  • Setting the composite target at or slightly above 98.73% ensures that the composite status correlates with the underlying component health.
  • Setting a target significantly below this value (e.g., 75%) reduces composite sensitivity.
  • If all components drop to the same reliability level simultaneously, the composite reliability will mirror that value exactly, regardless of weight distribution.

Midpoint targets

In real-world operations, it is rare for all components of a composite to exhaust their error budgets simultaneously.

More often, components burn their budgets at different times. If you set the composite target exactly at the lower-bound level (e.g., 98.73%), the composite can stay within its target even if one heavily weighted component is failing significantly, provided the others are performing perfectly.

For example, if only Component B exhausts its budget (dropping to 97.5%) while all other components remain at 100% reliability, the composite reliability would be 99.15%:

(100% x 0.40) + (97.5% x 0.34) + (100% x 0.10) + (100% x 0.16) = 99.15%

In this scenario, a lower-bound target of 98.73% would fail to report problems, even as the reliability of its significant component drops.
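
To check this for every component, the scenario can be sketched as follows. The function name `composite_with_single_failure` is hypothetical, and the sketch assumes that exactly one component drops to its own target while the rest stay at 100% reliability.

```python
# Component targets (in %) and normalized weights from the lower-bound example.
component_targets = {"A": 99.0, "B": 97.5, "C": 99.99, "D": 99.9}
normalized_weights = {"A": 0.40, "B": 0.34, "C": 0.10, "D": 0.16}
LOWER_BOUND_TARGET = 98.73  # composite lower-bound target from the text, in %


def composite_with_single_failure(failing: str) -> float:
    """Composite reliability when only `failing` has exhausted its error budget."""
    reliabilities = {name: 100.0 for name in normalized_weights}
    reliabilities[failing] = component_targets[failing]  # budget fully burned
    return sum(reliabilities[name] * normalized_weights[name] for name in normalized_weights)


for name in normalized_weights:
    composite = composite_with_single_failure(name)
    breached = composite < LOWER_BOUND_TARGET
    print(f"Only {name} exhausts its budget: composite = {composite:.3f}%, target breached: {breached}")
```

With these example values, no single component that merely exhausts its own budget pulls the composite below the lower-bound target.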

To increase sensitivity to these non-overlapping failures, a common practice is to set the final target midway between the lower-bound floor and 100% perfection. This triggers the composite to signal problems more quickly when several components degrade, even if they haven't fully exhausted their error budgets.

To calculate such a midpoint target, we'll use the following formula:

MT = (LBT + 100%)/2

where:

  • MT is the midpoint target
  • LBT is the lower-bound target

Applying this formula to the above example, we get:

(98.73% + 100%)/2 ≈ 99.37%

With this midpoint reliability target, the composite signals issues as soon as a significant portion of the overall system reliability begins to degrade.
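
The midpoint formula is small enough to sketch directly. The function name `midpoint_target` is an illustrative assumption; the input is the lower-bound target from the example above.

```python
def midpoint_target(lower_bound_target: float) -> float:
    """MT = (LBT + 100%) / 2, with all values expressed in percent."""
    return (lower_bound_target + 100.0) / 2


lower_bound = 98.73  # lower-bound target from the example, in %
midpoint = midpoint_target(lower_bound)

# Moving the target to the midpoint halves the composite error budget.
lower_bound_budget = 100.0 - lower_bound
midpoint_budget = 100.0 - midpoint

print(f"Midpoint target: {midpoint:.3f}%")
print(f"Error budget: {lower_bound_budget:.3f} -> {midpoint_budget:.3f} percentage points")
```

Halving the error budget is the trade-off behind the extra sensitivity: the composite signals sooner, but it also tolerates less degradation before the budget is gone.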

Key takeaways

| Target type | Description | Behavior | Purpose |
|---|---|---|---|
| Low | Uses a target value set at the lower end of the spectrum. Requires extreme values to trigger a signal. | Each component burns the error budget based on its normalized weight. Low-weight components have a nearly negligible impact. | Understand how individual components contribute to the overall error budget burn. |
| Lower-bound | The sum of the weighted components' targets. Represents a worst-case scenario. | Represents the lowest possible reliability level (a simultaneous drop in all components). | Defines the lowest acceptable reliability level. Signals when the entire system is at its minimum acceptable state. |
| Midpoint | Set midway between the lower-bound target and 100%. It's a balanced choice for catching issues before a total outage. | Reacts to non-overlapping reliability drops. Signals partial reliability failures. | Detect and signal partial reliability drops across different components. |