
Entire/Remaining budget would be exhausted


Ensuring that your SLOs meet their targets is one of the critical pillars of delivering a reliable and high-performing service to your users. But this doesn't mean there can be no issues or failures along the way.

The key is to be proactive and ready for any incident. A proactive approach involves setting a time buffer to mitigate potential issues and safeguard the budget from being completely depleted in the event of an incident. Nobl9 allows creating such alerting conditions, for example:

Alert me if my Entire / Remaining budget would be exhausted in 4h (and the condition lasts for 15m | based on the alerting window 15m)

4h in the example above is the time within which the budget would be exhausted, at which point the reliability of the SLO drops below its target.

Choose a value (4h in the example above) that is actionable and appropriate for the service it's set for. A 12h exhaustion time helps detect a slow degradation of service reliability, whereas 2h represents an urgent issue that must be fixed immediately.

Entire vs. remaining budget

Exhaustion conditions predict how long it will take to use up the error budget based on the current situation of an SLO.

Remaining budget variant

This variant lets you know if your service needs attention based on how much error budget you have left and how fast you are burning it. If your SLO has a positive error budget, and it looks like it will deplete it in two hours (or whatever condition you have set), Nobl9 will send an alert to notify you.

The more your remaining budget decreases, the more sensitive this condition becomes. When there is no remaining budget for an SLO, any level of burn is enough to activate the alert, provided the burn persists long enough to meet the lasts_for condition wherever that condition is used.

In the Reliability Burn Down charts displayed below, you can see the progress of three objectives within a single SLO:

  • california, with a large amount of remaining budget left.

  • frankfurt, with no remaining budget left.

  • ohio, with a small amount of remaining budget left.

If we used the Remaining budget would be exhausted condition for this SLO, the first objective (california) would require the highest burn rate to trigger an alert. The third objective (ohio) would exhaust faster, so the burn rate needed to trigger an alert based on the same condition would be lower than for the first objective. The second objective (frankfurt) would trigger the alert on any burn at all because there's no remaining budget left.

Image 1: Reliability Burn Down charts for the Remaining budget condition
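The relationship between remaining budget and sensitivity can be sketched numerically. The snippet below is a simplified model, not Nobl9's actual calculation: it assumes a burn rate of 1.0 consumes the whole budget in exactly one time window, and the 28-day window and the remaining-budget fractions are hypothetical values, not readings from the charts. The helper name `burn_rate_to_trigger` is invented for illustration.

```python
from datetime import timedelta


def burn_rate_to_trigger(remaining_fraction: float,
                         time_window: timedelta,
                         exhaustion_time: timedelta) -> float:
    """Smallest burn rate that uses up the remaining budget within exhaustion_time.

    Assumption: a burn rate of 1.0 consumes the entire budget in one time window.
    """
    if remaining_fraction <= 0:
        # No budget left: any positive burn already satisfies the condition.
        return 0.0
    # Dividing two timedeltas yields a plain float ratio.
    return remaining_fraction * (time_window / exhaustion_time)


window = timedelta(days=28)      # hypothetical SLO time window
threshold = timedelta(hours=4)   # "would be exhausted in 4h"

# Hypothetical remaining-budget fractions for the three objectives.
objectives = {"california": 0.90, "ohio": 0.10, "frankfurt": 0.0}

thresholds = {name: burn_rate_to_trigger(left, window, threshold)
              for name, left in objectives.items()}

for name, rate in thresholds.items():
    print(f"{name}: condition met once burn rate exceeds {rate:.1f}x")
```

Under these assumptions california needs a burn rate of roughly 151x, ohio roughly 17x, and frankfurt meets the condition on any positive burn, which matches the ordering described above.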

Entire budget variant

In this variant, the current value of the remaining budget doesn't affect the calculation logic. The exhaustion prediction is based on the entire error budget allocation and how fast you burn it. If an SLO reaches exhaustion within, for example, 4 hours, the budget would be exhausted from 100% to 0% in this time frame, resulting in reliability dropping below your set target. This variant is useful if you want to receive alerts for similar incidents regardless of the value of the remaining error budget.

The value used in this condition (timeToBurnEntireBudget) should be greater than or equal to the error budget allocation of the SLO that uses the condition. This is because the error budget allocation represents the time within which the budget would be exhausted in the most pessimistic scenario, where all events are bad (that is, every data point is below the threshold for raw metrics, or there are no good events for count metrics). It's not possible to burn the budget faster than that.
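To see why the allocation is the lower bound, here is a minimal sketch. It assumes the allocation equals the time window multiplied by the allowed error rate (1 minus the target) and that the burn rate is the observed error rate divided by the allowed error rate. Both are common SLO conventions rather than formulas confirmed by this page, and the function names and the 30-day, 99% example are hypothetical.

```python
from datetime import timedelta


def error_budget_allocation(time_window: timedelta, target: float) -> timedelta:
    """Total 'bad' time the target tolerates over one time window."""
    return time_window * (1 - target)


def fastest_exhaustion(time_window: timedelta, target: float) -> timedelta:
    """Time needed to burn a full budget when every event is bad."""
    # Worst case: observed error rate is 1.0, so the burn rate is 1 / allowed error rate.
    max_burn_rate = 1.0 / (1 - target)
    return time_window / max_burn_rate


window = timedelta(days=30)   # hypothetical time window
target = 0.99                 # hypothetical 99% target

allocation = error_budget_allocation(window, target)
worst_case = fastest_exhaustion(window, target)

print("allocation:        ", allocation)
print("fastest exhaustion:", worst_case)
```

The two values coincide (up to rounding), so an exhaustion time shorter than the allocation describes a burn that cannot occur in this model.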

The image below shows the total error budget for objectives. In this section, we see that error budget allocation for objectives is as follows:

  • ok: 7h 12m

  • slow: 3h 36m

  • poor: 43m 12s

If the value of the entire error budget exhaustion used in the alerting condition is, for example, four hours, the only two objectives that can possibly alert are poor and slow, because it's impossible to burn the budget of the ok objective in less than seven hours and 12 minutes.

Image 2: Reliability Burn Down charts for the Entire Budget condition
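The check described above can be expressed as a short sketch using the allocations listed for the three objectives. The function name `can_ever_alert` is hypothetical, and the comparison assumes that a condition value equal to the allocation is still reachable, in line with the "greater than or equal" guidance.

```python
from datetime import timedelta

# Error budget allocations taken from the example above.
allocations = {
    "ok": timedelta(hours=7, minutes=12),
    "slow": timedelta(hours=3, minutes=36),
    "poor": timedelta(minutes=43, seconds=12),
}


def can_ever_alert(allocation: timedelta, time_to_burn_entire_budget: timedelta) -> bool:
    """True if the fastest possible exhaustion fits within the condition's value."""
    return allocation <= time_to_burn_entire_budget


condition_value = timedelta(hours=4)
alertable = [name for name, allocation in allocations.items()
             if can_ever_alert(allocation, condition_value)]

print(alertable)  # ['slow', 'poor']
```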
Keep in mind
  • During exhaustion, the remaining budget decreases over time.
  • During recovery (only for rolling time windows), the remaining budget increases over time and caps at 100%.
  • Using the total error budget allocation as the value in this condition will help to catch the most pessimistic burn possible.

Remember that such a configuration is specific to the objective, so it may be less suitable if you want to reuse the alert policy across different SLO configurations.

What's budget exhaustion?

Exhaustion refers to the gradual depletion of the error budget over time.

An error budget is exhausted when it has no remaining budget. If there is a remaining error budget, then the budget is not exhausted.

Any positive amount of burn rate means that the budget is currently being exhausted (the exhaustion process is happening).
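These definitions can be captured in a small helper. This is only an illustrative sketch: the function name and the state labels, including "stable" for a budget that is not being consumed, are hypothetical and not Nobl9 terminology.

```python
def exhaustion_state(remaining_budget: float, burn_rate: float) -> str:
    """Classify a budget using the definitions above.

    remaining_budget: fraction of the error budget left (1.0 means 100%).
    burn_rate: current burn rate; any positive value means budget is being consumed.
    """
    if remaining_budget <= 0:
        return "exhausted"    # no remaining budget
    if burn_rate > 0:
        return "exhausting"   # budget remains, but it is being consumed
    return "stable"           # hypothetical label: budget remains and nothing is burning


print(exhaustion_state(0.0, 2.0))   # exhausted
print(exhaustion_state(0.4, 0.3))   # exhausting
print(exhaustion_state(0.4, 0.0))   # stable
```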

Key takeaways

Not every exhaustion is harmful.
When your error budget is being exhausted very slowly, it doesn't mean you will burn all of it.
Exhaustion of the error budget is natural.
As long as it doesn't lead to burning the entire error budget, it should not be considered a failure to deliver reliable services.