Reliability Roll-up report

The Reliability Roll-up report lets you define the significance and relevance of your SLOs within the context of your system. It aggregates reliability measures from multiple SLOs so you can tailor the data to your needs. You can choose from predefined filters or design a custom data structure to examine your system's reliability in more depth.

The report reduces the complexity of reliability to a single, easy-to-understand metric: the reliability score, which assesses your system's overall health. In the Reliability Roll-up report, each SLO is assigned a reliability score, calculated from its performance against its target objectives and time windows. These individual scores are then consolidated within the report structure, resulting in an aggregated reliability score at each level of that structure.

With the Reliability Roll-up report, you can quickly assess the overall reliability of services in your organization.

The Reliability Roll-up report is useful for:

Drawing a high-level overview of organization-wide reliability for a specific period
Measuring reliability tailored to your system and needs
Making informed decisions quicker based on reliability
Driving organization-wide adoption of SLOs based on easy-to-digest data

Creating a reliability roll-up report

Step 1: Name the report and choose its type

For detailed instructions on Step 1 of the Report Wizard, check the main Reports documentation.

Step 2: Create reliability score layers

You can create an auto-generated or custom-made structure for your report to organize your Nobl9 resources (services, projects, and SLOs) and create layers for the reliability score.

You can change the type of structure and layers of your existing Reliability Roll-up reports.

Auto-generated structure

Using an auto-generated structure, you can choose filters to create reliability score layers that mirror your organization’s project-service-SLO dependencies. Check the main Reports documentation to learn more.

replay source config — Image 2: Step 2 – Create auto-generated structure

Custom structure

Using a custom structure, you can adapt the Reliability Roll-up report to your requirements, creating custom layers for the overall reliability score. Using this option, you can add single resources, folders, or subfolders that contain your resources:

Folders are useful for creating a custom structure for your Nobl9 resources (services, projects, and SLOs). Here are several things to know:

Folders can contain individual resources and other folders
Folders and child folders create layers that aggregate the reliability scores of the resources they contain
The reliability score of a parent folder is calculated as the average of the reliability scores of all the resources and child folders it contains
For child folders, the maximum level of nesting is 8
Projects and Services that don’t contain SLOs won’t affect the reliability score calculations. Effectively, a folder/child folder that contains only “empty” projects and services will display an N/A value in the reliability drill-down section
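
The folder rules above can be sketched in a few lines of Python. This is only an illustration, not Nobl9's implementation: the `Resource` and `Folder` classes, the `folder_score` function, and all resource names are hypothetical, and a score of `None` stands in for the N/A value shown for resources without SLOs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean
from typing import Optional, Union


@dataclass
class Resource:
    """A single resource; score is None when it has no SLOs (shown as N/A)."""
    name: str
    score: Optional[float] = None


@dataclass
class Folder:
    """A folder that can contain resources and other folders."""
    name: str
    children: list[Union[Folder, Resource]] = field(default_factory=list)


def folder_score(node: Union[Folder, Resource]) -> Optional[float]:
    """Average the scores of direct children, skipping N/A children."""
    if isinstance(node, Resource):
        return node.score
    scores = [s for s in map(folder_score, node.children) if s is not None]
    # A folder holding only "empty" resources has nothing to average: N/A.
    return mean(scores) if scores else None


payments = Folder("Payments", [Resource("checkout-slo", 99.0),
                               Resource("refund-slo", 97.0)])
empty = Folder("Legacy", [Resource("unused-project")])
root = Folder("Platform", [payments, empty, Resource("login-slo", 92.0)])

print(folder_score(payments))  # average of 99.0 and 97.0
print(folder_score(empty))     # None, displayed as N/A
print(folder_score(root))      # average of Payments and login-slo only
```

Note that the parent averages its direct children, so a child folder counts once regardless of how many resources it holds.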

tip

Once you've added resources and folders in this step, you can easily rearrange them. To do so, hover over the six dots next to a folder or resource tile and click the up- or down-pointing arrow.

You can also rename your folders: click a folder's display name and edit it. The name can be up to 63 characters long and can contain diacritics and special characters.

Reliability Roll-up report and RBAC

You can make your report available to others by sharing it. The Reliability Roll-up report will be visible to all users with access to your report's SLOs and projects.

For a custom structure, if you don't include any SLOs (or projects) and share the report, everyone will be able to access it.

This will happen even if those folders originated from your existing projects or services. When empty, such folders lose their RBAC properties and become standalone entities. Once you've added SLOs to this report, it will disappear for users who don’t have access to them via their RBAC permissions.

Step 3: Select time range

Check the main Reports documentation for details on this step.
Currently, all time ranges in the Reliability Roll-up report are calculated in the UTC time zone.

tip

You can edit the time range of your existing Reliability Roll-up report.

To do that, go to the Reports list and click the pencil icon next to the report that you’d like to change. Then, go to Step 3 of the Report wizard.

Report overview

Reliability report — Image 5: Overview of the Reliability Roll-up report (auto-generated structure)

Reliability score: The score calculated for the main parent folder (see the section below for details about the reliability score calculations).
note
If there's data for both the current and the previous report range, Nobl9 displays the calculated trend below the reliability score. The trend is hidden when either range has no data.
Image 6: Trend values for reliability score. From left to right: positive, negative, no data.
The least reliable: This tile displays up to 10 folders with the lowest reliability scores. Folders are sorted according to the reliability score value.
note
The reliability score value is always displayed for each folder within the report's structure. If a parent folder contains only one child folder, their reliability scores will be the same, so both folders will be on the Top 10 list with the same value.
The most reliable: This tile displays up to 10 folders with the highest reliability scores. Folders are sorted according to the reliability score value.
Total SLOs: The total number of SLOs in the report.
note
This value might change for an auto-generated structure. For example, if you select a project as the only filter and a new SLO appears within that project, the report will be updated, and the total SLO count will change.
Total folders: The total number of folders in the report. Note that the main parent folder is not counted here; it’s a virtual folder that displays the overall calculated reliability score.
note
This value might change in an auto-generated structure if you add new services to your projects.
SLOs within budget: Expressed as a value (the number of SLOs) and as a percentage of the total SLOs.
SLOs over budget: Expressed as a value (the number of SLOs) and as a percentage of the total SLOs. An SLO is considered over budget if its error budget is depleted for even a minute within the reporting time range. Any SLO with a reliability score of less than 100% is deemed over budget.
Report drill-down: Allows you to examine your reliability score data and the structure more closely.
Reliability score & trend - If an SLO includes only one objective, we show this value once for the objective and the entire SLO. If an SLO contains more than one objective, we show these values for each objective and separately for the entire SLO (for the SLO, it’s the average calculated from objectives).

What is the reliability score?

The reliability score measures your system’s health based on how often your SLOs meet their targets. If an SLO consistently meets its target and never exceeds its error budget, the score will be 100%. If an SLO falls below its target for 10% of the measured period, the score will be 90%.

Reliability score calculations

The method for calculating the reliability score varies based on the type of the time window associated with a service level objective (SLO).

For SLOs using rolling time windows, where data points are consistently added and dropped as the window moves forward, the reliability score is computed by considering every data point's adherence to the SLO target and calculating a daily target adherence percentage. See section below for details.

In the case of calendar-aligned SLOs, the primary focus is on how the SLO adheres to its target at the end of its calendar-aligned windows, calculating the score based on the final measurements. This approach ensures that the reliability score accurately reflects the health of such SLOs. See section below for details.

SLOs with rolling time windows

For the rolling-type time windows, the reliability score is calculated as the ratio of values within budget to the sum of values within budget and the values that exceeded budget. Nobl9 uses the metric for the Remaining error budget and categorizes returned data points as:

within budget if the remaining budget is greater than or equal to 0
over budget if the remaining budget is less than 0

The counts for each SLO’s objectives above and below the error budget are aggregated daily. Effectively, the reliability score for an objective in the reporting time window is an average daily result.
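
As a rough sketch of this calculation, the Python below buckets remaining-budget data points by day, computes the within-budget ratio for each day, and averages the daily results. The function name, the `(day, remaining_budget)` input shape, and the sample values are assumptions made for illustration; the exact aggregation Nobl9 performs may differ.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Iterable, Optional


def rolling_objective_score(
    points: Iterable[tuple[date, float]],
) -> Optional[float]:
    """Return the average daily within-budget percentage for one objective.

    points: (day, remaining_budget) pairs (hypothetical input shape).
    """
    within = defaultdict(int)  # data points with remaining budget >= 0
    total = defaultdict(int)   # all data points seen on that day
    for day, remaining_budget in points:
        total[day] += 1
        if remaining_budget >= 0:
            within[day] += 1
    if not total:
        return None  # no data: the score would show as N/A
    daily = [100.0 * within[day] / total[day] for day in total]
    return mean(daily)


sample = [
    # Day 1: all four points within budget, daily result 100%
    (date(2024, 7, 1), 0.40), (date(2024, 7, 1), 0.30),
    (date(2024, 7, 1), 0.10), (date(2024, 7, 1), 0.00),
    # Day 2: two of four points over budget, daily result 50%
    (date(2024, 7, 2), 0.05), (date(2024, 7, 2), 0.01),
    (date(2024, 7, 2), -0.02), (date(2024, 7, 2), -0.10),
]
print(rolling_objective_score(sample))  # average of 100% and 50%
```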

Example 1: Burn down chart and reliability score

The burn-down chart below belongs to an SLO with a rolling time window and two objectives, a and b.

The reliability scores for the displayed time range are derived from the values in this chart:

burn down rolling — Image 8: Burn-down chart for an SLO with a rolling time window

The reliability score for objective a is 0%, since the objective was consistently below the target throughout the reporting time range. The other objective, b, has a reliability score of 48.73%. The total reliability score for this SLO is 24.36%, which is the average of its objectives' scores: (48.73% + 0%) / 2 = 24.365%.

SLOs with calendar-aligned time windows

For objectives of SLOs with calendar-aligned time windows, Nobl9 calculates the reliability score at the end of the day and at the end of the calendar-aligned time window. The reliability score is based on the last calculated data point from the burn-down chart (called good-to-total-ratio), compared with the target. The following logic applies:

If the value for good-to-total-ratio is greater than or equal to the target, the reliability score equals 100%
If the value for good-to-total-ratio is less than the target, then the reliability score is less than 100% and equals good_total_ratio/target
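
The two rules can be written as a small Python function. This sketch assumes that the second rule applies whenever the ratio is below the target, and that both the ratio and the target are given as fractions between 0 and 1; the function name and the sample values are hypothetical.

```python
def calendar_aligned_score(good_total_ratio: float, target: float) -> float:
    """Return the reliability score (in percent) for one end-of-window point.

    Assumes good_total_ratio and target are fractions, e.g. 0.99 for 99%.
    """
    if not 0 < target <= 1:
        raise ValueError("target must be a fraction in (0, 1]")
    if good_total_ratio >= target:
        return 100.0  # target met: full score
    # Target missed: the score is the achieved share of the target.
    return 100.0 * good_total_ratio / target


print(calendar_aligned_score(0.995, 0.99))   # target met
print(calendar_aligned_score(0.9306, 0.99))  # roughly 94% of the target
```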

For calendar-aligned objectives, Nobl9 uses the final data points of the SLO time windows completed within the reporting time range of the Reliability Roll-up report. Nobl9 also includes the daily reliability score from the end of the reporting time range if that day isn't already among the final data points of completed SLO time windows.

The reliability score is the average of those results. For example:

The reliability score for the time window that ended during the reporting time range is 94%, and the daily reliability score at the end of the reporting time range is 100%. The reliability score of this objective would be 97%:

calculation for the reliability score — Image 9: Calculation for the reliability score
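
The same averaging can be sketched in Python. The function below collects the scores from completed time windows, adds the daily score from the end of the reporting range only if that day isn't already covered, and averages the results. The function name, the dictionary keyed by end date, and the dates themselves are illustrative assumptions, not part of the product.

```python
from __future__ import annotations

from datetime import date
from statistics import mean


def calendar_objective_score(
    window_end_scores: dict[date, float],
    report_end: date,
    report_end_daily_score: float,
) -> float:
    """Average window-end scores plus the report-end daily score.

    window_end_scores maps the end date of each completed SLO time window
    (within the reporting range) to its reliability score in percent.
    """
    scores = dict(window_end_scores)
    # Add the report-end day only if no completed window already ends on it.
    scores.setdefault(report_end, report_end_daily_score)
    return mean(scores.values())


# One window ended inside the range with 94%; the report-end day scored 100%.
result = calendar_objective_score(
    {date(2024, 7, 31): 94.0},
    report_end=date(2024, 8, 10),
    report_end_daily_score=100.0,
)
print(result)
```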

Example 2: Burn down chart and reliability score

The following image shows a burn-down chart for an SLO with a calendar time window with one objective:

RS burn down for a calendar-aligned slo — Image 10: Burn down chart for an SLO with a calendar time window

Based on the value marked as a red dot in the burn down chart (the last value in the time range), the reliability score for the SLO is as follows:

RS for calendar-aligned report — Image 11: Reliability score and trend for the calendar-aligned SLO

Aggregation of reliability score values

The aggregated reliability score at each level is the average of the reliability scores of its child level.

note

Mathematically, the score layer structure is a calculation formula where SLOs are the variables to calculate, and these are grouped into score layers for calculation purposes. The calculation starts at the lowest layer.

At each subsequent higher layer, the score is derived by averaging the values of all the immediate underlying layers. This recursive calculation continues upwards through the layers.

For example, the hierarchy contains a folder called Producers. Inside it are two folders: Data Intake, with a reliability score of 95.5%, and Data Processor, with a reliability score of 90%. The reliability score for the parent folder (Producers) will be 92.75%:

reliability score folders — Image 12: Calculations for Reliability Roll-up report folders
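
Written out as a worked equation (RS stands for reliability score; the notation is ours), the Producers example is:

```latex
\[
\mathrm{RS}_{\text{Producers}}
  = \frac{\mathrm{RS}_{\text{Data Intake}} + \mathrm{RS}_{\text{Data Processor}}}{2}
  = \frac{95.5\% + 90\%}{2}
  = 92.75\%
\]
```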

What does it mean that an SLO’s objective is over budget?

Rolling time windows

An objective is over budget if its remaining error budget drops below 0 for at least one daily calculation point.

Calendar-aligned time windows

An objective is over budget if its reliability score for at least one ‘end of the window’ value (the end of the reporting time range or the end of any SLO time window within the reporting time range) is not 100%.

What does it mean that an SLO is `over budget`?

At least one of the SLO’s objectives is over budget in the specified time range.

What does it mean that an SLO is `within budget`?

All of the SLO’s objectives are within budget in the specified time range.
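
To illustrate how these definitions could feed the within-budget and over-budget counts in the report overview, here is a minimal Python sketch. The `summarize_budget_status` function, its input shape (one over-budget flag per objective), and the SLO names are hypothetical.

```python
from __future__ import annotations


def summarize_budget_status(slos: dict[str, list[bool]]) -> dict[str, float]:
    """Count SLOs within and over budget.

    slos maps an SLO name to one flag per objective:
    True means that objective was over budget in the time range.
    """
    total = len(slos)
    # An SLO is over budget if at least one of its objectives is.
    over = sum(1 for flags in slos.values() if any(flags))
    within = total - over

    def percent(count: int) -> float:
        return round(100.0 * count / total, 2) if total else 0.0

    return {
        "within": within,
        "within_pct": percent(within),
        "over": over,
        "over_pct": percent(over),
    }


summary = summarize_budget_status({
    "checkout-latency": [False, False],
    "checkout-errors": [False],
    "login-latency": [False, True],  # one objective over budget
    "search-availability": [False],
})
print(summary)
```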

Reliability score calculations and Replay

If you run Replay for any SLO included in your Reliability Roll-up report, once the process for reimporting historical data has been completed, Nobl9 will recalculate and update the reliability score in the background.

Other notes

All time ranges in the Reliability Roll-up report are calculated in the UTC time zone
Composite SLOs aren’t added as separate objectives in the Reliability Roll-up report

Troubleshooting

I can’t change the time zone in the Reliability Roll-up report

Currently, the reliability score is calculated in the UTC time zone.

My report displays the `N/A` value for the reliability score

If you’ve created the Reliability Roll-up report and see the N/A value for the reliability score, wait 24 hours for the reliability data to populate.

tip

The reliability score is calculated daily for every SLO. When SLOs collect data within the time range of a report, the reliability score becomes visible shortly after that report is generated. If an SLO is new or didn't gather data during the reporting period, the reliability score will be displayed as N/A.

For example, if an SLO is created in August and a report is created for July, the reliability score will show N/A. In such cases, we suggest running Replay to populate your SLO with historical data.

I can’t see services/projects in my report

In the auto-generated structure, if you add a service/project to the Reliability Roll-up report and these projects or services don’t contain an SLO, they won’t appear in the Reliability Roll-up report.

In the custom structure, the Reliability Roll-up report displays empty projects and services since they’re treated as separate folders (we assume that users may want to add empty folders).

Those empty projects and services aren't included in the reliability score calculations.
