Reliability Roll-up report
The Reliability Roll-up report lets you define the significance and relevance of your SLOs within the context of your system. It aggregates reliability measures from multiple SLOs so that the data fits your needs. You can choose from predefined filters or design a custom data structure to get a deeper understanding of your system's reliability.
The report condenses the complexity of reliability into a single, easy-to-understand metric, the reliability score, which assesses your system's overall health. In the Reliability Roll-up report, each SLO is assigned a reliability score, which is calculated based on its performance against target objectives and time windows. These individual scores are then consolidated within the report structure, resulting in an aggregated reliability score at each level of the report's structure.
With the Reliability Roll-up report, you can quickly assess the overall reliability of services in your organization.
The Reliability Roll-up report is useful for:
- Drawing a high-level overview of organization-wide reliability for a specific period
- Measuring reliability tailored to your system and needs
- Making informed decisions quicker based on reliability
- Driving organization-wide adoption of SLOs based on easy-to-digest data
Creating a reliability roll-up report
Step 1: Name the report and choose its type
For detailed instructions on Step 1 of the Report Wizard, check the main Reports documentation.
Step 2: Create reliability score layers
You can create an auto-generated or custom-made structure for your report to organize your Nobl9 resources (services, projects, and SLOs) and create layers for the reliability score.
You can change the type of structure and layers of your existing Reliability Roll-up reports.
Auto-generated structure
Using an auto-generated structure, you can choose filters to create reliability score layers that mirror your organization's project-service-SLO dependencies.
Check the main Reports documentation to learn more.
Custom structure
Using a custom structure, you can adapt the Reliability Roll-up report to your requirements, creating custom layers for the overall reliability score. Using this option, you can add single resources, folders, or subfolders that contain your resources.
Folders are useful for creating a custom structure for your Nobl9 resources (services, projects, and SLOs). Here are several things to know:
- Folders can contain individual resources and other folders
- Folders and child folders create layers that aggregate the reliability scores of the resources they contain
- The reliability score of a parent folder is calculated as the average of the reliability scores of all the resources and child folders it contains
- For child folders, the maximum level of nesting is 8
- Projects and services that don't contain SLOs won't affect the reliability score calculations. Effectively, a folder or child folder that contains only "empty" projects and services will display an N/A value in the reliability drill-down section
Once you've added resources and folders in this step, you can easily rearrange them. To do this, hover over the six dots next to each folder or resource tile and click the up- or down-pointing arrows.
You can also rename your folders: click a folder's display name and edit it. The name can be up to 63 characters long and can contain diacritics and special characters.
Reliability Roll-up report and RBAC
You can make your report available to others by sharing it. The Reliability Roll-up report will be visible to all users with access to your report's SLOs and projects.
For a custom structure, if the report contains only folders without any SLOs or projects and you share it, everyone will be able to access it.
This happens even if those folders originated from your existing projects or services. When empty, such folders lose their RBAC properties and become standalone entities. Once you add SLOs to the report, it is no longer visible to users whose RBAC permissions don't grant them access to those SLOs.
Step 3: Select time range
- Check the main Reports documentation for details on this step
- Currently, all time ranges in the Reliability Roll-up report are calculated in the UTC time zone
You can edit the time range of your existing Reliability Roll-up report.
To do that, go to the Reports list and click the pencil icon next to the report that you’d like to change. Then, go to Step 3 of the Report wizard.
Report overview
What is the reliability score?
The reliability score measures your system's health based on how often your SLOs meet their targets. If an SLO consistently meets its target and never exceeds its error budget, the score will be 100%. If an SLO falls below its target for 10% of the measured period, the score will be 90%.
Reliability score calculations
The method for calculating the reliability score varies based on the type of the time window associated with a service level objective (SLO).
For SLOs using rolling time windows, where data points are consistently added and dropped as the window moves forward, the reliability score is computed by considering every data point's adherence to the SLO target and calculating a daily target adherence percentage. See section below for details.
In the case of calendar-aligned SLOs, the primary focus is on how the SLO adheres to its target at the end of its calendar-aligned windows, calculating the score based on the final measurements. This approach ensures that the reliability score accurately reflects the health of such SLOs. See section below for details.
SLOs with rolling time windows
For rolling time windows, the reliability score is calculated as the ratio of values within budget to the sum of values within budget and values that exceeded the budget. Nobl9 uses the metric for the remaining error budget and categorizes returned data points as:
- within budget if the remaining budget is greater than or equal to 0
- over budget if the remaining budget is less than 0
The counts for each SLO's objectives above and below the error budget are aggregated daily.
Effectively, the reliability score for an objective in the reporting time window is an average daily result.
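The rolling-window calculation described above can be sketched in Python as follows. This is only an illustration, not Nobl9's implementation: the function names, the sample values, and the choice to skip days without data are assumptions made for the example.

```python
from statistics import mean

def daily_score(remaining_budgets):
    """Share of one day's data points that are within budget (remaining budget >= 0)."""
    if not remaining_budgets:
        return None  # assumption: a day without data points is skipped
    within = sum(1 for value in remaining_budgets if value >= 0)
    over = len(remaining_budgets) - within
    return within / (within + over)

def rolling_objective_score(days):
    """Average the daily ratios over the reporting time range."""
    scores = [s for s in (daily_score(day) for day in days) if s is not None]
    return mean(scores) if scores else None

# Hypothetical remaining-budget samples for three days of one objective.
days = [
    [0.20, 0.10, 0.05, 0.00],      # all four points within budget -> 1.0
    [0.02, -0.01, -0.03, -0.04],   # one of four points within budget -> 0.25
    [-0.05, -0.06, -0.02, -0.01],  # no point within budget -> 0.0
]
print(f"{rolling_objective_score(days):.2%}")  # 41.67%
```

In this sketch, a data point with a remaining budget of exactly 0 counts as within budget, matching the categorization above.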
Example 1: Burn down chart and reliability score
The following image shows a burn-down chart for an SLO with a rolling time window with two objectives, a and b:
Based on these values, the reliability score for the displayed time range will be as follows:
As we can see, the reliability score for objective a is 0%, since the objective was consistently below the target throughout the reporting time range.
We can also see that objective b has a reliability score of 48.73%.
The total reliability score for this SLO is 24.36%, which is the average score of this SLO's objectives ((48.73% + 0%) / 2 = 24.365%).
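The arithmetic in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and it assumes the 48.73% score belongs to objective b.

```python
def slo_score(objective_scores):
    """Average the reliability scores (in percent) of an SLO's objectives."""
    return sum(objective_scores.values()) / len(objective_scores)

# Scores from Example 1; assigning 48.73% to objective b is an assumption.
example = {"a": 0.0, "b": 48.73}
print(f"{slo_score(example):.3f}%")  # 24.365%
```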
SLOs with calendar-aligned time windows
For objectives of SLOs with calendar-aligned time windows, Nobl9 calculates the reliability score at the end of the day and at the end of the calendar-aligned time window for all objectives. The reliability score is calculated by dividing the last value of the calculated data point from the burn-down chart (called good-to-total-ratio) by the target. The following logic applies:
- If the value for good-to-total-ratio is greater than or equal to the target, the reliability score equals 100%
- If the value for good-to-total-ratio is less than the target, the reliability score is less than 100% and equals good-to-total-ratio / target
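The two rules above can be expressed as a small Python function. This sketch reads the second rule as applying whenever the ratio is below the target, and it assumes both values are fractions between 0 and 1. The function name and sample values are illustrative.

```python
def calendar_score(good_total_ratio, target):
    """Reliability score (as a fraction) from the last good-to-total ratio."""
    if not 0 < target <= 1:
        raise ValueError("target must be a fraction in (0, 1]")
    if good_total_ratio >= target:
        return 1.0  # target met: the score is capped at 100%
    return good_total_ratio / target  # target missed: proportional score

print(calendar_score(0.999, 0.99))           # 1.0
print(round(calendar_score(0.90, 0.95), 4))  # 0.9474
```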
For calendar-aligned SLO objectives, Nobl9 uses the final data points of the SLO time windows that were completed within the reporting time range of the Reliability Roll-up report.
Nobl9 also includes the daily reliability score from the end of the reporting time window if that day isn't already among the final data points of completed SLO time windows.
The reliability score consistently averages those results, for example:
- The reliability score for the time window that ended during the reporting time window is 94%, and the daily reliability score at the end of the reporting time window is 100%. The reliability score of this objective would be 97%.
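The averaging in this example can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes each completed window and the end-of-report day carry equal weight.

```python
def averaged_score(window_end_scores, reporting_end_score=None):
    """Average the scores of completed SLO time windows and, if provided,
    the daily score from the end of the reporting time window."""
    scores = list(window_end_scores)
    if reporting_end_score is not None:
        scores.append(reporting_end_score)
    return sum(scores) / len(scores) if scores else None

# One window ended with 94%; the daily score at the end of the report is 100%.
print(averaged_score([94.0], reporting_end_score=100.0))  # 97.0
```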
Example 2: Burn down chart and reliability score
The following image shows a burn-down chart for an SLO with a calendar time window with one objective:
Based on the value marked as a red dot in the burn down chart (the last value in the time range), the reliability score for the SLO is as follows:
Aggregation of reliability score values
The aggregate reliability score for an aggregation level is the average of the reliability score values from its child level.
Mathematically, the score layer structure is a calculation formula where SLOs are the variables to calculate, and these are grouped into score layers for calculation purposes. The calculation starts at the lowest layer.
At each subsequent higher layer, the score is derived by averaging the values of all the immediately underlying layers. This recursive calculation continues upwards through the layers.
For example:
- In the hierarchy, there is a folder called Producers. Inside are folders named Data Intake with a reliability score of 95.5% and Data Processor with a reliability score of 90%. The reliability score for the parent folder (Producers) will be 92.75%.
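The recursive averaging can be illustrated with a small tree structure in Python. This is a sketch under stated assumptions: the `Node` class, the individual SLO scores, and the empty folder are invented for the example, and nodes without data are skipped in line with the note that empty projects and services don't affect the calculation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    score: Optional[float] = None  # set for SLOs; None means no data (N/A)
    children: List["Node"] = field(default_factory=list)

def aggregate(node):
    """Return the node's score: its own for a leaf, else the average of its
    children's aggregated scores, ignoring children without data."""
    if not node.children:
        return node.score
    child_scores = [s for s in (aggregate(c) for c in node.children) if s is not None]
    if not child_scores:
        return None  # only empty children: shown as N/A
    return sum(child_scores) / len(child_scores)

# Hypothetical SLO scores chosen so the folders match the example above.
producers = Node("Producers", children=[
    Node("Data Intake", children=[Node("slo-1", 96.0), Node("slo-2", 95.0)]),
    Node("Data Processor", children=[Node("slo-3", 90.0)]),
    Node("Empty project"),  # no SLOs, so it does not affect the result
])
print(aggregate(producers))  # 92.75
```

Because each layer averages its immediate children, a folder with many SLOs and a folder with a single SLO contribute equally to their parent in this sketch.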
What does it mean that an SLO’s objective is over budget?
Rolling time windows
An SLO's objective is over budget if its daily value for the remaining budget drops below its target for at least one calculation point.
Calendar-aligned time windows
An SLO's objective is over budget if its reliability score for at least one "end of the window" value (the end of the reporting time range or the end of any SLO time window within the reporting time range) is not 100%.
What does it mean that an SLO is over budget?
At least one of the SLO's objectives is over budget in the specified time range.
What does it mean that an SLO is within budget?
All of the SLO's objectives are within budget in the specified time range.
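The relationship between objective-level and SLO-level status can be summarized in a short Python sketch. The function name and the input shape (a mapping from objective name to an over-budget flag) are assumptions made for illustration.

```python
def slo_status(objective_over_budget):
    """An SLO is over budget if any of its objectives is over budget;
    otherwise all objectives, and therefore the SLO, are within budget."""
    return "over budget" if any(objective_over_budget.values()) else "within budget"

print(slo_status({"a": True, "b": False}))   # over budget
print(slo_status({"a": False, "b": False}))  # within budget
```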
Reliability score calculations and Replay
If you run Replay for any SLO included in your Reliability Roll-up report, once the process for reimporting historical data has been completed, Nobl9 will recalculate and update the reliability score in the background.
Other notes
- All time ranges in the Reliability Roll-up report are calculated in the UTC time zone
- Composite SLOs aren't added as separate objectives in the Reliability Roll-up report
Troubleshooting
I can’t change the time zone in the Reliability Roll-up report
Currently, the reliability score is calculated in the UTC time zone.
My report displays the N/A value for the reliability score
If you've created the Reliability Roll-up report and see the N/A value for the reliability score, wait 24 hours for the reliability data to populate.
The reliability score is calculated daily for every SLO. When SLOs collect data within the time range of a report, the reliability score becomes visible shortly after generating that report. If an SLO is newly created or didn't gather data during the reporting period, the reliability score will be displayed as N/A.
For example, if an SLO is created in August and a report is created for July, the reliability score will show N/A. We suggest running Replay to populate your SLO with historical data in such cases.
I can’t see services/projects in my report
In the auto-generated structure, if you add a service/project to the Reliability Roll-up report and these projects or services don’t contain an SLO, they won’t appear in the Reliability Roll-up report.
In the custom structure, the Reliability Roll-up report displays empty projects and services since they’re treated as separate folders (we assume that users may want to add empty folders).
Those empty services and folders are not included in the reliability score calculations.