Service Health Dashboard
Service Health Dashboard (or dashboard) summarizes the reliability of services in your organization. The dashboard targets product managers or executives who do not require a granular view of each SLO and instead are looking for a holistic view of reliability within their organization. Engineers or SREs can also use it to see a snapshot of the current state of their environment and drill down for more information.
The dashboard gives an aggregated view of the overall organizational health. It groups the services into three color-coded categories according to the settings, and uses the project-service-SLO hierarchy.
Service Health Dashboard features two views for more comprehensive analysis:
- By error budget: services are grouped by the severity of error budget exhaustion calculated over the current time window.
- By burn rate: services are grouped by their burn rate value calculated over the latest period, set by the time window.
You can select the way the services are displayed on the dashboard. Choose the required option in the View list under the dashboard header. The following options are available:
- Circles
- Hexagons
- Circles with icons
Monitoring health of services
The color-coded categories reflect the available error budget of the SLOs or their burn rate, depending on the dashboard type.
The SLOs a given service includes determine the category this service falls into.
- Green circle: healthy services. The SLOs these services contain have enough error budget or a low burn rate.
- Orange circle: services trending in the wrong direction. These services' SLOs have less error budget or a medium burn rate.
- Red circle: problematic services. Services in this category include SLOs with the least error budget or the highest burn rate.
- Grey circle: services with no data available for the selected time window.
If at least one SLO in a given service meets the category's criteria,
this service is located under the matching category on the dashboard.
Services are displayed on the dashboard under the projects
they relate to.
When a project contains services that fall into different categories,
you see this project under all matching categories.
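The rule that a project appears under every category its services fall into can be sketched as below. This is a minimal illustration, not Nobl9's implementation: the function name `projects_by_category` and the input shape (a mapping of project and service names to an already-determined category) are assumptions made for the example.

```python
from collections import defaultdict


def projects_by_category(service_categories):
    """Map each category to the projects shown under it.

    service_categories: dict of (project, service) -> category name.
    A project is listed under every category that at least one of its
    services falls into.
    """
    grouped = defaultdict(set)
    for (project, _service), category in service_categories.items():
        grouped[category].add(project)
    # Sort project names so the result is deterministic.
    return {category: sorted(projects) for category, projects in grouped.items()}


# Hypothetical data: "checkout" has one healthy and one problematic service.
services = {
    ("checkout", "api"): "green",
    ("checkout", "db"): "red",
    ("search", "indexer"): "green",
}
print(projects_by_category(services))
```

Here the hypothetical `checkout` project is listed under both `green` and `red`, while `search` is listed only under `green`.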
A service can fall into the No data category when any of its SLOs has combined query interval and query delay values greater than or equal to the dashboard's time window. This makes the time window too short for categorization because Nobl9 lacks time to accumulate data. Learn more about data source and SLO troubleshooting.
To narrow down the number of No data services, try increasing the offset.
Service health by error budget
Service health by error budget shows the health of your services from the perspective of their error budget.
The groupings are based on the error budget exhaustion severity and are as follows:
- Healthy: All SLOs in this service have more than 20% of the error budget still available.
- At risk: All SLOs in this service still have an available error budget, and at least one SLO for this service has less than 20% of the error budget left. For example, Service X is At risk because SLO C and SLO D are under 20%:
Service X | Remaining error budget |
---|---|
SLO A | 84% |
SLO B | 45% |
SLO C | 9% |
SLO D | 19% |
SLO E | 95% |
- Exhausted: At least one of the SLOs in this service has burnt its error budget in the current time window.
For example, Service Y is Exhausted because SLO B has already burnt its error budget in the specified time window:
Service Y | Remaining error budget |
---|---|
SLO A | 84% |
SLO B | -55% |
SLO C | 9% |
SLO D | 19% |
SLO E | 95% |
- No data: There is no data available for the service’s SLOs, or the error budget hasn’t been calculated yet.
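The grouping rules above can be expressed as a small function. This is a sketch under stated assumptions rather than the dashboard's actual logic: the function name and input shape are hypothetical, an SLO with exactly 0% remaining is treated as exhausted, and an SLO with exactly 20% remaining is treated as healthy, since the rules above do not define either boundary.

```python
def classify_by_error_budget(remaining, at_risk_threshold=0.20):
    """Classify a service from its SLOs' remaining error budgets.

    remaining: list of fractions, e.g. 0.84 for 84% left; negative
    values mean the budget is overspent. An empty list means no data.
    """
    if not remaining:
        return "No data"
    # Assumption: a budget of exactly 0 counts as burnt.
    if any(r <= 0 for r in remaining):
        return "Exhausted"
    # Assumption: exactly 20% left is not yet "At risk".
    if any(r < at_risk_threshold for r in remaining):
        return "At risk"
    return "Healthy"


# Values from the Service X and Service Y tables above.
service_x = [0.84, 0.45, 0.09, 0.19, 0.95]
service_y = [0.84, -0.55, 0.09, 0.19, 0.95]
print(classify_by_error_budget(service_x))  # At risk
print(classify_by_error_budget(service_y))  # Exhausted
```

The `at_risk_threshold` parameter stands in for the configurable error budget threshold. Changing it changes the breakdown between Healthy and At risk services.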
Error budget thresholds
You can modify the error budget thresholds to evaluate the health of your services from a different perspective.
For this, click Define thresholds under the service percentage scale and set the required values. As a result, the dashboard changes the service breakdown, and the deep link to the dashboard contains your updated values, so you can save and share it.
Your changes are only applied to your current session and aren't visible to other users. Once you switch the dashboard, navigate to any other Nobl9 section, or log out, it returns to its default values.
When your role permissions allow, you can set new defaults for the entire organization. For this, select Save as default in my organization before applying your changes.
Service health by burn rate
Service Health by burn rate uses thresholds for Low, Medium, and High burn rates. Services are evaluated over the specified time window and fall into these categories according to the thresholds set.
The dashboard is dynamic and refreshes every minute.
The dashboard displays the maximum and average burn rate values for the whole organization and per category.
- The Maximum is the highest burn rate among the SLOs.
- The Average is the arithmetic mean of burn rates across the SLOs.
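As an illustration of the two summary values, the sketch below computes them from a plain list of SLO burn rates. The function name and the sample values are hypothetical. How the dashboard handles SLOs without data is not stated here, so the sketch simply returns `None` for an empty list.

```python
from statistics import mean


def burn_rate_summary(burn_rates):
    """Return the maximum and arithmetic-mean burn rate across SLOs."""
    if not burn_rates:
        # Assumption: nothing to summarize when no burn rates are known.
        return None
    return {"maximum": max(burn_rates), "average": mean(burn_rates)}


# Hypothetical burn rates of three SLOs.
print(burn_rate_summary([1.0, 2.0, 6.0]))  # {'maximum': 6.0, 'average': 3.0}
```

The same function can be applied to all SLOs in the organization or only to the SLOs of one category to mirror the per-category figures.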
Burn rate time window and thresholds
You can modify the time window and thresholds to assess the health of your services from different perspectives.
When you edit the time window or thresholds, the dashboard changes the service breakdown, and the deep link to the dashboard contains your updated values.
Your changes are only applied to your current session and aren't visible to other users. Once you switch the dashboard, navigate to any other Nobl9 section, or log out, it returns to its default values.
To check the current defaults after you have modified the time window or threshold values, hover the cursor over Reset to defaults.
When your role permissions allow, you can change the default time window, offset, and thresholds for the entire organization. For this, select Save as default in my organization before applying your changes.
Time window
The time window is the interval of the burn rate evaluation. To modify it, click Edit window.
Relatively narrow time windows can cause services to fall into the No data category.
This occurs because the SLOs in such services deliver data for a given period only after the time window has moved on, leaving no time for collection and categorization.
Since the frequency of data collection depends on the SLO query interval and query delay (if any), increasing the time window can address this issue.
Set the time window greater than the longest combined query interval and query delay among the SLOs in the affected service.
This gives enough time to collect data and categorize the service, preventing it from falling into No data.
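The sizing rule above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the representation of each SLO as a `(query_interval, query_delay)` pair are assumptions, not part of any Nobl9 API.

```python
from datetime import timedelta


def longest_collection_lag(slos):
    """Longest combined query interval and query delay among the SLOs.

    slos: list of (query_interval, query_delay) timedelta pairs.
    """
    return max(interval + delay for interval, delay in slos)


def window_is_long_enough(time_window, slos):
    """True when the time window exceeds every SLO's combined lag."""
    return time_window > longest_collection_lag(slos)


# Hypothetical service with two SLOs; the slower one needs 5 + 2 minutes.
slos = [
    (timedelta(minutes=1), timedelta(0)),
    (timedelta(minutes=5), timedelta(minutes=2)),
]
print(longest_collection_lag(slos))                        # 0:07:00
print(window_is_long_enough(timedelta(minutes=5), slos))   # False
print(window_is_long_enough(timedelta(minutes=10), slos))  # True
```

The comparison is strict because a combined value equal to the time window can still place the service in No data.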
Another remedy is adding an offset to the time window. The offset moves the time window without changing its length. For example, when the time window is 5 minutes with a 1-minute offset, the dashboard still requests 5 minutes of data, but the time window shifts back by 1 minute:
- Time window = 5 min; now is 15:00
  - Offset = 0 min: the dashboard asks for data received from 14:55 to 15:00
  - Offset = 1 min: the dashboard asks for data received from 14:54 to 14:59
While an offset gives a head start to SLOs with a low querying frequency, it also means the dashboard shows older data for frequently updated SLOs.
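The effect of the offset on the requested data range can be reproduced with simple date arithmetic. This is a minimal sketch: `query_range` is a hypothetical helper, and the date used is arbitrary.

```python
from datetime import datetime, timedelta


def query_range(now, time_window, offset=timedelta(0)):
    """Return the (start, end) of the range the dashboard would request.

    The offset shifts the whole range back in time; the range length
    always equals the time window.
    """
    end = now - offset
    start = end - time_window
    return start, end


now = datetime(2024, 1, 1, 15, 0)  # arbitrary date, 15:00
window = timedelta(minutes=5)

start, end = query_range(now, window)
print(start.time(), end.time())  # 14:55:00 15:00:00

start, end = query_range(now, window, offset=timedelta(minutes=1))
print(start.time(), end.time())  # 14:54:00 14:59:00
```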
Thresholds
The thresholds determine the category criteria. To set other thresholds, click Define thresholds:
Based on the thresholds set in Image 3, the dashboard groups services as follows:
- The burn rate increases 15-fold or higher—this service falls into the High burn rate category
- The burn rate increases 5- to 15-fold—this service falls into the Medium burn rate category
- The burn rate doesn't increase, or its increase doesn't reach 5-fold—this service falls into the Low burn rate category
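The threshold logic in this list can be sketched as follows, taking 5 and 15 as the Medium and High thresholds from the example above. The function name is hypothetical, and the sketch classifies a single burn rate value. How the burn rates of several SLOs combine into one service category is not modeled here.

```python
def classify_by_burn_rate(burn_rate, medium_threshold=5.0, high_threshold=15.0):
    """Map a burn rate value to the Low, Medium, or High category."""
    if burn_rate >= high_threshold:
        return "High"
    if burn_rate >= medium_threshold:
        return "Medium"
    return "Low"


for value in (0.8, 5.0, 14.9, 15.0):
    print(value, classify_by_burn_rate(value))
# 0.8 Low
# 5.0 Medium
# 14.9 Medium
# 15.0 High
```

Passing different `medium_threshold` and `high_threshold` values mimics the effect of redefining the thresholds on the dashboard.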
Accessing service details
To view the details of a service on the dashboard, click the circle of the required service. This opens the list of SLOs the service holds, ordered by remaining error budget or burn rate.
- For standard SLOs and default sorting, Nobl9 displays the most alarming SLO first:
  - Service Health by error budget: the SLO with the least remaining error budget
  - Service Health by burn rate: the SLO with the highest burn rate
- For composite SLOs, the list starts with the one having the least remaining budget or the highest burn rate.
Click the SLO to open its details page:
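The default "most alarming first" ordering of the SLO list can be sketched as a sort with a view-dependent key. The dictionary fields (`name`, `remaining_budget`, `burn_rate`) and the view identifiers are hypothetical names chosen for this example.

```python
def most_alarming_first(slos, view):
    """Order SLOs so the most alarming one comes first.

    view: "error_budget" sorts by least remaining budget,
          "burn_rate" sorts by highest burn rate.
    """
    if view == "error_budget":
        return sorted(slos, key=lambda slo: slo["remaining_budget"])
    if view == "burn_rate":
        return sorted(slos, key=lambda slo: slo["burn_rate"], reverse=True)
    raise ValueError(f"unknown view: {view}")


# Hypothetical SLOs of one service.
slos = [
    {"name": "latency", "remaining_budget": 0.45, "burn_rate": 2.0},
    {"name": "availability", "remaining_budget": 0.09, "burn_rate": 1.2},
    {"name": "errors", "remaining_budget": 0.84, "burn_rate": 7.5},
]
print([s["name"] for s in most_alarming_first(slos, "error_budget")])
# ['availability', 'latency', 'errors']
print([s["name"] for s in most_alarming_first(slos, "burn_rate")])
# ['errors', 'latency', 'availability']
```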
Dashboard filtering and sorting
You can filter the data displayed on the Service Health Dashboard to see only the services you are interested in:
- Total
- Exhausted / High burn rate
- At risk / Medium burn rate
- Healthy / Low burn rate
- No data
or any combination of them.
For this, click the required tile or tiles in the dashboard header:
You can also filter the dashboards with labels. For this, enter labels added to services, projects, and SLOs. All applied filters are persisted in the dashboard URL. Use them as a deep link to a filtered view.
Read more about SLO search and filter logic.
Sorting
You can sort the resources on the dashboard alphabetically, by name (not considering SLO, service, or project state):
- In an ascending (A-Z) order.
- In a descending (Z-A) order.
This way, you sort all resources on the dashboard: projects, services within projects, and SLOs within services, regardless of their state.
Default sorting depends on the dashboard view:
- Service Health by error budget: By default, projects are sorted by State. As a result, the display is as follows (in the left-to-right direction):
  - Projects with the highest number of Exhausted services are displayed first.
  - Then the dashboard displays projects with the highest number of At risk services.
  - Projects with all Healthy services are displayed last.

  Services are ordered in the same way:
  - Services with the highest number of Exhausted SLOs are displayed first.
  - They are followed by the services with the highest number of At risk SLOs.
  - The Healthy services are shown last.
- Service Health by burn rate: By default, the services are sorted by the highest burn rate. When sorting by Lowest first, the dashboard assembles the groupings from the Low to High burn rate, showing the services with the lowest burn rate first within every grouping.