
Service Health Dashboard


Service Health Dashboard (or dashboard) summarizes the reliability of services in your organization. The dashboard targets product managers or executives who do not require a granular view of each SLO and instead are looking for a holistic view of reliability within their organization. Engineers or SREs can also use it to see a snapshot of the current state of their environment and drill down for more information.

The dashboard gives an aggregated view of the overall organizational health. It groups the services into three color-coded categories according to the settings, and uses the project-service-SLO hierarchy.

Service Health Dashboard features two views for more comprehensive analysis:

  • Service Health by error budget: services are grouped by the severity of error budget exhaustion calculated over the current time window.
  • Service Health by burn rate: services are grouped by Low, Medium, and High burn rate thresholds, evaluated over the specified time window.

You can select the way the services are displayed on the dashboard. Choose the required option in the View list under the dashboard header. The following options are available:

  • Circles
  • Hexagons
  • Circles with icons

Monitoring health of services

The color-coded categories indicate the available error budget of the SLOs or their burn rate, depending on the dashboard type.
The SLOs a given service includes determine the category this service falls into.

  • Green circle: healthy services.
    The SLOs these services contain have enough error budget or a low burn rate.

  • Orange circle: services trending in the wrong direction.
    The SLOs in these services have little error budget left or a medium burn rate.

  • Red circle: problematic services.
    Services in this category include SLOs with an exhausted error budget or a high burn rate.

  • Grey circle: services with no data available for the selected time window.

If at least one SLO in a given service meets the category's criteria, this service is located under the matching category on the dashboard.
Services are displayed on the dashboard under the projects they relate to. When a project contains services that fall into different categories, you see this project under all matching categories.

No data: time window vs. query parameters

A service can fall into the No data category when any of its SLOs has a combined query interval and query delay greater than or equal to the dashboard's time window. This makes the time window too short for categorization, because Nobl9 lacks time to accumulate data. Learn more about data source and SLO troubleshooting.

To reduce the number of No data services, try increasing the offset.
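The condition above can be written as a short check. This is a minimal sketch of the rule as stated in the text, not Nobl9's implementation; the function name and the sample intervals are hypothetical.

```python
from datetime import timedelta

def may_fall_into_no_data(time_window, slo_timings):
    """Return True if any SLO's query interval plus query delay is
    greater than or equal to the dashboard time window.

    slo_timings is a list of (query_interval, query_delay) pairs,
    one per SLO in the service (hypothetical input shape).
    """
    return any(interval + delay >= time_window for interval, delay in slo_timings)

window = timedelta(minutes=5)
timings = [
    (timedelta(minutes=1), timedelta(0)),          # 1 min in total: fine
    (timedelta(minutes=3), timedelta(minutes=2)),  # 3 + 2 = 5 min: equals the window
]
print(may_fall_into_no_data(window, timings))  # True
```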

Service health by error budget

Service health by error budget shows the health of your services from the perspective of their error budget.

The groupings are based on the error budget exhaustion severity and are as follows:

  • Healthy: All SLOs in this service have more than 20% of the error budget still available.

  • At risk: All SLOs in this service still have an available error budget, and at least one SLO for this service has less than 20% of the error budget left. For example, Service X is At risk because SLO C and SLO D are under 20%:


    Service X   Remaining error budget
    SLO A       84%
    SLO B       45%
    SLO C       9%
    SLO D       19%
    SLO E       95%

  • Exhausted: At least one of the SLOs in this service has burnt its entire error budget in the current time window.

For example, Service Y is Exhausted because SLO B has already burnt its error budget in a specified time window:

    Service Y   Remaining error budget
    SLO A       84%
    SLO B       -55%
    SLO C       9%
    SLO D       19%
    SLO E       95%

  • No data: There is no data available for the service’s SLOs, or the error budget hasn’t been calculated yet.
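The grouping rules above can be sketched as a small function. This is an illustration under stated assumptions, not the dashboard's actual logic: the function name is hypothetical, a remaining budget of 0% or less is treated as burnt, exactly 20% is treated as Healthy (the text does not cover that boundary), and a service counts as No data only when none of its SLOs has a calculated budget.

```python
def classify_service(remaining_budgets, threshold=20):
    """Classify a service from its SLOs' remaining error budgets (in percent).

    Use None for an SLO whose error budget hasn't been calculated yet.
    """
    known = [b for b in remaining_budgets if b is not None]
    if not known:
        return "No data"      # assumption: nothing calculated for any SLO
    if any(b <= 0 for b in known):
        return "Exhausted"    # at least one SLO has burnt its budget
    if any(b < threshold for b in known):
        return "At risk"      # budget left everywhere, but one SLO is under 20%
    return "Healthy"

service_x = [84, 45, 9, 19, 95]
service_y = [84, -55, 9, 19, 95]
print(classify_service(service_x))  # At risk
print(classify_service(service_y))  # Exhausted
```

The checks run from the most severe category down, so a single exhausted SLO decides the result regardless of the others.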

Error budget thresholds

You can modify the error budget thresholds to evaluate the health of your services from a different perspective.

To do this, click Define thresholds under the service percentage scale and set the required values. The dashboard then updates the service breakdown, and the deep link to the dashboard contains your updated values, so you can save and share it.

Image 1: Define error budget thresholds

Your changes are only applied to your current session and aren't visible to other users. Once you switch the dashboard, navigate to any other Nobl9 section, or log out, it returns to its default values.

When your role permissions allow, you can set new defaults for the entire organization. For this, select Save as default in my organization before applying your changes.

Service health by burn rate

Service Health by burn rate uses thresholds for Low, Medium, and High burn rates. Services are evaluated over the specified time window and fall into these categories according to the thresholds set.
The dashboard is dynamic and refreshes every minute.

The dashboard displays the maximum and average burn rate values for the whole organization and per category.

  • The Maximum is the highest burn rate among the SLOs.
  • The Average is the arithmetic mean of burn rates across the SLOs.
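As a quick illustration of the two summary values, the sketch below computes them for a handful of made-up burn rates. The function name and the sample numbers are hypothetical.

```python
from statistics import mean

def summarize_burn_rates(burn_rates):
    """Return (maximum, average) for a non-empty list of SLO burn rates."""
    return max(burn_rates), mean(burn_rates)

# Hypothetical burn rates of three SLOs.
rates = [0.5, 2.5, 6.0]
maximum, average = summarize_burn_rates(rates)
print(maximum, average)  # 6.0 3.0
```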

Burn rate time window and thresholds

You can modify the time window and thresholds to assess the health of your services from different perspectives.

When you edit the time window or thresholds, the dashboard updates the service breakdown, and the deep link to the dashboard contains your updated values.

Your changes are only applied to your current session and aren't visible to other users. Once you switch the dashboard, navigate to any other Nobl9 section, or log out, it returns to its default values.

To check the current defaults after you have modified the time window or threshold values, hover the cursor over Reset to defaults.

When your role permissions allow, you can change the default time window, offset, and thresholds for the entire organization. For this, select Save as default in my organization before applying your changes.

Time window

The time window is the interval of the burn rate evaluation. To modify it, click Edit window.

Time window vs. No data

Relatively narrow time windows can be the reason for services falling into the No data category.

This occurs because the SLOs in such services deliver data for a given period only after the time window has already moved on, leaving no time for collection and categorization.

Since the frequency of data collection depends on SLO query interval and query delay (if any), increasing the time window can address this issue.
So, set the time window greater than the longest combined query interval and query delay among SLOs in the affected service.
This gives enough time to collect data and categorize the service, preventing it from falling into No data.

Another remedy is adding an offset to the time window. The offset moves the time window without changing its length. For example, when the time window is 5 minutes with a 1-minute offset, the dashboard still asks for 5 minutes of data; however, the time window shifts back by 1 minute, that is:

  • Time window = 5 min; now is 15:00
    • Offset = 0 min
      The dashboard asks for data received from 14:55-15:00
    • Offset = 1 min
      The dashboard asks for data received from 14:54-14:59
Image 2: Edit time window
Note: While the offset gives a head start to SLOs with low querying frequency, it also means the dashboard uses older data from frequently updated SLOs.
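The offset arithmetic from the 15:00 example can be reproduced in a few lines. This is a sketch of the calculation only; the function name is hypothetical and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

def evaluation_window(now, time_window, offset=timedelta(0)):
    """Return (start, end) of the interval the dashboard would ask for."""
    end = now - offset         # the offset shifts the window back in time
    start = end - time_window  # the window length itself is unchanged
    return start, end

now = datetime(2024, 1, 1, 15, 0)  # arbitrary example date; "now is 15:00"
for offset_minutes in (0, 1):
    start, end = evaluation_window(
        now, timedelta(minutes=5), timedelta(minutes=offset_minutes)
    )
    print(f"offset={offset_minutes} min: {start:%H:%M}-{end:%H:%M}")
# offset=0 min: 14:55-15:00
# offset=1 min: 14:54-14:59
```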

Thresholds

The thresholds determine the category criteria. To set other thresholds, click Define thresholds:

Image 3: Define burn rate thresholds

Based on the thresholds set in Image 3, the dashboard groups services as follows:

  • If the burn rate increases 15-fold or more, the service falls into the High burn rate category
  • If the burn rate increases at least 5-fold but less than 15-fold, the service falls into the Medium burn rate category
  • If the burn rate doesn't increase, or its increase doesn't reach 5-fold, the service falls into the Low burn rate category
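The threshold logic above can be sketched as follows. The values 5 and 15 are the ones described for Image 3; the function name is hypothetical, and treating the "N-fold" figures directly as burn rate values is an assumption.

```python
def burn_rate_category(burn_rate, medium_threshold=5.0, high_threshold=15.0):
    """Map a burn rate to Low, Medium, or High using two thresholds."""
    if burn_rate >= high_threshold:
        return "High"    # 15-fold or more
    if burn_rate >= medium_threshold:
        return "Medium"  # at least 5-fold, below 15-fold
    return "Low"         # below 5-fold

for rate in (1.0, 5.0, 14.9, 15.0):
    print(rate, burn_rate_category(rate))
# 1.0 Low
# 5.0 Medium
# 14.9 Medium
# 15.0 High
```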

Accessing service details

To view the details of a service on the dashboard, click the circle with the required service. It opens the list of SLOs this service holds ordered by remaining error budget or burn rate.

  • For standard SLOs and default sorting, Nobl9 displays the most alarming SLO first:

    • Service Health by error budget: the SLO with the least remaining error budget
    • Service Health by burn rate: the SLO with the highest burn rate
  • For composite SLOs, the list starts with the one having the least remaining budget or the highest burn rate.

Click the SLO to open its details page:

Image 4: Accessing SLO details

Dashboard filtering and sorting

You can filter the data displayed on the Service Health Dashboard to see only the services you are interested in:

  • Total
  • Exhausted / High burn rate
  • At risk / Medium burn rate
  • Healthy / Low burn rate
  • No data

or any combination of them.

For this, click the required tile or tiles in the dashboard header:

Image 5: Applying filters on the Service Health Dashboard

You can also filter the dashboards with labels. For this, enter labels added to services, projects, and SLOs. All applied filters are persisted in the dashboard URL. Use them as a deep link to a filtered view.

Read more about SLO search and filter logic.

Sorting

You can sort the resources on the dashboard alphabetically by name (not considering SLO, service, or project state):

  • In an ascending (A-Z) order.
  • In a descending (Z-A) order.

This way, you sort all resources on the dashboard: projects, services within projects, and SLOs within services, regardless of their state.

By default, projects are sorted by State. As a result, the display is as follows (in the left-to-right direction):

  • Projects with the highest number of Exhausted services are displayed first.
  • Then the dashboard displays projects with the highest number of At risk services.
  • Projects with all Healthy services are displayed last.

The same rules apply to how the individual services are ordered in a project (from top to bottom):

  • Services with the highest number of Exhausted SLOs are displayed first.
  • They are followed by the services with the highest number of At risk SLOs.
  • The Healthy services are shown last.
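One way to picture the default ordering by State is as a sort on two counts. This is a rough sketch and not the dashboard's implementation: the function name, the project names, and the input shape are hypothetical, and breaking ties by name is an assumption the text does not confirm.

```python
def sort_projects_by_state(projects):
    """Order project names by number of Exhausted, then At risk services.

    projects maps a project name to the list of its services' states.
    """
    def sort_key(item):
        name, states = item
        # Negative counts put the largest counts first; name breaks ties.
        return (-states.count("Exhausted"), -states.count("At risk"), name)

    return [name for name, _ in sorted(projects.items(), key=sort_key)]

projects = {
    "payments": ["Healthy", "Healthy"],
    "checkout": ["Exhausted", "At risk"],
    "search": ["At risk", "At risk", "Healthy"],
    "auth": ["Exhausted", "Exhausted"],
}
print(sort_projects_by_state(projects))
# ['auth', 'checkout', 'search', 'payments']
```

The same key could be applied to services within a project by counting SLO states instead of service states.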

For a more in-depth look, consult additional resources: