Service level objectives
Our services may be small or incredibly deep and complex, but almost without fail these services can no longer be properly understood via the logs or stack traces we have depended on in the past. With this shift, we need not just new types of telemetry, but also new approaches for using that telemetry.
A Practical Guide to SLIs, SLOs & Error Budgets by Alex Hidalgo
The core concept of performance tracking is a service level objective (SLO). It refers to the desired performance of your service: the level you consider acceptable. In other words, you use SLOs to measure the reliability of your service.
SLOs exist along with two other concepts:
- Service level indicators (SLIs or objectives): quantifiable metrics that measure a specific aspect of your service's performance level.
  When assessing your service performance, you monitor whether your SLIs satisfy your SLO.
- Error budget: the acceptable number of failures you can have while still achieving your desired performance target.
  It shows how close your reliability is to your SLO over a given period of time.
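To make the relationship between a target and its error budget concrete, here is a minimal sketch in Python. It assumes an occurrence-based budget (failed events out of total events); the function names and sample numbers are illustrative and are not part of Nobl9.

```python
def error_budget(target: float, total_events: int) -> float:
    """Number of failed events allowed while still meeting the target.

    target is a fraction, for example 0.999 for a 99.9% SLO.
    """
    return (1.0 - target) * total_events


def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(target, total_events)
    if budget == 0:
        raise ValueError("no error budget: target is 100% or there are no events")
    return 1.0 - bad_events / budget


if __name__ == "__main__":
    # Hypothetical numbers: 99.9% target, 1,000,000 requests, 250 failures.
    print(round(error_budget(0.999, 1_000_000)))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

With these sample numbers, the service may fail roughly one request in a thousand, and a quarter of that allowance has been spent.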
Building on these concepts, an SLO unit refers to the number of unique error budgets Nobl9 calculates using the following:
- Data received from your data source
- Your configured target
This means every SLO is connected to a data source and has at least one error budget. Every additional SLO target is considered an additional error budget.
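The counting rule above can be sketched as follows. This is only an illustration of the rule as stated here (one error budget per configured target), not Nobl9's actual accounting; the SLO names and targets are hypothetical.

```python
from typing import Dict, List


def count_slo_units(slos: Dict[str, List[float]]) -> int:
    """Count SLO units as one error budget per configured target."""
    for name, targets in slos.items():
        if not targets:
            # Every SLO has at least one error budget, so at least one target.
            raise ValueError(f"SLO {name!r} needs at least one target")
    return sum(len(targets) for targets in slos.values())


if __name__ == "__main__":
    # Hypothetical SLOs: one with two targets, one with a single target.
    print(count_slo_units({
        "login-latency": [0.99, 0.95],
        "signup-success": [0.999],
    }))
```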
Nobl9 simplifies SLO development and management with its comprehensive features, from integrating your preferred data source, through SLO creation, to alerting and reports.
Using SLOs, you measure individual aspects of your service: the latency of authorization, the number of successful registrations, or anything else you need to monitor.
When you need to monitor the reliability of your complex system end-to-end, you can assemble multiple SLOs into a single composite.
Create an SLO
Nobl9 lets you create SLOs in the following ways:
- On the Nobl9 Web:
  - Following the steps in the SLO wizard accessible from the SLO grid
  - Using SLI analyzer
- Using the SLOs as Code tools
The SLO name (in contrast to its display name) is a unique identifier of your SLO. While you can edit an SLO's display name at any time, its name cannot be edited on the Nobl9 Web once you save the SLO. The only way to modify it is the sloctl get slos command in sloctl.
When configuring an SLO, you specify an SLI: a metric for Nobl9 to pull from your data source. Depending on your metrics source, it is specified as a query or a set of parameters.
For example, you can query the following:
Service type | Ask for |
---|---|
A web service or API | HTTPS responses with 2xx and 3xx status codes |
A queue consumer | Successful processing of a message |
Serverless and function-based architectures | Successful completion of an invocation |
A batch | Normal exit (for example, rc == 0) of the driving process or script |
A browser application | Completion of a user action without JavaScript errors |
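As an illustration of the first row, the sketch below computes a ratio-style SLI for a web service by treating 2xx and 3xx responses as good. It is a standalone example that assumes you already have a list of status codes; it is not a Nobl9 query, and the function names are hypothetical.

```python
from typing import Iterable


def is_good_response(status_code: int) -> bool:
    """Treat 2xx and 3xx responses as successful."""
    return 200 <= status_code < 400


def availability_sli(status_codes: Iterable[int]) -> float:
    """Share of good responses among all observed responses."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in codes if is_good_response(code))
    return good / len(codes)


if __name__ == "__main__":
    # Hypothetical sample: three good responses out of five.
    print(availability_sli([200, 201, 302, 404, 500]))
```

The same good-over-total shape applies to the other rows; only the definition of a good event changes.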
Create a composite SLO
See the composite SLO guide for detailed instructions on how to create composite SLOs 2.0.
SLO grid
Once created, your SLO appears in the SLO grid, in the Service Level Objectives section on the Nobl9 Web.
This is a central board of your SLOs. You can do the following:
- View all SLOs in the organization.
  You can see SLOs enclosed in projects you have access to.
- View SLO live graphs, rewind them, fast-forward up to the current time, and pause.
- Search and filter SLOs.
- View SLO charts from the perspective of different time windows and time zones.
  Any changes to the time zone made on the SLO grid apply to the SLO details page.
If you suspect an issue with an SLO, first verify its underlying query. For queries that seem accurate, the problem might lie with the data source itself. In that case, activate event logs for your data source to pinpoint errors and identify the number of impacted SLOs.
SLO details
Click the required SLO to open its details. Here you can manage and assess your SLO.
The following options are available in your SLO header row:
Button | Action | Notes |
---|---|---|
Expand SLO metadata | SLO metadata includes the following: | |
Copy link to the SLO | Link is copied along with the time window and time zone | |
Edit the SLO | SLO wizard opens | |
Open the Options menu | | |
Open the More actions menu | | |
¹ Charts settings apply to all SLOs per user. Once you set the chart visibility for one SLO, you'll see the same for all SLOs. When no chart is selected, the message Select at least one chart to see data appears in place of charts.
² Run Replay is inactive when the maximum period for historical data retrieval for this data source is set to 0.
You can also handle the time window parameters:
- Shift the SLO time window, change the time zone, and copy the time window.
  The default time zone matches the time zone set in the SLO grid.
- Zoom in on SLO charts to access SLI raw data.
  To zoom in on a specific time range, click and drag on the desired area of the chart. You can drag in both directions.
Highlights
In total, the SLO details page features three tabs. They organize the focus areas of your SLO and are as follows:
- Overview, with the focus on the primary objective
- Objectives, with the charts for all objectives in your SLO
- Alerts, comprising your SLO-related alerts
Once you open your SLO details, you land on the Overview tab. It highlights the reliability values and charts of the primary objective of your SLO.
The primary objective is the objective that takes center stage on the SLO details page. You can access its detailed information immediately upon opening your SLO details.
The primary objective is labeled with
The tiles on the Overview and Objectives tabs display the reliability values of the objective currently in focus. The tiles provide a snapshot of the most recent data, focusing on the last seven days of the chosen time window. Here, you can find the following:
- The error budget remaining (in percent)
- The burn rate
- The reliability target, along with the current reliability value
- The number of active alerts within the current time window
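The range the tiles summarize can be sketched as follows: the last seven days of the selected time window, or the whole window when it is shorter than seven days. This is a reading of the behavior described on this page, not Nobl9's implementation; `tile_range` and `TILE_SPAN` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

TILE_SPAN = timedelta(days=7)  # tiles focus on the last seven days


def tile_range(window_start: datetime, window_end: datetime) -> Tuple[datetime, datetime]:
    """Return the part of the time window that the tiles summarize."""
    if window_end < window_start:
        raise ValueError("window_end must not precede window_start")
    # Clamp to the window start when the window is shorter than seven days.
    return max(window_start, window_end - TILE_SPAN), window_end


if __name__ == "__main__":
    end = datetime(2024, 5, 29, tzinfo=timezone.utc)
    print(tile_range(end - timedelta(days=28), end))  # last 7 days of a 28-day window
    print(tile_range(end - timedelta(days=3), end))   # the whole 3-day window
```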
Below the tiles, the charts visualize the reliability parameters of the objective currently in focus. The charts cover the entire time window selected and include the following:
- The error budget remaining
- The reliability burn down
- The service level indicator
- The error budget burn rate
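For intuition about the burn rate chart, the sketch below uses a commonly cited definition: the observed error rate divided by the error rate the target allows. Whether Nobl9 computes the value exactly this way is an assumption; the function name and numbers are illustrative.

```python
def burn_rate(target: float, good_events: int, total_events: int) -> float:
    """Observed error rate divided by the error rate the target allows.

    A value of 1.0 means the budget is being spent exactly as fast as the
    target permits; values above 1.0 mean it will run out early.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    observed_error_rate = (total_events - good_events) / total_events
    allowed_error_rate = 1.0 - target
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # Hypothetical numbers: 99% target, 980 good events out of 1000.
    print(round(burn_rate(0.99, 980, 1000), 2))
```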
Under every objective name, you can find its summary:
- Target: the percentage of acceptable performance you're aiming for.
- Total error budget shows how much of the error budget this objective has within the time window.
- Value indicates which values you considered acceptable for this objective using one of the indicators: less than, less than or equal to, greater than, greater than or equal to.
- Type: the metric type, either ratio or threshold.
You can also view the objective's underlying metric settings.
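To show how the Value field and its comparison indicator work together for a threshold metric, here is a small sketch. The operator keys (`lt`, `lte`, `gt`, `gte`) and the function name are illustrative choices for this example, and the latency values are made up.

```python
import operator
from typing import Callable, Dict, Sequence

# The four indicators listed above, keyed by illustrative short names.
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    "lt": operator.lt,    # less than
    "lte": operator.le,   # less than or equal to
    "gt": operator.gt,    # greater than
    "gte": operator.ge,   # greater than or equal to
}


def threshold_reliability(values: Sequence[float], op: str, threshold: float) -> float:
    """Share of data points that satisfy the comparison against the threshold."""
    if not values:
        raise ValueError("no data points")
    compare = COMPARATORS[op]
    good = sum(1 for value in values if compare(value, threshold))
    return good / len(values)


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds; acceptable means <= 200 ms.
    print(threshold_reliability([120, 180, 250, 90], "lte", 200))
```

The resulting share is what you would compare against the objective's target.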
To access general SLO metadata, click (unfold) before the SLO name. The metadata include the following:
- Parent project, service, and data source
- SLO history: who created this SLO and the dates of creation and last update
- For newly created SLOs, and when no primary objective is set for an SLO, Nobl9 displays the lowest-target objective under the Overview tab.
- The reliability target always shows the actual value, regardless of the time window selected.
- The Active alerts tile always shows the real-time number of active alerts.
  Pausing the SLO also pauses the live updates of active alerts. In this case, the tile shows alerts that were active at the moment of pausing.
- For time windows shorter than seven days, the tiles and charts capture the entire time window.
- Reliability target changes.
  Values in both tiles are calculated based on the most recent data within the time window selected.
  Since the Target value always reflects the actual target, increasing it can briefly turn the Reliability tile red even when the reliability value is sufficient and enough error budget remains. The reverse also holds: when reliability is too low and very little error budget remains, the Reliability tile can turn green if you decrease the target far enough.
  This can happen for a relatively short time after you modify the reliability target, because Nobl9 recalculates the values in both tiles against the new target only when the next data arrives.
  This time range remains in the SLO history, so when you rewind the time window, you will still see it unless you change the target again or replay¹ the SLO.
¹ The limit on the maximum period for historical data retrieval per data source applies.
SLO alerts
Open this tab to access alert policies linked to your SLO and check triggered alerts, if any.
The number next to the tab indicates how many alerts are currently active.
Tiles display the alert policies linked to your SLO.
Every alert policy tile comprises a short summary:
- The alert policy name and severity
- Whether it triggered any alerts and, if so, when
- The option to silence or resume alerts
Depending on alert status, alert policies are marked as follows:
Status | Description |
---|---|
Currently alerting | |
Alert resolved | |
Alert silenced | |
No icon | No alert triggered |
Click the required alert policy name to open its details.
To silence an alert, click Silence in its tile. The alert you originally intended to silence is marked for silencing. Select the silence duration and click Silence to confirm.
Under the Silenced alerts section, the currently silenced alerts are listed. You can resume currently silenced alerts, silence any other alerts, or resume all alerts at once.
Under the tiles, the Alerts list shows alerts triggered per SLO objective. Nobl9 displays only the 1000 most recent alert events.
By default, you see alerts that have been active within your current time window. The newest alerts are displayed first.
You can filter the list by the following criteria:
- Alert status: All alerts, Triggered, Resolved
- SLO objective name
- Alert policy name
- Time window
When you filter by two or more criteria, the results satisfy all of them: Nobl9 applies the AND logical operator.
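The filtering behavior described above (criteria combined with AND, newest first, capped at 1000 entries) can be sketched like this. The `Alert` record and `filter_alerts` function are hypothetical and only model the list shown in the UI; they are not a Nobl9 API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

MAX_ALERTS = 1000  # the list shows at most the 1000 most recent alert events


@dataclass(frozen=True)
class Alert:
    status: str              # "Triggered" or "Resolved"
    objective: str           # SLO objective name
    policy: str              # alert policy name
    triggered_at: datetime


def filter_alerts(
    alerts: List[Alert],
    status: Optional[str] = None,
    objective: Optional[str] = None,
    policy: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> List[Alert]:
    """Keep alerts that match every criterion that was provided (logical AND)."""
    def matches(alert: Alert) -> bool:
        return (
            (status is None or alert.status == status)
            and (objective is None or alert.objective == objective)
            and (policy is None or alert.policy == policy)
            and (since is None or alert.triggered_at >= since)
            and (until is None or alert.triggered_at <= until)
        )

    selected = [alert for alert in alerts if matches(alert)]
    selected.sort(key=lambda alert: alert.triggered_at, reverse=True)  # newest first
    return selected[:MAX_ALERTS]


if __name__ == "__main__":
    sample = [
        Alert("Triggered", "latency", "fast-burn", datetime(2024, 1, 1, 0)),
        Alert("Resolved", "latency", "fast-burn", datetime(2024, 1, 1, 1)),
        Alert("Triggered", "latency", "slow-burn", datetime(2024, 1, 1, 3)),
    ]
    for alert in filter_alerts(sample, status="Triggered", objective="latency"):
        print(alert.policy, alert.triggered_at)
```

Leaving a criterion as `None` means it is not applied, which mirrors selecting All alerts in the status filter.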
Click the required alert to check its details.
Nobl9 returns alerts only for existing alert policies, SLOs, services, and objectives. So, Nobl9 won't return alerts in the following situations:
- If you delete an SLO, alert policy, or service
- If you delete an SLO, alert policy, or service and recreate it with the same name
- If you unlink an alert policy from an SLO
Change history (Enterprise)
The Change History tab in the SLO details view provides a comprehensive log of all changes made to an SLO. This feature allows you to track modifications with timestamps, the user who performed the change, and the tool used to execute it.
This functionality is available only for enterprise-tier accounts.
Event types
The following events are tracked in the Change history:
Event category | Event type |
---|---|
SLO management events | SLO created |
SLO management events | SLO updated |
Annotation events | Annotation created |
Annotation events | Annotation updated |
Annotation events | Annotation deleted |
Alert silence events | Alert silence created |
Alert silence events | Alert silence updated |
Tool types
Change history also allows you to identify the source tool for each change. Some of the visible statuses may not apply to SLOs (see the note below the table):
Source Tool | Description |
---|---|
Agent | N/A¹ |
sloctl | Change was applied through sloctl. |
Browser | Change was applied on the Nobl9 Web. |
Terraform | Change was applied through the Nobl9 Terraform Provider. |
System | Change was applied through internal system-generated processes. |
SCIM | N/A¹ |
SDK | Change was applied through the Nobl9 SDK for Go. |
Change history and RBAC
The Change history tab is available to all users who have access to the SLO.