Nobl9 is designed to be easy for everyone to use, including those without any developer or SRE experience. That said, we acknowledge that SLOs can be complicated! Below, you'll find a list of terms commonly used in Nobl9 to help you easily navigate through our platform.
The Nobl9 agent is a lightweight application that executes the queries defined for Nobl9 SLOs. Users can run the agent to retrieve SLI metrics from their configured data sources and send the data back to the Nobl9 backend. Queries are written in the language supported by the data source in question and executed via native APIs. The agent can be deployed in a Kubernetes cluster or as a Docker container.
When you configure an agent connection to a data source (as opposed to a direct connection - see Direct data source integration), Nobl9 does not make direct calls to your environment. You pass in your credentials when launching the agent, and those credentials are not stored in the Nobl9 backend. Moreover, the Nobl9 agent can be used to collect and return data even if your company's firewall blocks outbound connections.
When an alert is triggered, Nobl9 can automatically send a notification to an external tool, a REST endpoint (web service), or an email address. Alert methods can be associated with all available alert integrations.
An alert policy is a set of conditions (triggers) you want to track or monitor. These conditions determine what is monitored and when to activate an alert: when the performance of your service falls below the defined threshold, Nobl9 will send a notification to a predefined channel (depending on the specified alert method).
Each escalation threshold should be represented by a different alert policy with different severity levels (see Severity).
SLO annotations let Nobl9 users add notes to their metrics, which can be displayed in charts, annotation lists, and reports.
Calendar-aligned time windows
Nobl9 allows time windows for SLOs to be defined on a calendar-aligned or rolling basis. Calendar-aligned time windows are bound to specific periods on a calendar: for example, you can calculate your error budget starting at the beginning of each week, calendar month, quarter, or even year. This facilitates time-based reporting on the health of your service—when you tie your error budget to something like a calendar month, people know exactly when the error budget will return in full. Conversely, calendar-aligned time windows can downplay the impact of failures of your service: if your service was down for an entire day toward the end of the month, your users will remember this a few days later, when the new calendar window starts. For this reason, calendar-aligned time windows are best suited for SLOs that are intended to map to business metrics that are measured on a calendar-aligned basis.
The cooldown period is an interval measured from the last timestamp when all alert policy conditions were met. If the defined cooldown period passes without those conditions being met again, the alert event is resolved.
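As a rough sketch, the cooldown logic can be expressed as a simple time comparison (the function name is illustrative, not Nobl9's implementation):

```python
from datetime import datetime, timedelta

def is_resolved(last_conditions_met: datetime, now: datetime,
                cooldown: timedelta) -> bool:
    # The alert event resolves once a full cooldown period has passed
    # since the last time all alert policy conditions were met.
    return now - last_conditions_met >= cooldown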
Data export is a premium Nobl9 feature that allows users to export their SLO data (the raw time-series budget burndown for all services in an account) to CSV files or directly to a Snowflake warehouse. The data is exported to an S3 bucket defined as a destination once per day.
Direct data source integration
Nobl9 users can choose between a direct or agent configuration (see Agent) when connecting to a data source. A direct connection requires users to enter their authentication credentials (API key, token, etc.), which are encrypted and safely stored in Nobl9. These credentials are then used to connect directly to the external source in order to gather metrics data. The customer does not need to install anything on their server.
The error budget is the portion of requests that can fail over a defined period of time without incurring an SLO violation. It relies on the targets set up in your SLO.
From Implementing Service Level Objectives by Alex Hidalgo:
"An error budget is a way of measuring how your SLI has performed against your SLO over a period of time. It defines how unreliable your service is permitted to be within that period and serves as a signal of when you need to take corrective action."
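The arithmetic behind an error budget can be sketched as follows (illustrative only, with assumed function names; not Nobl9 code):

```python
def error_budget_events(target: float, total_events: int) -> int:
    # Number of bad events allowed in the window without violating the SLO.
    return int((1 - target) * total_events)

def error_budget_minutes(target: float, window_days: int) -> float:
    # Minutes of "bad" time allowed in the window (a time-based view).
    return (1 - target) * window_days * 24 * 60

error_budget_events(0.999, 1_000_000)  # 1,000 bad events allowed
error_budget_minutes(0.999, 30)        # ~43.2 minutes in a 30-day window
```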
Error budget burn rate
The burn rate shows how much of the error budget would be used up in the SLO time window if the number of bad events stays the same. For better granularity and immediate understanding of the system performance, the burn rate is always calculated for the last minute in Nobl9 charts.
Depending on the error budget calculation method, burn rate values can indicate the following:
| Burn rate value | Occurrences | Time Slices |
|---|---|---|
| 0 | No bad data points in the last minute | The last minute was good, or no data was available |
| Between 0 and the maximum | The last minute contained both good and bad data points | n/a |
| Maximum | All data points in the last minute were bad | The last minute was bad |
Assuming the number of errors remains constant throughout the SLO time window, we can anticipate the following scenarios based on the burn rate:
Burn rate below 1: the error budget will not be entirely consumed by the end of the time window. This indicates that the error rate is below the target, and the system performs better than expected.
Burn rate equal to 1: the error budget will be entirely exhausted exactly at the end of the time window. This implies that the error rate matches the target error rate, and the system performs as expected.
Burn rate above 1: the error budget will be exhausted before the time window ends. This indicates that the error rate exceeds the target error rate, and the system is not meeting expectations.
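These scenarios follow from defining burn rate as the observed error rate divided by the error rate the SLO allows. A minimal sketch (assumed names, not Nobl9's implementation):

```python
def burn_rate(bad: int, total: int, target: float) -> float:
    observed_error_rate = bad / total
    allowed_error_rate = 1 - target
    return observed_error_rate / allowed_error_rate

# With a 99.9% target, 0.2% of requests failing burns budget at ~2x,
# so the budget would be exhausted halfway through the time window.
rate = burn_rate(bad=2, total=1000, target=0.999)
```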
Error budget calculation method
Nobl9 offers two error budget calculation methods: Occurrences and Time Slices. The budgeting method you select determines whether the error budget will be calculated based on the count of good attempts vs. total attempts or the count of good minutes vs. total minutes in the time window defined for your SLO.
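A simplified sketch of the difference between the two methods, assuming per-minute buckets of (good, total) event counts (illustrative, not Nobl9's implementation):

```python
def occurrences_ratio(minute_buckets):
    # Good attempts out of all attempts across the whole window.
    good = sum(g for g, _ in minute_buckets)
    total = sum(t for _, t in minute_buckets)
    return good / total

def time_slices_ratio(minute_buckets, objective=0.99):
    # Good minutes out of all minutes; a minute counts as good when
    # its own good/total ratio met the per-minute objective.
    good_minutes = sum(1 for g, t in minute_buckets if t == 0 or g / t >= objective)
    return good_minutes / len(minute_buckets)

buckets = [(100, 100), (99, 100), (50, 100)]  # three one-minute buckets
occurrences_ratio(buckets)  # 249/300 = 0.83
time_slices_ratio(buckets)  # 2 of 3 minutes met the 99% objective
```

Note how a single low-traffic bad minute moves the Time Slices ratio far more than the Occurrences ratio.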
The Service Health Dashboard displays information on the health of your services based on the error budgets or burn rates for their defined SLOs. The following are the definitions of the possible statuses:
| Status by error budget (EB) | Status by burn rate (BR) |
|---|---|
| All SLOs in this service have more than 20% of their error budget remaining | All SLOs in this service have a burn rate lower than your low threshold value |
| All SLOs in this service still have some remaining error budget, but at least one has less than 20% of its error budget left | At least one SLO in this service has a burn rate at or above your low threshold value, but below your high threshold value |
| At least one of the SLOs in this service has already burned its error budget for the current time window, and at least one SLO has less than 20% of its error budget left | At least one SLO in this service has a burn rate equal to or greater than your high threshold value |
| There is no data available for the service's SLOs, or their error budgets or burn rates haven't been calculated yet | |
An indicator is a unique query against a given data source that defines a metric describing some property of the observed service. The same indicator can be used in one or more SLOs.
Labels are key-value pairs that can be attached to SLOs, services, and alert policies in the Nobl9 platform. They allow users to define attributes of resources and use them to filter and group SLOs across services in the SLO grid view and reports. Each label must be unique for a given SLO, but many SLOs can carry the same label.
A metric is a formula that uses measurements to determine how well the system performs in a specific situation. SLI metrics in Nobl9 are any two-dimensional sets of data where changes in a certain value are distributed over time. Nobl9 supports two types of metrics: ratio metrics and threshold metrics (see below).
Objectives are the thresholds for your SLOs. Nobl9 users can use objectives to define the tolerance levels for their metrics.
One of the two available error budget calculation methods. With the Occurrences method, we count the number of good attempts out of all attempts made. This method is well suited to measuring recent user experience, and since there are fewer total attempts during low-traffic periods, it automatically adjusts to lower traffic volumes. See also Time Slices.
Organization-level roles govern access across the Nobl9 platform. Depending on the desired access rights, users can be assigned the Organization Admin, User, or Viewer role:
Organization Admins have full read and write access to all areas in the Nobl9 platform. They are responsible for setting up single sign-on (SSO) and user management.
Organization User is the default role for anyone who signs in to the Nobl9 platform. Organization Users can be granted access to one or more projects by being assigned the role of Project Owner, Editor, Viewer, or Integrations User.
Organization Viewers have read-only access to all resources in the Nobl9 platform.
Projects are the primary logical grouping of resources across the Nobl9 platform. All Nobl9 resources are created within a project. Access controls at the project level let users control who can see and change these resources. The resources that can be grouped under a project include services, SLOs, alert policies, alert methods, and data sources.
Project-level roles entitle users to access a project and its underlying resources, such as services or SLOs. They include:
Project Owners, who have read and write access to the project(s) they own.
Project Editors, who are the primary users of the Nobl9 platform.
Project Viewers, who are the primary consumers of data in the Nobl9 platform.
Project Integrations Users, who can use a data source or an alert method in a given project, but cannot create, edit, or delete project resources.
Query customization variables
Query customization variables improve the consistency and integrity of the data flowing between your data source and Nobl9.

`queryDelay` defines the time range offset for data collection. For example, with a `0m` query delay, the Nobl9 agent calls for data from 14:50–15:00. With a `10m` query delay, the Nobl9 agent calls for data from 14:40–14:50.

`N9_DATA-SOURCE-NAME_QUERY_INTERVAL` defines how often the Nobl9 agent requests data from your data source. For example, with a `10m` query interval, the Nobl9 agent queries data for the last 10 minutes every 10 minutes.

`N9_DATA-SOURCE-NAME_COLLECTION_JITTER` defines the deviation of request frequency from Nobl9 to your data source. For example, with a `10m` query interval, Nobl9 requests data every 10 minutes, say, at 15:00:00, 15:10:00, and so on. With a `15s` jitter, data is instead requested at a random point within 15 seconds of each scheduled time, for example between 15:00:00 and 15:00:15.

`N9_DATA-SOURCE-NAME_HTTP_CLIENT_TIMEOUT_DURATION` defines how long the Nobl9 agent waits for a data source to respond after sending a query. For example, with a `15s` timeout, if the agent runs a query at 15:00:00, the data source must finish returning data by 15:00:15; otherwise, the query fails with a timeout.

Query customization variables are available for individual data sources. You can modify `queryDelay` either with the Nobl9 UI or with sloctl; the other variables are set as environment variables in the agent configuration.
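The interplay of the query interval and query delay can be sketched as simple window arithmetic (the function name is assumed; this is not agent code):

```python
from datetime import datetime, timedelta

def query_window(now: datetime, interval: timedelta, delay: timedelta):
    # Each query covers [now - delay - interval, now - delay]:
    # the delay shifts the whole collection window into the past.
    end = now - delay
    return end - interval, end

start, end = query_window(datetime(2024, 1, 1, 15, 0),
                          interval=timedelta(minutes=10),
                          delay=timedelta(minutes=10))
# start = 14:40, end = 14:50
```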
A ratio metric is an SLI metric composed of two time series; it determines the percentage of "good" events by dividing the number of good events by the total number of events.
As an example, suppose you own a website with roughly 30,000 visitors every day. 29,991 of these visits result in the website loading within the target of 0.5 seconds. Knowing this, you can calculate your ratio metric by dividing the number of good requests (the numerator) by the total number of requests (the denominator) and multiplying it by 100%:
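The website example above, as plain arithmetic:

```python
good_requests = 29_991   # visits that loaded within the 0.5 s target
total_requests = 30_000  # all visits that day
ratio_metric = good_requests / total_requests * 100  # 99.97%
```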
Role-based access control (RBAC) is used in Nobl9 to ensure granular user permissions and access to resources in the Nobl9 platform.
Reliability burn down
The reliability burndown rate is closely related to the error budget status but focuses on the percentage of recent events that reported a good versus bad status. It measures how your service has performed over time and gives you better data for deciding whether you need to fix your service or develop it further.
RoleBinding is a YAML object related to RBAC in Nobl9. A single RoleBinding object defines the relation between exactly one user and exactly one role.
Rolling time windows
A rolling time window moves as time progresses. For instance, if you have a 30-day window and a 10-second resolution, your error budget will be updated every 10 seconds as time moves forward. This allows for bad event observations to fall off and no longer be involved in your computations as they move outside that 30-day window.
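A minimal sketch of the rolling-window behavior, assuming events arrive as (timestamp in seconds, good/bad) pairs (illustrative; not how Nobl9 stores data):

```python
from collections import deque

class RollingWindow:
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()

    def add(self, ts: float, is_good: bool):
        self.events.append((ts, is_good))
        # Observations older than the window fall off and no longer
        # affect the computation.
        while self.events and ts - self.events[0][0] > self.window:
            self.events.popleft()

    def good_ratio(self) -> float:
        good = sum(1 for _, ok in self.events if ok)
        return good / len(self.events)
```

As time moves forward, a bad event observed 31 days ago simply drops out of a 30-day window and stops consuming error budget.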
A service in the Nobl9 platform is something that can be tested for reliability. It can represent a logical service endpoint like an internal or external API, a database, or anything else you care about setting an SLO for, such as a user journey. In Nobl9, services are organized under projects.
Service level indicator (SLI)
A service level indicator is a metric used to determine whether a service achieves the defined service level objective. In performance monitoring, for example, this could be the number of successful requests against the service over a given time period.
Service level objective (SLO)
A service level objective is an actual target value (or range of values) for the availability of the service, which is measured by a service level indicator. SLOs allow you to define the reliability of your products and services in terms of customer expectations. Nobl9 users can create SLOs for user journeys, internal services, or even infrastructure.
Each SLO can have one or more defined objectives (targets and values), with an indication of the user experience (e.g., Good or Acceptable) when that target is met.
The severity of an alert policy (see Alert policy) indicates the level of impact of a triggered alert event. Nobl9 users can define the severity level as follows:
High: A critical incident with a very high impact
Medium: A major incident with a significant impact
Low: A minor incident with low impact
sloctl is a command-line interface (CLI) for Nobl9. The sloctl CLI can be used for creating or updating multiple SLOs and objectives at once as part of CI/CD.
A threshold metric is an SLI metric composed of a single time series that represents a numerical property of a service that changes over time, such as the duration of an average HTTP response or CPU utilization. These values are evaluated against a set threshold. See also Threshold, Threshold target, and Threshold value.
Threshold target
The lowest acceptable good/total ratio in a given time window for an objective to be considered "met." For example, suppose you have a latency objective where you want responses to be returned in less than 100 ms (the threshold value; see below). If the target is set to 90%, for the Occurrences error budget calculation method this would be interpreted as "the response time of 90% of requests should be below 100 ms in a given time window." For the Time Slices method, it would be interpreted as "the response time should be below 100 ms for 90% of the minutes in a given time window."
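The latency example can be sketched under both methods (simplified, with assumed names; here a minute counts as good only if every response in it beat the threshold):

```python
THRESHOLD_MS = 100  # threshold value
TARGET = 0.90       # target

def occurrences_met(latencies_ms):
    # 90% of all responses must beat 100 ms.
    good = sum(1 for ms in latencies_ms if ms < THRESHOLD_MS)
    return good / len(latencies_ms) >= TARGET

def time_slices_met(per_minute_latencies):
    # 90% of the minutes in the window must be good minutes.
    good_minutes = sum(1 for minute in per_minute_latencies
                       if all(ms < THRESHOLD_MS for ms in minute))
    return good_minutes / len(per_minute_latencies) >= TARGET
```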
Threshold value
The value against which a raw indicator measurement is compared to determine whether a specific value is "good" or "bad."
One of the two available error budget calculation methods. With the Time Slices method, what is counted - i.e., the objective that is measured - is the number of good minutes (minutes in which the system operated within the defined boundaries) achieved, compared to the total number of minutes in the time window. See also Occurrences.