Nobl9 is designed to be easy for everyone to use, including those without any developer or SRE experience. That said, we acknowledge that SLOs can be complicated! Below, you'll find a list of terms commonly used in Nobl9 to help you easily navigate through our platform.
The Nobl9 Agent is a lightweight application that executes the queries defined for Nobl9 SLOs. Users can run the Agent to retrieve SLI metrics from their configured data sources and send the data back to the Nobl9 backend. Queries are written in the language supported by the data source in question and executed via native APIs. The Agent can be deployed in a Kubernetes cluster or as a Docker container.
When you configure an Agent connection to a data source (as opposed to a Direct connection - see Direct Data Source Integration below), Nobl9 does not make direct calls to your environment. You pass in your credentials when launching the Agent, and those credentials are not stored in the Nobl9 backend. Moreover, because the Agent only makes outbound connections, it can collect and return data even if your company's firewall blocks inbound connections.
When an alert is triggered, Nobl9 can automatically send a notification to an external tool, a REST endpoint (web service), or an email address. Alert methods can be associated with all available alert integrations.
An alert policy is a set of conditions (triggers) you want to track or monitor. These conditions determine what is monitored and when to activate an alert: when the performance of your service falls under the defined threshold, Nobl9 will send a notification to a predefined channel (depending on the specified alert method).
Each escalation threshold should be represented by a different alert policy with different severity levels (see Severity below).
SLO annotations enable Nobl9 users to add notes to their metrics which can be displayed in charts, annotation lists, and reports.
Calendar-Aligned Time Windows
Nobl9 allows time windows for SLOs to be defined on a calendar-aligned or rolling basis. Calendar-aligned time windows are bound to specific periods on a calendar: for example, you might calculate your error budget starting at the beginning of each week, calendar month, quarter, or even year. This facilitates time-based reporting on the health of your service - when you tie your error budget to something like a calendar month, people know exactly when the error budget will return in full. Conversely, calendar-aligned time windows can downplay the impact of failures of your service: if your service was down for an entire day toward the end of the month, your users will still remember the outage a few days later, when the new calendar window starts and the error budget resets in full. For this reason, calendar-aligned time windows are best suited for SLOs that are intended to map to business metrics that are measured on a calendar-aligned basis.
See also Rolling Time Windows.
The cooldown period is an interval measured from the last timestamp when all alert policy conditions were met. If the defined cooldown period passes without those conditions being met again, the alert event is resolved.
Data export is a premium Nobl9 feature that allows users to export their SLO data (the raw time-series budget burndown for all services in an account) to .csv files or directly to a Snowflake warehouse. The data is exported to an S3 bucket defined as a destination once per day.
Direct Data Source Integration
Nobl9 users can choose between a Direct or Agent configuration (see Agent above) when connecting to a data source. A Direct connection requires users to enter their authentication credentials (API key, token, etc.), which are encrypted and safely stored in Nobl9. These credentials are then used to connect directly to the external source in order to gather metrics data. The customer does not need to install anything on their server.
The error budget is the portion of requests that can fail over a defined period of time without incurring an SLO violation. It relies on the targets set up in your SLO.
From Implementing Service Level Objectives by Alex Hidalgo:
"An error budget is a way of measuring how your SLI has performed against your SLO over a period of time. It defines how unreliable your service is permitted to be within that period and serves as a signal of when you need to take corrective action."
Error Budget Burn Rate
The burn rate measures how fast the error budget is being consumed. It is calculated by comparing how much of the error budget has been consumed with how much of the time window has elapsed:

burn rate = (% of error budget consumed) / (% of time window elapsed)

The result of this operation is a real number:
If this number is > 1, your error budget is being consumed faster than you are allowing for. Without corrective action, it will be exhausted before the end of the current time window.
If this number is < 1, you should have some error budget remaining at the end of the current time window.
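The burn rate described above can be sketched in a few lines; the inputs (percent of budget consumed, percent of window elapsed) are our illustrative framing, not Nobl9's implementation:

```python
def burn_rate(budget_consumed_pct: float, window_elapsed_pct: float) -> float:
    """> 1 means the budget will be exhausted before the window ends;
    < 1 means some budget should remain at the end of the window."""
    return budget_consumed_pct / window_elapsed_pct

# Halfway through the window with 75% of the budget already gone:
rate = burn_rate(75.0, 50.0)  # 1.5: consuming budget 50% faster than sustainable
```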
Error Budget Calculation Method
Nobl9 offers two error budget calculation methods: Occurrences and Time Slices. The budgeting method you select determines whether the error budget will be calculated based on the count of good attempts vs. total attempts or the count of good minutes vs. total minutes in the time window defined for your SLO.
See also Occurrences and Time Slices.
The Service Health Dashboard displays information on the health of your services based on the error budgets for their defined SLOs. The following are the definitions of the possible statuses:
Healthy: All SLOs in this service have more than 20% of their error budget remaining.
At Risk: All SLOs in this service still have some remaining error budget, but at least one has less than 20% of its error budget left.
Exhausted: At least one of the SLOs in this service has already burned through its entire error budget for the current time window.
No Data: There is no data available for the service’s SLOs, or their error budgets haven’t been calculated yet.
An indicator is a unique query against a given data source that defines a metric describing some property of the observed service. The same indicator can be used in one or more SLOs.
Labels are key-value pairs that can be attached to SLOs, services, and alert policies in the Nobl9 platform. They allow users to define attributes of resources and use them to filter and group SLOs across services in the SLO grid view and reports. Each label must be unique for a given SLO, but many SLOs can carry the same label.
A metric is a formula that uses measurements to determine how well the system performs in a specific situation. SLI metrics in Nobl9 are any two-dimensional sets of data where changes in a certain value are distributed over time. Nobl9 supports two types of metrics: Ratio Metrics and Threshold Metrics (see below).
Objectives are the thresholds for your SLOs. Nobl9 users can use objectives to define the tolerance levels for their metrics.
One of the two available error budget calculation methods. With the Occurrences method, we count the number of good attempts out of all attempts made. This method is well suited to measuring recent user experience, and since there are fewer total attempts during low-traffic periods, it automatically adjusts to lower traffic volumes. See also Time Slices.
Organization-level roles enable access across the Nobl9 platform. Depending on the desired access rights, users can be assigned the Organization Admin, User, or Viewer role:
Organization Admins have full read and write access to all areas in the Nobl9 platform. They are responsible for setting up single sign-on (SSO) and user management.
Organization User is the default role for anyone who signs in to the Nobl9 platform. Organization Users can be granted access to one or more projects by being assigned the role of Project Owner, Editor, Viewer, or Integrations User.
Organization Viewers have read-only access to all resources in the Nobl9 platform.
Projects are the primary logical grouping of resources across the Nobl9 platform. All Nobl9 resources are created within a project. Access controls at the project level enable users to control who can see and change these resources. The resources that can be grouped under a project include:
Project-level roles enable users to access a project and its underlying resources, such as services or SLOs. They include:
Project Owners, who have read and write access to the project(s) they own.
Project Editors, who are the primary users of the Nobl9 platform.
Project Viewers, who are the primary consumers of data in the Nobl9 platform.
Project Integrations Users, who can use a data source or an alert method in a given project, but cannot create, edit, or delete project resources.
A ratio metric is an SLI metric composed of two time series: it determines the percentage of "good" events by dividing the count of good events by the total count of events.
As an example, suppose you own a website with roughly 30,000 visitors every day, and 29,991 of these visits result in the website loading within the target of 0.5 seconds. Knowing this, you can calculate your ratio metric by dividing the number of good requests (the numerator) by the total number of requests (the denominator) and multiplying by 100%: 29,991 / 30,000 × 100% = 99.97%.
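The worked example above can be reproduced in a few lines (a sketch, not Nobl9 code):

```python
def ratio_metric(good: int, total: int) -> float:
    """Percentage of good events out of all events."""
    return good / total * 100

# 29,991 good page loads out of 30,000 total visits:
availability = ratio_metric(29_991, 30_000)  # about 99.97
```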
Role-Based Access Control (RBAC) is used in Nobl9 to enable granular user permissions and access to resources in the Nobl9 platform.
Reliability Burn Down
The reliability burndown is closely related to the error budget status but focuses on the percentage of recent events that reported a good versus bad status. It measures how your service has performed over time and gives you better data for deciding whether you need to fix your service or can develop it further.
RoleBinding is a YAML object related to RBAC in Nobl9. A single RoleBinding object allows the definition of the relation between exactly one user and exactly one role.
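As an illustration only, a RoleBinding might look like the following; the apiVersion and field names here are assumptions based on common sloctl examples, so check the Nobl9 YAML reference for the authoritative schema:

```yaml
# Hypothetical sketch: binds one user to one project-level role.
apiVersion: n9/v1alpha
kind: RoleBinding
metadata:
  name: example-binding   # illustrative name
spec:
  user: example-user-id   # illustrative user identifier
  roleRef: project-viewer # one role
  projectRef: default     # the project the role applies to
```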
Rolling Time Windows
A rolling time window moves as time progresses. For instance, if you have a 30-day window and a 10-second resolution, your error budget will be updated every 10 seconds as time moves forward. This allows for bad event observations to fall off and no longer be involved in your computations as they move outside that 30-day window.
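The mechanic described above, where observations fall out of the computation once they leave the window, can be sketched as follows (our own illustration, not Nobl9's implementation):

```python
from collections import deque

class RollingWindow:
    """Tracks (timestamp, is_good) observations within a fixed-length window."""

    def __init__(self, length_s: int):
        self.length_s = length_s
        self.events = deque()  # (timestamp, is_good) pairs, oldest first

    def add(self, ts: float, is_good: bool) -> None:
        self.events.append((ts, is_good))

    def good_ratio(self, now: float) -> float:
        # Drop observations that have moved outside the window.
        while self.events and self.events[0][0] <= now - self.length_s:
            self.events.popleft()
        total = len(self.events)
        good = sum(1 for _, ok in self.events if ok)
        return good / total if total else 1.0
```

As time advances, a bad event eventually leaves the window and no longer hurts the ratio, mirroring the 30-day example above.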
A service in the Nobl9 platform is something that can be tested for reliability. It can represent a logical service endpoint like an internal or external API, a database, or anything else you care about setting an SLO for, such as a user journey. In Nobl9, services are organized under projects.
Service Level Indicator (SLI)
A Service Level Indicator is a metric used to determine whether a service achieves the defined Service Level Objective. When monitoring performance, for example, this could be the number of successful requests against the service over a given time period.
Service Level Objective (SLO)
A Service Level Objective is an actual target value (or range of values) for the availability of the service, which is measured by a Service Level Indicator. SLOs allow you to define the reliability of your products and services in terms of customer expectations. Nobl9 users can create SLOs for user journeys, internal services, or even infrastructure.
Each SLO can have one or more defined objectives (targets and values), with an indication of the user experience (e.g., Good or Acceptable) when that target is met.
The severity of an alert policy (see Alert Policy above) indicates the level of impact of a triggered alert event. Nobl9 users can define the severity level as follows:
High: A critical incident with a very high impact
Medium: A major incident with a significant impact
Low: A minor incident with low impact
sloctl is a command-line interface (CLI) for Nobl9. The sloctl CLI can be used for creating or updating multiple SLOs and objectives at once as part of CI/CD.
A threshold metric is an SLI metric composed of a single time series that represents a numerical property of a service that changes over time, such as the duration of an average HTTP response or CPU utilization. These values are evaluated against a set threshold. See also Threshold, Threshold Target, and Threshold Value.
The lowest acceptable good/total ratio in a given time window that will enable an objective to be considered as "met." For example, suppose you have a latency objective where you want responses to be returned in less than 100 ms (the threshold value; see below). If the target is set to 90%, for the Occurrences error budget calculation method this would be interpreted as “the response time of 90% of requests should be below 100 ms in a given time window." For the Time Slices method, it would be interpreted as "the response time should be below 100 ms for 90% of the minutes in a given time window."
This is the value against which a raw indicator is compared to determine if a specific value is "good" or "bad."
One of the two available error budget calculation methods. With the Time Slices method, what is counted - i.e., the objective that is measured - is how many good minutes (minutes in which the system was operating within the defined boundaries) were achieved, compared to the total number of minutes in the time window. See also Occurrences.
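The two budgeting methods can be contrasted with a small sketch; the numbers and function names are our own illustration, not Nobl9 code:

```python
def occurrences_reliability(good_events: int, total_events: int) -> float:
    """Occurrences method: fraction of good attempts out of all attempts."""
    return good_events / total_events

def time_slices_reliability(good_minutes: int, total_minutes: int) -> float:
    """Time Slices method: fraction of good minutes in the time window."""
    return good_minutes / total_minutes

# 999,500 good requests out of 1,000,000 attempts:
occ = occurrences_reliability(999_500, 1_000_000)
# A 30-day window has 43,200 minutes; suppose 43,100 of them were "good":
ts = time_slices_reliability(43_100, 43_200)
```

Note how Occurrences weights each request equally (so quiet periods contribute fewer attempts), while Time Slices weights each minute equally regardless of traffic.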