Welcome, and thank you for choosing Nobl9. This Getting Started guide shows you how to start using the Nobl9 platform. You’ll learn to access your account, connect to metrics, and create your first Service Level Objective (SLO).
Setting Up Your User Account
As a Nobl9 user, you will receive an invitation email with an activation link.
If you were invited to Nobl9 and did not receive an invitation email, contact Nobl9 Support.
To set up your account:
Locate the Nobl9 user invitation sent to your email.
Click the link to accept the invitation and follow the instructions to create your account. A confirmation page will appear asking you to return to the login screen.
Logging in to the Nobl9 User Interface
You will need to log in to the Nobl9 web user interface (UI) using your credentials:
Visit Nobl9 App in your browser.
Enter your email address and the password you created during the account setup process, or click “Login with Google” if you have a single sign-on (SSO) account.
Defining a Data Source and Running the Nobl9 Agent
Running data collection through an Agent means that no special inbound access to your network is needed and Nobl9 doesn't have to store your credentials for other metrics systems.
Use the following steps to define a data source and run a Nobl9 Agent:
Navigate to the Integrations section of the web UI.
Click the button on the Sources tab to define a data source.
Select Agent for the connection type and configure the data source.
Click the Add Data Source button.
Follow the on-screen instructions to run the Agent. A Kubernetes configuration and a simple Docker run command will be generated for you.
Recommendations:
Run the Agent(s) in a production cluster or in a location that can access production metrics.
Consider running the Agent in your local Docker environment at first for ease of troubleshooting.
Using Resources in Nobl9
This section provides an overview of the different types of resources in the Nobl9 platform. Refer to the YAML Guide to see how Nobl9 resource configuration is represented in the sloctl API, and how you can express these resources in .yaml format.
The following diagram shows five easy steps for getting started with Nobl9. The remainder of this section introduces the different resources shown here.
Projects are the primary logical grouping of resources in the Nobl9 platform. All Nobl9 resources, such as data sources, SLOs, and alerts, are created within a project. Access controls at the project level enable users to control who can see and change these resources. For example, you can allow all of your users to view the SLOs in a given project, but only a few users to make changes.
Before you can start creating SLOs, you have to create a project to put them in. To create a project:
Go to Catalog > Projects.
Click the button.
Enter a Display name (optional). The Name field will automatically be populated with a Kubernetes-style name, which you can modify if you like. We use this in our YAML configurations to ensure the uniqueness of object names.
Add an optional Description.
Click the Create Project button.
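The automatic population of the Name field from the Display name can be pictured with a small sketch. This is only an illustration: the function `suggest_name` is hypothetical, and it assumes that "Kubernetes-style" means an RFC 1123 label (lowercase letters, digits, and hyphens, at most 63 characters). The exact rule Nobl9 applies is not described in this guide.

```python
import re


def suggest_name(display_name: str, max_len: int = 63) -> str:
    """Derive a Kubernetes-style name from a display name (assumed RFC 1123 label rules)."""
    name = display_name.lower()
    # Replace every run of characters outside [a-z0-9] with a single hyphen.
    name = re.sub(r"[^a-z0-9]+", "-", name)
    # A label must start and end with an alphanumeric character.
    name = name.strip("-")
    # Truncate to the length limit, then drop a hyphen that truncation may expose.
    return name[:max_len].strip("-")


print(suggest_name("Checkout Service (EU)"))  # checkout-service-eu
```

If the suggested name does not fit your naming scheme, you can still edit it before creating the project, as noted in the steps above.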
Services in Nobl9 are organized under projects. A service can represent a logical service endpoint like an internal or external API, a database, or anything else you care about setting an SLO for, such as a user journey. Put differently, a Service in the Nobl9 platform is something that you want to test for reliability.
A service may also include other services. For example, in a service desk application, one service might create a new ticket. That service might rely on a user service, a queue, a notification service, and a database service, all of which could be defined as additional services in Nobl9.
When adding a service, you can use labels to add metadata such as team ownership or upstream/downstream dependencies. Services can either be manually added via the user interface or YAML or automatically discovered from a data source based on rules.
A service can have one or more SLOs defined for it. Every SLO created in Nobl9 must be tied to a service.
Data sources aggregate data for your services. Nobl9 allows you to have more than one data source defined for each of your services.
You can configure Nobl9 to connect to these data sources to collect service data in real time. Two connection methods are supported:
Use the Direct method if you want Nobl9 to access your server by connecting directly over the internet. This method may be less secure as you will need to open the port the data source is running on for Nobl9 to connect.
Use the Agent configuration if you want to run an Agent alongside your server. You will not need to directly expose your server to Nobl9; the Agent will periodically connect to Nobl9 using an outbound connection.
Service Level Objectives
With a service and its data sources configured, you can define the thresholds for Service Level Indicators. Together with a time window, these create a unique SLO.
SLOs allow you to define the reliability of your products and services in terms of customer expectations. You can create SLOs for user journeys, internal services, or even infrastructure. For more background on SLOs, see our guide on creating Your First SLO.
Complete the following steps to create an SLO in the Nobl9 UI:
Navigate to the Service Level Objectives page.
Click the button to start the SLO wizard, and follow the five-step configuration process in the wizard.
In step 1, select a Service from the drop-down list to tag the service this SLO applies to.
In step 2, choose a Data Source from the drop-down list. Then select a type of Metric and enter a Query (refer to the table below for some examples of what can be queried):
A Threshold Metric is a single time series evaluated against a threshold.
A Ratio Metric allows you to enter two time series to compare (for example, a count of good requests and total requests).
In step 3, choose a Rolling or Calendar-Aligned Time Window:
Rolling time windows are better for tracking the recent user experience of a service.
Calendar-aligned windows are best suited for SLOs that are intended to map to business metrics that are measured on a calendar-aligned basis, such as every calendar month or every quarter.
In step 4, select the Error Budget Calculation Method (either Occurrences or Time Slices). For more information on these options, see the use case example located in the last section of this guide.
In step 5, enter a Name for your objective.
- To configure alerts, select one or more Alert Policies from the drop-down list. (Follow the instructions in the Creating Alert Policies section if you haven’t created any alert policies yet.)
- Add a Description. Document relevant details or metadata about the SLIs and SLO.
As a best practice, we recommend adding the responsible team’s or owner’s details and a summary of the purpose of creating this specific SLO. The description can provide some quick context about this SLO to any team member.
The following are some examples of what can be queried:
| Description | Result |
| --- | --- |
| Web service or API | HTTPS responses with 2xx and 3xx status codes. |
| In a queue consumer | Successful processing of a message. |
| In serverless and function-based architectures | Successful completion of an invocation. |
| In a batch | Normal exit (for example, rc == 0) of the driving process or script. |
| In a browser application | Completion of a user action without JavaScript errors. |
Creating Alert Policies
Once you have created your SLO, you can configure an Alert Policy and alert method for it. An Alert Policy expresses a set of conditions you want to track or monitor. The conditions for an Alert Policy define what is monitored and when to activate an alert: when the performance of your service is declining, Nobl9 will send a notification to a predefined channel.
Alerts in Nobl9 can be sent to several different tools, including PagerDuty, MS Teams, Slack, Discord, Jira, Opsgenie, and ServiceNow. Email alerts are also supported, and you can use webhooks to send alerts to any service that has a REST API, such as FireHydrant, Asana, xMatters, and many more.
For details on how to set up an alert, refer to the Alerting section of the Nobl9 documentation.
Follow these steps to set up an Alert Policy in the web UI:
Navigate to the Alerts page.
Click the button to start the Alert Policy wizard, and follow the configuration process in the wizard.
In step 1:
Define your Alert Conditions by selecting one or more of the boxes and choosing your parameters. You can set a maximum of three alert conditions; create another alert policy if you want to set more than three. A defined alert condition monitors the behavior and volatility of a data source:
The Error budget relies on the targets set in your service level agreement (SLA) and SLOs. Error budgets measure the maximum amount of time a system can fail without repercussions.
The Remaining error budget is the amount left from the error budget set in the SLO.
The Error budget burn rate measures how fast the error budget is disappearing. The numbers here must match the numbers in the error budget (i.e., in the SLO, or in the error budget condition defined above).
Define a Cooldown period for your Alert Policy:
The cooldown period is mandatory and must be an integer value of at least 5 minutes.
The default value is 5 minutes.
You can choose between three units (hours, minutes, seconds).
When the cooldown value is set in a YAML file using several units (e.g., 5m30s), the UI displays it in a single unit (e.g., 330 seconds).
In step 2, select a Project, then enter a Display name (optional) and a Name for the alert (this is mandatory and will be filled in automatically if you provide a display name). Provide a description (optional), and set the Severity to one of the following:
High: A critical incident with a very high impact.
Medium: A major incident with a significant impact.
Low: A minor incident with low impact.
In step 3, select the box to set up a Webhook. Set up the integration in YAML, using sloctl to apply the changes. The webhook integration will then be available in the Alert wizard in the web UI.
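The cooldown rules from step 1 (a minimum of 5 minutes, units of hours, minutes, and seconds, and a multi-unit value such as 5m30s shown as 330 seconds) can be checked with a short sketch. The function `cooldown_to_seconds` is a hypothetical helper written for this illustration and is not part of Nobl9 or sloctl. It assumes durations are written as digits followed by h, m, or s.

```python
import re

MIN_COOLDOWN_SECONDS = 5 * 60  # the guide states a 5-minute minimum

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def cooldown_to_seconds(value: str) -> int:
    """Convert a duration such as '5m30s' into whole seconds."""
    parts = re.findall(r"(\d+)([hms])", value)
    # Reject empty strings and strings with leftover characters, e.g. '5x'.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unrecognized duration: {value!r}")
    total = sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)
    if total < MIN_COOLDOWN_SECONDS:
        raise ValueError("cooldown must be at least 5 minutes")
    return total


print(cooldown_to_seconds("5m30s"))  # 330
```

A value such as 4m raises an error in this sketch because it falls below the 5-minute minimum.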
Getting Started with sloctl
Now that you've set up your first SLO and attached an Alert Policy to it, you can install our command-line interface, sloctl.
To ensure a great onboarding experience with sloctl:
Use a Mac, Linux, or Windows machine to run sloctl.
The Nobl9 command-line utility makes it easy to create and update many SLOs at once. See the sloctl User Guide for more details.
Select a Kubernetes cluster or any Docker environment—or use a Docker environment on your local machine—to run the Nobl9 Agent.
The Nobl9 Agent collects Service Level Indicator (SLI) metrics from your existing metrics systems, such as Datadog, New Relic, or Prometheus.
Verify that you received an email from Nobl9 with a link to set up your user account.
The command-line interface (CLI) for Nobl9 is named sloctl. You can use the sloctl CLI for creating or updating multiple SLOs at once, creating or updating multiple thresholds, or updating SLOs as part of continuous integration/continuous delivery (CI/CD).
The web user interface is available to give you an easy way to create and update SLOs and to familiarize you with the features available in Nobl9, while sloctl aims to provide a systematic and/or automated approach to maintaining SLOs as code.
Use Case of SLO Configuration
In this section, we walk through creating an SLO for a sample service using sloctl.
A Typical Example of a Latency SLO for a RESTful Service
First, we want to pick an appropriate Service Level Indicator to measure the latency of responses from a RESTful service. In this example, let's assume our service runs in the NGINX web server, and we're going to use a threshold-based approach to define acceptable behavior. For example, we want the service to respond in a certain amount of time.
There are many ways to measure application performance. In this case we're giving an example of server-side measurement at the application layer (NGINX). However, it might be advantageous for your application to measure this metric differently.
For example, you might choose to measure performance at the client, or at the load balancer, or somewhere else. Your choice depends on what you are trying to measure or improve, as well as what data is currently available as usable metrics for the SLI.
The threshold approach uses a single query, and we set thresholds or breaking points on the results from that query to define the boundaries of acceptable behavior. In the SLO YAML, we specify the indicator like this:
In this example, we’re using Prometheus, but the concepts are similar for other metrics stores. We recommend running the query against your Prometheus instance and reviewing the resulting data, so you can verify that the query returns what you expect and that you understand the units (whether it's returning latencies as milliseconds or fractions of a second, for example). In our case, the query returns values between roughly 60 and 150 milliseconds, with some occasional outliers.
Choosing a Time Window
When creating an SLO, we need to choose whether we want a rolling or calendar-aligned window:
Calendar-aligned windows are best suited for SLOs that are intended to map to business metrics that are measured on a calendar-aligned basis, such as every calendar month, or every quarter.
Rolling windows are better for tracking the recent user experience of a service (say, over the past week or month).
For our RESTful service, we will be using a rolling time window to measure the recent user experience. This will help us make decisions about the risk of changes, releases, and how best to invest our engineering resources on a week-to-week or sprint-to-sprint basis. We want the "recent" period that we're measuring to trail long enough to smooth out noise.
We’ll go with a 28-day window, which has the advantage of containing an equal number of weekend days and weekdays as it rolls:
- count: 28
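The claim that a 28-day window always holds the same mix of weekdays and weekend days can be verified with a quick calculation. The sketch below slides a window across 60 consecutive start dates and collects the weekend-day counts. The start date is arbitrary, and the 30-day window is included only for contrast.

```python
from datetime import date, timedelta


def weekend_days(start: date, length: int) -> int:
    """Count Saturdays and Sundays in the `length`-day window starting at `start`."""
    days = (start + timedelta(days=i) for i in range(length))
    return sum(1 for d in days if d.weekday() >= 5)  # 5 = Saturday, 6 = Sunday


first = date(2024, 1, 1)  # arbitrary start date, chosen only for illustration
counts_28 = {weekend_days(first + timedelta(days=i), 28) for i in range(60)}
counts_30 = {weekend_days(first + timedelta(days=i), 30) for i in range(60)}

print(counts_28)  # {8}
print(sorted(counts_30))  # [8, 9, 10]
```

Because 28 is a multiple of 7, every position of the rolling window contains exactly 8 weekend days and 20 weekdays, whereas a 30-day window drifts between 8 and 10 weekend days as it rolls.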
Choosing a Budgeting Method
There are two budgeting methods to choose from: Time Slices and Occurrences.
In the Time Slices method, what we count (the objective we measure) is how many good minutes were achieved (minutes where our system is operating within defined boundaries), compared to the total number of minutes in the window.
This is useful for some scenarios, but it has a disadvantage when we're looking at the “recent” user experience, as we are with this SLO. The disadvantage is that a bad minute that occurs during a low-traffic period (say, in the middle of the night for most of our users, when they are unlikely to even notice a performance issue) would penalize the SLO the same amount as a bad minute during peak traffic times.
The Occurrences budgeting method is well suited to this situation. With this method, we count good attempts (in this example, requests that are within defined boundaries) against the count of all attempts (i.e., all requests, including requests that perform outside of the defined boundaries). Since total attempts are fewer during low-traffic periods, it automatically adjusts to lower traffic volumes.
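The difference between the two methods can be seen on a small, made-up data set. In the sketch below, each tuple is one minute of traffic, and the last minute is a low-traffic minute in which most requests are bad. The per-minute criterion `slice_target` (a minute is good when at least 95% of its requests are good) is an assumption made for this illustration, as are the function names.

```python
# Each tuple is one minute of hypothetical traffic: (good_requests, total_requests).
# Eight healthy minutes, one slightly degraded minute, one bad low-traffic minute.
minutes = [(1000, 1000)] * 8 + [(990, 1000)] + [(2, 10)]


def occurrences_reliability(minutes):
    """Good attempts divided by all attempts across the whole window."""
    good = sum(g for g, _ in minutes)
    total = sum(t for _, t in minutes)
    return good / total


def time_slices_reliability(minutes, slice_target=0.95):
    """Good minutes divided by all minutes in the window."""
    good_minutes = sum(
        1 for good, total in minutes
        # Assumption: a minute without traffic counts as good.
        if total == 0 or good / total >= slice_target
    )
    return good_minutes / len(minutes)


print(f"Occurrences: {occurrences_reliability(minutes):.4f}")
print(f"Time Slices: {time_slices_reliability(minutes):.4f}")
```

With this data, the single bad minute costs Time Slices a full 10% of the window, while Occurrences loses only 8 bad requests out of 9,010 (plus the 10 slow ones in the degraded minute), because the bad minute carried very little traffic.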
We’ll go with the Occurrences method:
Let’s assume we've talked to our product and support teams and can establish the following thresholds:
- The service has to respond fast enough that users don't see any lag in the web applications that use the service. Our Product Manager thinks that 100ms (1/10th of a second) is a reasonable threshold for what qualifies as okay latency. We want to try to hit that 95% of the time, so we code the first threshold like this:
- budgetTarget: 0.95
This threshold requires that 95% of requests are completed within 100ms.
You can name each threshold however you want. We recommend naming them how a user of the service (or how another service that uses this service) might describe the experience at a given threshold. Typically, we use names that are descriptive adjectives of the experience when the threshold is not met. When this threshold is violated, we can say that the user's experience is "Laggy."
- Some requests fall outside of that 100ms range. We want to make an allowance for that, but we also want to set other thresholds so that we know that even in its worst moments our service is performing acceptably, and/or that its worst moments are brief. Let's define another threshold. In the above threshold, we allow 5% of requests to run longer than 100ms. We want most of that 5% (say, 80% of it, which is 4% of all queries) to still return within a quarter of a second (250ms). That means 99% of the queries should return within 250ms (95% + 4%), so we’ll add a threshold like this:
- budgetTarget: 0.99
This threshold requires that 99% of requests are completed within 250ms.
- While that covers the bulk of requests, even within the 1% of requests that we allow to exceed 250ms, the vast majority of them should complete within half a second (500ms). So, let’s add the following threshold:
- budgetTarget: 0.999
This threshold requires that 99.9% of requests are completed within 500ms.
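To make the three thresholds concrete, the following sketch checks a batch of latencies against them using the Occurrences idea: the share of requests at or under each limit is compared with the target. The sample data and the `evaluate` function are hypothetical, and treating "within 100ms" as "less than or equal to 100" is an assumption.

```python
# (latency limit in ms, required fraction of requests), taken from the thresholds above.
THRESHOLDS = [(100, 0.95), (250, 0.99), (500, 0.999)]


def evaluate(latencies_ms):
    """Return (limit, target, achieved fraction, target met?) for each threshold."""
    results = []
    for limit_ms, target in THRESHOLDS:
        within = sum(1 for x in latencies_ms if x <= limit_ms)
        achieved = within / len(latencies_ms)
        results.append((limit_ms, target, achieved, achieved >= target))
    return results


# Hypothetical sample of 10,000 requests.
sample = [80] * 9600 + [200] * 320 + [400] * 65 + [900] * 15

for limit_ms, target, achieved, met in evaluate(sample):
    status = "met" if met else "missed"
    print(f"<= {limit_ms}ms: {achieved:.2%} (target {target:.1%}) {status}")
```

In this sample the 100ms and 250ms targets are met (96% and 99.2%), but only 99.85% of requests finish within 500ms, so the 99.9% target is missed even though just 15 requests out of 10,000 were that slow.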
In sum, our SLO definition for this example use case looks like this:
- apiVersion: n9/v1alpha
nrql: SELECT average(duration) FROM SyntheticRequest WHERE monitorId = '339adbc4-01e4-4517-88cf-ece25cb66156'
- displayName: ok
- displayName: laggy
- displayName: poor
- count: 1