Nobl9 agent
Nobl9 agents monitoring
The Grafana dashboard monitors the performance and health of your Nobl9 agents.
It is available for beta-version agents with Prometheus metrics enabled.
The dashboard visualizes vital metrics, making it easier to identify issues, track performance and health trends, and spot anomalies.
It comprises three focus areas:
- Overview: the uptime of your agents over time, the number of running SLOs, and the last connection to Nobl9.
- Data intake: query latency, data points received, and successful and failed queries.
- Data upload: the number of data points uploaded along with the upload latency.
To use the Grafana dashboard, make sure the following prerequisites are met:
- You have a Grafana instance with the Prometheus data source.
- The Nobl9 agent Prometheus exporter is enabled.
- The N9_METRICS_PORT port is open for metric scraping by Prometheus.
- The Nobl9 agent version is 0.74.0-beta or higher.
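For reference, a minimal Prometheus scrape configuration could look like the following sketch; the job name, target host, and port are assumptions and should match your agent deployment and the N9_METRICS_PORT value:
scrape_configs:
  - job_name: nobl9-agent            # assumed job name
    static_configs:
      - targets: ["localhost:9090"]  # host and port are assumptions; use your N9_METRICS_PORT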
To install the dashboard, do the following:
- Import the dashboard JSON file into your Grafana instance.
- Go to the dashboard settings and configure the Prometheus data source to point to your Prometheus server.
Logging mode options
In normal logging mode, the Nobl9 agent writes events only at startup and when errors are encountered. If you have no SLOs, you see only the startup events. If no errors are returned, you have successfully connected to Nobl9.
After adding your first SLO, the Nobl9 agent starts emitting query request and response diagnostic logs every 10 minutes per SLI. The logs are labeled as follows:
- info for successful query logs
- warn for unsuccessful query logs
You can modify the emission frequency by setting N9_DIAGNOSTIC_QUERY_LOG_SAMPLE_INTERVAL_MINUTES to the required interval.
To deactivate diagnostic logging, set its value to 0.
- See the list of data sources supporting diagnostic logging for live SLI ingestion.
- In most cases, these logs can help you diagnose the issue. Note that problems are usually related to the firewall, authentication, or the query.
In addition to diagnostic logs, you can activate verbose logging. This option outputs logs for all operations the agent performs when executing commands.
You can activate verbose logging with Kubernetes or Docker.
When the agent is already deployed in your Kubernetes cluster, add the args and command fields at the container level of your YAML configuration and use them to inject data into your agent.
Once the pod is created with custom values for command and args, these fields cannot be changed.
spec:
containers:
- name: agent-container
image: nobl9/agent:latest-beta
resources:
requests:
memory: "350Mi"
cpu: "0.1"
env:
- name: N9_DIAGNOSTIC_QUERY_LOG_SAMPLE_INTERVAL_MINUTES
value: "5"
args:
- --debug
command:
- telegraf
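Once the pod restarts with these settings, you can follow the verbose output with kubectl; the deployment name below is an assumption, while the container name matches the snippet above:
kubectl logs -f deployment/nobl9-agent -c agent-container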
When you invoke the agent with docker run, append --debug to the command provided in the Nobl9 Web UI:
docker run -d --restart on-failure \
--name nobl9-agent-nobl9-dev-datadog-month-test \
-e N9_CLIENT_ID="unique_client_id" \
-e N9_CLIENT_SECRET="unique_client_secret" \
-e DD_API_KEY="<DATADOG_API_KEY>" \
-e DD_APPLICATION_KEY="<DATADOG_APPLICATION_KEY>" \
-e N9_DIAGNOSTIC_QUERY_LOG_SAMPLE_INTERVAL_MINUTES="5" \
nobl9/agent:latest-beta \
telegraf --debug
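You can then follow the verbose output with docker logs, using the container name from the command above:
docker logs -f nobl9-agent-nobl9-dev-datadog-month-test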
Use custom arguments
The Nobl9 agent image has no ENTRYPOINT defined, so when passing custom args, the command must be set to run the telegraf executable.
The agent uses a distroless base image (gcr.io/distroless/static-debian11) and doesn't contain a shell.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      serviceAccountName: release-name-nobl9-agent
      containers:
        - name: agent-container
          image: nobl9/agent:latest
          imagePullPolicy: Always
          args:
            - --debug
          command:
            - telegraf
You can also use environment variables in the args field.
Specify the variable following the pattern $(VAR). For example:
containers:
- name: agent-container
image: nobl9/agent:latest
imagePullPolicy: Always
args:
- "$(AGENT_ARGS)"
Agent monitoring
To monitor the health and status of your agents, scrape agent metrics.
Custom SSL certificates
For security purposes, Nobl9 uses a distroless-based Docker image for the agent.
When you need to use a custom SSL certificate, provide your mycert.crt file and build a custom agent Docker image with the following snippet:
FROM debian:11 AS builder
RUN apt-get update && apt-get install -y ca-certificates
COPY ./mycert.crt /usr/local/share/ca-certificates/mycert.crt
RUN update-ca-certificates

# Use a fixed agent version here instead of latest-beta
FROM nobl9/agent:latest-beta
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
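Build and tag the custom image, then reference it in place of the stock nobl9/agent image in your docker run command or Kubernetes deployment; the tag below is an assumption:
docker build -t nobl9-agent-custom-ca:latest .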
In rare cases, you might need to skip certificate verification, for example, when setting up the Nobl9 agent in a temporary environment.
To do so, create a cfg.toml file and set insecure_skip_verify to true:
[n9prometheus]
insecure_skip_verify = true
Then, run the agent with the above-mentioned configuration, setting the N9_LOCAL_CONFIG environment variable to the path of your local cfg.toml file, for example:
docker run -d --restart on-failure \
--name nobl9-agent-name \
-e N9_CLIENT_SECRET="[redacted]" \
-e N9_METRICS_PORT=9090 \
-e N9_CLIENT_ID="[redacted]" \
-e N9_LOCAL_CONFIG=/tmp/cfg.toml \
-v $(pwd)/cfg.toml:/tmp/cfg.toml \
nobl9/agent:latest
Nobl9 strongly recommends using certificate verification for all connections. This ensures a secure connection and protects against potential security risks.
Consider skipping verification only as a last resort, when providing the root certificate is extremely difficult or impossible.
Retrieving Client ID and Client Secret
You can retrieve agents' client credentials with the sloctl get agents -k command.
It retrieves client credentials for all agents in the default project.
To retrieve client credentials for all agents in your organization, use the -A flag as follows: sloctl get agents -Ak.
To retrieve client credentials for specific agents, list their names explicitly in your command, for example:
sloctl get agents -k my-agent1 my-agent2
Missing data
When your agent fails to exchange data with a data source, start troubleshooting by checking the following:
Make sure your agent is running: an agent on a desktop stops running and sending data when your machine sleeps. A quick status check is shown after the list below.
If the agent is running properly and data is still missing, the cause may be security tools that block connections from unknown sources to protect your data source. Allowing connections from the following Nobl9 IP addresses can help:
- 18.159.114.21
- 18.158.132.186
- 3.64.154.26
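To quickly confirm the agent is running (the first check above), you can inspect the container or pod; the container name and label below are assumptions matching the earlier examples:
docker ps --filter name=nobl9-agent     # Docker: the container should show an Up status
kubectl get pods -l app=nobl9-agent     # Kubernetes: the pod should be Running (label is an assumption)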
Query errors
Nobl9 is designed to accept single metric values back from queries. If you see a query error, check that your query returns a single metric value.
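For example, with a Prometheus-style query, aggregating the result down to a single series avoids this class of error; the metric and label below are hypothetical:
# Returns a single value suitable for an SLI:
sum(rate(http_requests_total{job="api"}[5m]))
# Returns one series per label combination and may cause a query error:
rate(http_requests_total[5m])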
When data is missing for Splunk queries, make sure the number of events doesn't exceed the limits set.
Timeout
Setting a custom timeout for the Nobl9 agent requires additional configuration.
Create a file with the timeout variable and set the required value, keeping it below 60s.
For example, create cfg.toml and adjust the timeout value as shown in this Prometheus example:
[n9prometheus]
timeout = "20s"
Then, run the agent with the above-mentioned configuration: set the N9_LOCAL_CONFIG environment variable with the path to your local cfg.toml file.
Read more about query customization.
Data backlog
Nobl9 handles data backlog in various cases. Below, you can find several possible data backlog scenarios and solutions to help you address them.
Impact on the Nobl9 agent
The impact of a backlogging issue (for instance, a network issue) differs between the data sources available through Nobl9.
In general, if the Nobl9 agent can't reach the Nobl9 data intake, it caches data points using the FIFO (First In, First Out) method and tries to reconnect indefinitely.
A long connection outage may exhaust the agent's cache. The cache size depends on the memory you allocate to the container, and you can allocate as much memory as you need.
You can also activate the agent's timestamp cache persistence to prevent such situations.
Data outage in the data source
If the data source is down, Nobl9 won't receive data, which can impact your error budget. A maximum agent time window is defined for each data source.
If the agent keeps running, it tries to catch up after reconnecting (the catch-up window varies by data source). If the agent restarts, it may stop retrying.
The Nobl9 integration mechanism that queries APIs or metrics systems is naturally susceptible to outages between the agent and the metrics system, or between the agent and Nobl9. These outages may also include an outage of the metrics system itself.
Remember that SLOs are always approximations of the SLI metric and are not ideal reflections of this metric.
Agent time windows for collecting missing data
When the Nobl9 agent stops collecting data points (for example, due to an incident on the data source's side), it caches the last collected data point.
The Nobl9 agent keeps the information about the last successfully fetched data point in its timestamp cache. When it gets data from a data source, it starts from the cached timestamp. If the agent can't collect data points (for example, due to a data source outage), it tries to fetch the missing data points within a specified time window that depends on the data source your agent is connected to.
| Source | Agent's max time window |
|---|---|
| Amazon CloudWatch | 60m |
| Amazon Redshift | 60m |
| AppDynamics | 60m |
| Azure Monitor (beta) | 60m |
| BigQuery | 60m |
| Datadog | 45m |
| Dynatrace | 60m |
| Elasticsearch | 60m |
| Google Cloud Monitoring | 60m |
| InfluxDB | 60m |
| Instana | 5h (600 data points, one every 30s) |
| ServiceNow Cloud Observability | 60m |
| New Relic | 87m30s (350 buckets, one every 15s) |
| Pingdom | 60m |
| OpenTSDB | 60m |
| Splunk | 60m |
| Splunk Observability | 60m |
| Sumo Logic | 30m |
| ThousandEyes | 60m |
Effectively, this means that if your agent exceeds the time window without collecting any data points, it moves the time window forward and can no longer fetch the missing points from the dropped minutes.
If that's the case, we recommend replaying the involved SLOs once you've resolved all issues with the agent.