Agent Troubleshooting
Logging Mode Options
In the normal logging mode, the Nobl9 Agent only writes events at startup and when errors are encountered. If you have no SLOs, you will only see the startup events. If no errors are returned, this means you have successfully connected to Nobl9.
After adding your first SLO, you will only see a new log message if there is an error. In most cases, these logs can help you diagnose the issue. Note that problems are usually related to the firewall, authentication, or the query.
For debugging purposes, the Agent allows you to enable verbose logging. In this mode, the output includes logs for every operation the Agent performs when it executes commands. You can enable this option as follows:
Kubernetes: If the Agent is already deployed in your Kubernetes cluster, add args: ["--debug"] to the YAML configuration file at the level of your container:
spec:
  containers:
    - name: agent-container
      image: nobl9/agent:latest
      resources:
        requests:
          memory: "350Mi"
          cpu: "0.1"
      args: ["--debug"]
Docker: When you invoke the Agent with docker run, add --debug at the end of all the statements that are given in the UI:
docker run -d --restart on-failure \
--name nobl9-agent-nobl9-dev-datadog-month-test \
-e N9_CLIENT_ID="unique_client_id" \
-e N9_CLIENT_SECRET="unique_client_secret" \
-e DD_API_KEY="<DATADOG_API_KEY>" \
-e DD_APPLICATION_KEY="<DATADOG_APPLICATION_KEY>" \
nobl9/agent:latest \
telegraf --debug
Troubleshooting
You can monitor the health and status of your Agents by scraping Agent metrics.
Custom SSL Certificates
For security purposes, Nobl9 uses a distroless-based Docker image for the Agent. When you need to use a custom SSL certificate, provide your mycert.crt file and build a custom Agent Docker image with the following snippet:
FROM debian:11 as builder
USER root
RUN apt update
RUN apt install ca-certificates -y
COPY ./mycert.crt /usr/local/share/ca-certificates/mycert.crt
RUN update-ca-certificates
# Replace "latest" with a fixed Agent version
FROM nobl9/agent:latest
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
Retrieving client_id and client_secret
You can retrieve Agents' client_id and client_secret through the sloctl get agents -k command. It retrieves client_id and client_secret for all Agents in the default project. If you want to retrieve client_id and client_secret for all Agents in your Organization, use the -A flag as follows: sloctl get agents -Ak. If you want to retrieve the client_id and client_secret for specific Agents, you can name them explicitly in your command, for example:
sloctl get agents -k my-agent1 my-agent2
Missing Data
If data appears to be missing, check whether your Agent is running. An Agent that runs on your desktop will stop running and sending data when your machine is sleeping.
Data Backlog
Our users often ask how Nobl9 handles data backlog in various cases. Below, you can find several possible data backlog scenarios and solutions that will help you address them.
Backlogging impact on the Nobl9 Agent
The impact of backlogging (caused, for instance, by network issues) differs between the data sources available through Nobl9. In general, if the Nobl9 Agent can't reach the Nobl9 data intake, it caches data using the FIFO (First In, First Out) method and tries to reconnect indefinitely.
A prolonged connection outage may exhaust the Agent's cache. Note that the cache size is user-configurable: it depends on how much memory is allocated to the container.
With Agent version 0.44.0, you can enable the Agent's timestamps cache persistence to prevent such situations. Refer to Agent persistence for more details.
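The caching behavior described above can be illustrated with a minimal Python sketch. It is not the Agent's implementation: the FifoCache class, its capacity parameter, and the choice to evict the oldest point once the cache is full are assumptions made for illustration, since the text only states that the cache is FIFO and that it can be exhausted.

```python
from collections import deque


class FifoCache:
    """Hypothetical bounded first-in, first-out buffer for undelivered data points."""

    def __init__(self, capacity):
        self.capacity = capacity  # stands in for the memory given to the container
        self.items = deque()
        self.dropped = 0  # points lost because the cache was exhausted

    def push(self, point):
        # Assumption: when the cache is full, the oldest point is evicted.
        if len(self.items) >= self.capacity:
            self.items.popleft()
            self.dropped += 1
        self.items.append(point)

    def flush(self, send):
        """Deliver points oldest-first; stop at the first failed send."""
        sent = 0
        while self.items:
            if not send(self.items[0]):
                break  # intake still unreachable, keep the point for the next retry
            self.items.popleft()
            sent += 1
        return sent


# Usage: five points arrive during an outage, but the cache holds only three.
cache = FifoCache(capacity=3)
for point in range(5):
    cache.push(point)

delivered = []


def send(point):
    delivered.append(point)
    return True  # pretend the connection is back


cache.flush(send)
print(delivered, cache.dropped)  # [2, 3, 4] 2
```

In this sketch, a larger capacity delays the moment at which points start being dropped, which mirrors the relationship between container memory and cache size described above.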
Data outage in the Data Source
If the data source is down, Nobl9 won't get data, which might impact your error budget. See the table below for more details.
If the Agent keeps running, it will try to catch up after reconnecting (how far back it can reach varies by data source). If the Agent restarts, it might stop retrying.
Any Nobl9 integration mechanism that queries APIs or metrics systems is naturally susceptible to outages between the Agent and the metrics system or between the Agent and Nobl9. These outages may also include the outage of the metrics system itself.
Remember that SLOs are always approximations of the SLI metric and are not ideal reflections of this metric.
Agent's Time Windows for Collecting Missing Data
The Nobl9 Agent keeps the information about the last successfully fetched data in the timestamps cache. When it gets data from a data source, it attempts to do so beginning from the cached timestamp. If the Agent can't collect data points (for example, due to an incident or outage on the data source's side), it tries to fetch the missing data points for a specified time window that depends on the data source your Agent is connected to.
Source | Agent's Max Time Window |
---|---|
Amazon CloudWatch | 60m |
Amazon Redshift | 60m |
AppDynamics | 60m |
BigQuery | 60m |
Datadog | 45m |
Dynatrace | 60m |
ElasticSearch | 60m |
Google Cloud Monitoring | 60m |
InfluxDB | 60m |
Instana | 5h (600 data points, one every 30s) |
ServiceNow Cloud Observability | 60m |
NewRelic | 87m30s (350 buckets, one every 15s) |
Pingdom | 60m |
OpenTSDB | 60m |
Splunk | 60m |
Splunk Observability | 60m |
SumoLogic | 30m |
ThousandEyes | 60m |
Effectively, this means that if your Agent exceeded the time window without collecting any data points, it moves the time window forward and can no longer fetch the missing points from the dropped minutes.
In that case, we recommend reimporting historical data once you've resolved all issues with the Agent.
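As a rough illustration of how the time window limits recovery, the following Python sketch computes where fetching could resume after an outage and how much data falls outside the window. The function name catch_up_range and the dictionary layout are hypothetical, and the logic is a simplification rather than the Agent's actual algorithm; only the window values come from the table above.

```python
from datetime import datetime, timedelta, timezone

# Windows that differ from 60m in the table above.
MAX_WINDOW = {
    "Datadog": timedelta(minutes=45),
    "SumoLogic": timedelta(minutes=30),
    "Instana": timedelta(seconds=600 * 30),   # 600 data points x 30s = 5h
    "NewRelic": timedelta(seconds=350 * 15),  # 350 buckets x 15s = 87m30s
}
# Every other source listed in the table uses 60m.
OTHER_LISTED_SOURCES = timedelta(minutes=60)


def catch_up_range(source, last_fetched, now):
    """Return (start, lost): where fetching resumes and how much data is out of reach."""
    window = MAX_WINDOW.get(source, OTHER_LISTED_SOURCES)
    earliest = now - window
    if last_fetched >= earliest:
        # The whole gap still fits inside the window.
        return last_fetched, timedelta(0)
    # The window has moved forward; older points can't be fetched anymore.
    return earliest, earliest - last_fetched


# Usage: a 90-minute Datadog outage with a 45-minute window.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, lost = catch_up_range("Datadog", now - timedelta(minutes=90), now)
print(start.time(), lost)  # 11:15:00 0:45:00
```

A non-zero lost value corresponds to the situation in which reimporting historical data is recommended.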
Query Errors
Nobl9 is designed to accept single metric values back from queries. If you see a query error, check that what is being returned by that query is a single metric value.
Splunk queries may behave differently; see the documentation for more details.
Timeout
Adjusting the timeout of the Nobl9 Agent requires additional configuration in a config file. Create a file, for example, cfg.toml, with the following variables to adjust the timeout value for Prometheus:
[n9prometheus]
timeout = "20s"
The timeout value must be less than 60s.
Then, run the Agent with this configuration by setting the N9_LOCAL_CONFIG environment variable to the path of your local cfg.toml file.
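If you generate the config file from a script, a small pre-flight check can catch an out-of-range value before the Agent starts. The Python sketch below is one possible approach, not part of Nobl9 tooling: the render_timeout_config function is hypothetical, and it assumes whole-second values such as "20s" and a value greater than zero, while the Agent itself may accept other duration formats.

```python
import re

MAX_TIMEOUT_SECONDS = 60  # documented limit: the value must be less than 60s


def render_timeout_config(timeout, section="n9prometheus"):
    """Return cfg.toml content for the given timeout, for example "20s".

    Assumption: only whole-second values are handled in this sketch.
    """
    match = re.fullmatch(r"(\d+)s", timeout)
    if match is None:
        raise ValueError(f"unsupported timeout format: {timeout!r}")
    seconds = int(match.group(1))
    if not 0 < seconds < MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be above 0s and below 60s, got {timeout!r}")
    return f'[{section}]\ntimeout = "{timeout}"\n'


# Usage: write the validated content to the file that N9_LOCAL_CONFIG points to.
with open("cfg.toml", "w", encoding="utf-8") as config_file:
    config_file.write(render_timeout_config("20s"))
```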