Agent metrics

Monitoring the health and performance of the Nobl9 agent is crucial for ensuring reliable data collection. To facilitate this, Nobl9 exposes key metrics that provide insights into the agent's operational status and resource utilization.

These metrics, available for scraping at the /health and /metrics endpoints, allow you to verify that the agent is running optimally and to make any necessary resource adjustments.

You enable the metrics endpoints by setting environment variables in your agent configuration, whether it runs in a Docker container or a Kubernetes cluster.

To get the Nobl9 agent metrics, you need to have a monitoring solution that can scrape a Prometheus-compliant metrics endpoint.
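
For example, a Prometheus server could scrape the agent with a job along these lines. This is a minimal sketch: the job name is illustrative, and `<agent-IP>` stands for your agent's address.

```yaml
# Minimal Prometheus scrape job for the Nobl9 agent (job name and target are illustrative).
scrape_configs:
  - job_name: "nobl9-agent"
    metrics_path: /metrics
    static_configs:
      - targets: ["<agent-IP>:9090"]   # agent host and N9_METRICS_PORT (default 9090)
```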

Metric endpoints​

| Endpoint | Example request | Response | Notes |
| --- | --- | --- | --- |
| /health | `curl -X GET http://<agent-IP>/health` | HTTP/1.1 200 OK | OK means the agent code completed initialization and is running |
| /metrics | `curl -X GET http://<agent-IP>/metrics` | Available metrics | Text-based response handled by the Golang Prometheus libraries |
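
A successful scrape of /metrics returns data in the Prometheus text exposition format. The fragment below is illustrative only; the exact help strings, label values, and numbers will differ in your environment:

```text
# HELP n9_uptime Seconds since the agent start
# TYPE n9_uptime counter
n9_uptime{agent_version="0.97.0",metric_source="prometheus",organization="example-org"} 86400
# HELP n9_buffer_load Total count of metrics in the buffer not yet uploaded to the Nobl9 platform
# TYPE n9_buffer_load gauge
n9_buffer_load 0
```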

Agent's default port​

To scrape the agent's metrics, define N9_METRICS_PORT as an environment variable in your agent configuration when deploying to Kubernetes or Docker.

Example Kubernetes Deployment YAML with the N9_METRICS_PORT variable defined:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nobl9-agent-example
  namespace: default
spec:
  <...>
    spec:
      containers:
        - name: agent-container
          image: nobl9/agent:0.97.0
          resources:
            requests:
              memory: "350Mi"
              cpu: "0.1"
          env:
            <...>
            - name: N9_METRICS_PORT
              value: "9090"
```
N9_METRICS_PORT configuration
  • Purpose: Specifies the TCP port for the /metrics and /health endpoints.
  • Default: 9090. You can change this value to suit your needs.
  • To disable: Remove this variable to stop exposing the endpoints.
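
When the agent runs in Docker instead, the same variable can be passed with `-e` and the metrics port published to the host. The command below is a sketch based on the Deployment example above; `<...>` stands for the other required agent environment variables, which are omitted here.

```bash
# Sketch: expose the agent's /metrics and /health endpoints from a Docker container.
# <...> stands for the remaining required agent environment variables.
docker run -d \
  <...> \
  -e N9_METRICS_PORT=9090 \
  -p 9090:9090 \
  nobl9/agent:0.97.0
```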

Available metrics by version​

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| n9_accumulated_points | countervec | metric_source, organization, plugin_name, slo_name (optional) | Points added to the accumulator since the agent start |
| n9_all_buffer_metrics_dropped | counter | | Total number of dropped metrics due to the buffer overflow, unlabeled |
| n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent |
| n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform |
| n9_bytes_received_total | counter | | Total count of bytes received by the agent since the last start |
| n9_bytes_sent_total | counter | | Total count of bytes sent since the last agent start |
| n9_emitted_points | countervec | metric_source, organization | Points successfully emitted to the N9 platform since the agent start |
| n9_input_sent_bytes | counter | metric_source, organization, plugin_name | Bytes sent in input requests' bodies since the last agent start |
| n9_last_config_update_success_time | gauge | metric_source, organization | Seconds since the last successful config read |
| n9_last_input_successful_response_time | gauge | metric_source, organization, plugin_name, slo_name | Seconds since the last successful input response |
| n9_last_upload_success_time | counter | | Seconds since the last successful data upload |
| n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start |
| n9_output_sent_bytes | counter | | Bytes sent in output responses' bodies since the last agent start |
| n9_query_lag | histogramvec | metric_source, organization, plugin_name, slo_name (optional) | Histogram of SLO query lag by organization, plugin name, and metric source |
| n9_query_latency | histogramvec | metric_source, organization, plugin_name, slo_name (optional) | Latency histogram of all SLI query requests |
| n9_query_time_range | histogramvec | metric_source, organization, plugin_name, slo_name (optional) | Time range histogram of all SLI query requests per organization, plugin, and metric source |
| n9_query_total | countervec | metric_source, organization, plugin_name, slo_name (optional), status_code | Total count of all queries run since the last agent start |
| n9_skipped_points_out_of_time | countervec | metric_source, organization, plugin_name, slo_name | Count of points skipped because they were requested out of the time range (LogicMonitor only) |
| n9_sli_total | gaugevec | metric_source, organization, plugin_name | Total count of the SLIs the agent collects data for |
| n9_upload_latency | histogramvec | metric_source, organization | Latency histogram of all data uploads by SLI to the Nobl9 platform |
| n9_upload_total | countervec | status_code | Total count of all data uploaded to the Nobl9 platform since the last agent start |
| n9_uptime | counter | agent_version, metric_source, organization | Seconds since the agent start |

The slo_name label (available since Nobl9 agent v0.98.0-beta) applies when N9_METRICS_INCLUDE_SLO_LABELS=true. This feature is disabled by default to prevent potential high-cardinality issues in agent metrics.

Key points:

  • When several SLOs share the same query, the first SLO (alphabetically) is used
  • For multi-indicator integrations where multiple queries are combined in one request, the first SLO is used with the batch prefix
  • Check your data source logs to correlate all SLOs using a particular query
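
To turn the label on, add the variable to the agent's env section alongside N9_METRICS_PORT. A minimal sketch based on the Deployment example above:

```yaml
env:
  <...>
  - name: N9_METRICS_PORT
    value: "9090"
  # Adds the optional slo_name label to supported metrics (agent v0.98.0-beta and later).
  # Disabled by default to avoid high-cardinality label sets.
  - name: N9_METRICS_INCLUDE_SLO_LABELS
    value: "true"
```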

Metrics usage​

Agent metrics offer crucial insights into the Nobl9 agent's performance and health within your Kubernetes environment. By visualizing these metrics, you can effectively diagnose operational issues and monitor resource usage. Here are a couple of examples:

Monitoring memory utilization
If you notice a spike in the agent's memory consumption, you can analyze its memory usage over time. This helps identify the cause, such as high query load or specific data source configurations, allowing you to optimize resource allocation.

Troubleshooting operational pauses
An agent might appear to be running in Kubernetes but periodically stop sending data. Analyzing the following metrics can help pinpoint the problem:

  • n9_upload_total{status_code!~"200"}: A spike here indicates failed data uploads to Nobl9.
  • n9_bytes_sent_total: The graph for this metric stops growing, showing that data transmission has halted.
  • n9_last_upload_success_time: This metric shows the timestamp of the last successful upload, confirming the pause.
  • n9_buffer_load: This value will steadily increase as the agent queues data it can't send.

Together, these metrics strongly suggest a connectivity issue. This information is valuable for further troubleshooting with your networking team or Nobl9 support.
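
If you scrape the agent with Prometheus, queries along these lines can back the checks above. This is a sketch: the 5-minute rate window is illustrative, and the label filters mirror the metrics listed in this section.

```promql
# Failed uploads per second (any status code other than 200)
rate(n9_upload_total{status_code!~"200"}[5m])

# Data transmission rate; a flat line at zero means uploads have stalled
rate(n9_bytes_sent_total[5m])

# Buffer backlog; a steady climb means the agent is queueing data it cannot send
n9_buffer_load
```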

Metrics visualization in Grafana​

Using Prometheus, you can collect metrics and analyze them with the Nobl9 agents Grafana dashboard.

  • n9_query_latency: visualization of the n9_query_latency metric
  • n9_upload_latency: visualization of the n9_upload_latency metric
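
For a custom Grafana panel, a typical latency query follows the standard Prometheus histogram pattern. The sketch below assumes the default `_bucket` suffix emitted by the Golang Prometheus libraries; the 95th percentile and 5-minute window are illustrative choices.

```promql
# 95th percentile SLI query latency per plugin over a 5-minute window
histogram_quantile(
  0.95,
  sum(rate(n9_query_latency_bucket[5m])) by (le, plugin_name)
)
```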