Agent metrics

While the Nobl9 agent is a stable and lightweight application, it is useful to have data-based insight into its health: whether it is operational, how it is performing, and whether its resource allocation needs adjusting.

With the agent metrics feature, the agent exposes health and resource utilization data that you can scrape from its /health and /metrics endpoints.

Requirements

You activate the metrics configuration through environment variables set in your Docker container or Kubernetes deployment.

To get the Nobl9 agent metrics, you need to have a monitoring solution that can scrape a Prometheus-compliant metrics endpoint.

Metric endpoints

/health endpoint

This endpoint returns an HTTP/1.1 200 OK response to a standard GET request if the agent is "healthy." The OK response means that the agent code has completed initialization and is running.
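
Since the endpoint only starts returning 200 OK once initialization completes, it can also back a Kubernetes liveness probe. Here's a minimal sketch, assuming the endpoints are exposed on the default port 9090 described below; the delay and period values are placeholders:

# Sketch only: add under the agent container in the Deployment spec.
livenessProbe:
  httpGet:
    path: /health
    port: 9090          # must match the N9_METRICS_PORT value
  initialDelaySeconds: 10
  periodSeconds: 30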

/metrics endpoint

This endpoint returns Prometheus-scrapable metrics data in response to a standard GET request. It is a text-based endpoint handled by the Golang Prometheus libraries.
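
As an illustration, a minimal Prometheus scrape configuration for the agent might look like the sketch below; the job name and target address are placeholders for your own environment, and 9090 is the default metrics port described in the next section:

# Sketch only: adapt the job name and target to your deployment.
scrape_configs:
  - job_name: nobl9-agent                          # placeholder job name
    metrics_path: /metrics
    static_configs:
      - targets: ["nobl9-agent.default.svc:9090"]  # placeholder address; use your agent's host and N9_METRICS_PORT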

Agent's default port

To scrape the agent's metrics, define N9_METRICS_PORT as an environment variable when deploying your agent through the Kubernetes YAML or Docker invocation generated in the UI.

Here's a shortened example of a Kubernetes Deployment YAML with the N9_METRICS_PORT variable defined:

apiVersion: v1
kind: Secret
<...>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nobl9-agent-example
  namespace: default
spec:
  <...>
    spec:
      containers:
        - name: agent-container
          image: nobl9/agent:0.69.4
          resources:
            requests:
              memory: "350Mi"
              cpu: "0.1"
          env:
            <...>
            - name: N9_METRICS_PORT
              value: "9090"
NOTES
  • N9_METRICS_PORT specifies the TCP port on which the /metrics and /health endpoints are exposed.

  • Port 9090 is the default value; you can change it to match the port you use in your infrastructure.

  • If you don't want the agent metrics to be exposed, comment out or delete the N9_METRICS_PORT variable.
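
If you run the agent with Docker Compose instead of the docker run command generated in the UI, the same variable can be set in the service definition. The following is only a sketch: the image tag mirrors the example above, and the credentials from your UI-generated invocation are elided:

services:
  nobl9-agent:
    image: nobl9/agent:0.69.4
    environment:
      # <...> other variables from the UI-generated invocation
      N9_METRICS_PORT: "9090"   # expose /metrics and /health on port 9090
    ports:
      - "9090:9090"             # publish the metrics port so your monitoring stack can scrape it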

List of available agent metrics

The following table lists all agent metrics available at the /health and /metrics endpoints:

| Metric name | Type | Labels | Description | Splunk Obs? |
|---|---|---|---|---|
| n9_accumulated_points | countervec | organization, plugin_name, metric_source (optional)1) | Points added to the accumulator since agent start | Y |
| n9_buffer_capacity | gauge | - | Total capacity of the metrics buffer in the Nobl9 agent | Y |
| n9_buffer_load | gauge | - | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
| n9_bytes_received_total | count | - | Total count of bytes received by the agent since last start | Y |
| n9_bytes_sent_total | count | - | Total count of bytes sent since last agent start | Y |
| n9_emitted_points | counter | - | Points successfully emitted to the Nobl9 platform since agent start | Y |
| n9_input_received_bytes | countervec | organization, plugin_name, metric_source (optional)1) | Bytes received in input responses' bodies since last agent start | Y |
| n9_input_sent_bytes | countervec | organization, plugin_name, metric_source (optional)1) | Bytes sent in input requests' bodies since last agent start | Y |
| n9_last_config_update_success_time | counter | - | Seconds since last successful config read | Y |
| n9_last_input_successful_response_time | gaugevec | organization, plugin_name, metric_source (optional)1) | Seconds since last successful input response | Y |
| n9_last_query_success_time2) | count | - | Seconds since the last successful query was sent. Deprecated: use n9_last_input_successful_response_time instead | Y |
| n9_last_upload_success_time | count | - | Seconds since last successful data upload | Y |
| n9_output_received_bytes | counter | - | Bytes received in output responses' bodies since last agent start | Y |
| n9_output_sent_bytes | counter | - | Bytes sent in output requests' bodies since last agent start | Y |
| n9_payload_received_total2) | count | - | Total count of bytes received in all responses' content since last agent start. Deprecated: use n9_input_received_bytes and n9_output_received_bytes instead | Y |
| n9_payload_sent_total2) | count | - | Total count of bytes sent in all requests' content since last agent start. Deprecated: use n9_input_sent_bytes and n9_output_sent_bytes instead | Y |
| n9_query_delay2) | gauge | - | Query delay configured in seconds. Deprecated: look for the runtime log with the appropriate info | N/A4) |
| n9_query_errors2) | count | status_code | Total count of errors encountered while executing SLI queries since last agent start. Deprecated: use n9_query_total instead | N |
| n9_query_interval2) | gauge | - | Query interval configured in seconds. Deprecated: look for the runtime log with the appropriate info | N/A3) |
| n9_query_latency | histogramvec | organization, plugin_name, metric_source (optional)1) | Latency histogram of all SLI query requests | N |
| n9_query_total | countervec | organization, plugin_name, status_code, metric_source (optional)1) | Total count of all queries run since last agent start | N |
| n9_sli_total | gaugevec | organization, plugin_name, metric_source (optional)1) | Total count of the SLIs for which the agent collects data | Y |
| n9_upload_errors2) | count | status_code | Total count of errors encountered while uploading data to the Nobl9 platform since last agent start. Deprecated: use n9_upload_total instead | Y |
| n9_upload_latency | histogram | - | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
| n9_upload_total | count | status_code | Total count of all data uploaded to the Nobl9 platform since last agent start | Y |
| n9_uptime | count | agent_version | Seconds since agent start | Y |

Important notes

1) To activate, set the N9_INCLUDE_METRIC_SOURCE_NAME_IN_METRICS = "true" environment variable in your agent deployment.
2) Deprecated metrics will be removed in the April agent release.
3), 4) These metrics don't apply when the data source is Splunk Observability, since the agent collects that data over a WebSocket interface rather than through polling queries.
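
For example, in the Kubernetes Deployment shown earlier, the optional metric_source label from note 1) can be enabled by adding one more variable to the container's env list:

env:
  <...>
  - name: N9_INCLUDE_METRIC_SOURCE_NAME_IN_METRICS
    value: "true"   # adds the optional metric_source label to the metrics marked 1) above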

Metrics use cases

Metrics can help you better understand what the Nobl9 agent is doing in your Kubernetes environment.

Peak in agent's memory utilization

For instance, if you detect a peak in memory usage of an active Nobl9 agent, metrics allow you to scrape the agent's utilization data with Prometheus and display a graph in Grafana to see the agent's memory usage over time.

Breaks in agent's operation

You might also encounter a situation where Kubernetes reports the pod in a Running state, but the agent appears to stop operating at times. By scraping metrics data with Prometheus and displaying graphs with Grafana, you can get details about the agent's operation over time.

For instance, you might see spikes in the graph displaying n9_upload_errors (deprecated, use n9_upload_total{status_code!~"200"}) while the graph displaying n9_bytes_sent_total has stopped growing. At the same time, the panel displaying n9_last_upload_success_time indicates uploads have stopped, and the graph displaying n9_buffer_load has also started growing. You can leverage this data with Nobl9 support and your local networking team to help troubleshoot the issue further.
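
If you want Prometheus to flag this pattern automatically, the same signals can be expressed as alerting rules. The sketch below is illustrative only: the group name, thresholds, and durations are placeholders to adapt to your environment.

# Sketch only: example Prometheus alerting rules built from the metrics above.
groups:
  - name: nobl9-agent            # placeholder rule group name
    rules:
      - alert: Nobl9AgentUploadErrors
        # Non-200 upload responses observed over the last 10 minutes.
        expr: sum(rate(n9_upload_total{status_code!~"200"}[10m])) > 0
        for: 10m
      - alert: Nobl9AgentUploadStalled
        # No successful upload to the Nobl9 platform for more than 10 minutes.
        expr: n9_last_upload_success_time > 600
        for: 5m
      - alert: Nobl9AgentBufferGrowing
        # Buffer keeps filling up; the threshold is a placeholder.
        expr: n9_buffer_load > 1000
        for: 15m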

Examples of agent metric visualizations

Using Prometheus, you can collect metrics and analyze them using a tool like Grafana.

  • Image 1: Visualization of the n9_uptime metric
  • Image 2: Visualization of the n9_query_total metric for several Nobl9 agents
  • Image 3: Visualization of the n9_upload_total metric