Agent Metrics

While the Nobl9 Agent is a stable and lightweight application, Nobl9 users still want data-based insight into its health: whether it is operational, how it is performing, and whether its resource allocation needs adjusting.

With the Agent metrics feature, various Agent health and resource utilization metrics are available to scrape at the /health and /metrics endpoints.

Requirements

You can enable the metrics configuration by exposing your Agent data through environment variables in your Docker container or Kubernetes cluster.

To get the Nobl9 Agent metrics, you need to have a monitoring solution that can scrape a Prometheus-compliant metrics endpoint.
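
For example, a minimal Prometheus scrape configuration for the Agent could look like the sketch below. The job name and target address are illustrative assumptions; point the target at the host and N9_METRICS_PORT value used in your own deployment.

# prometheus.yml (fragment) -- illustrative scrape job for the Nobl9 Agent
scrape_configs:
  - job_name: "nobl9-agent"
    metrics_path: /metrics
    static_configs:
      - targets: ["nobl9-agent.default.svc.cluster.local:9090"]  # host:port exposing N9_METRICS_PORT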

Metric Endpoints

/health Endpoint

This endpoint returns an HTTP/1.1 200 OK response to a standard GET request if the Agent is "healthy." The OK response means that the Agent code has completed initialization and is running.
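
For example, assuming the Agent exposes metrics on the default port 9090 and is reachable on localhost, you can check its health with a plain GET:

# Expect HTTP/1.1 200 OK once the Agent has finished initializing and is running
curl -i http://localhost:9090/health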

/metrics Endpoint

This endpoint returns Prometheus-scrapable metrics data in response to a standard GET request. It is a text-based endpoint handled by the Golang Prometheus libraries.
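
Under the same assumptions (default port 9090, reachable on localhost), you can view the raw exposition text directly:

# Fetch the Prometheus text-format metrics from the Agent
curl -s http://localhost:9090/metrics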

Agent's Default Port

To scrape the Agent's metrics, define N9_METRICS_PORT as an environment variable when deploying your Agent through the Kubernetes YAML or Docker invocation generated in the UI:

Here's a shortened example of a Kubernetes deployment YAML with the N9_METRICS_PORT variable defined:

apiVersion: v1
kind: Secret
<...>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nobl9-agent-example
  namespace: default
spec:
  <...>
    spec:
      containers:
        - name: agent-container
          image: nobl9/agent:latest
          resources:
            requests:
              memory: "350Mi"
              cpu: "0.1"
          env:
            <...>
            - name: N9_METRICS_PORT
              value: "9090"
NOTES
  • N9_METRICS_PORT is a variable specifying the TCP port on which the /metrics and /health endpoints are exposed.

  • 9090 is the default value; you can change it to match the port you use in your infrastructure.

  • If you don't want the Agent metrics to be exposed, comment out or delete the N9_METRICS_PORT variable.
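
If you run the Agent with Docker instead of Kubernetes, the same variable is passed on the command line. The invocation below is a sketch only; it omits the credential environment variables that your UI-generated command includes:

# Illustrative only: set N9_METRICS_PORT and publish the port so your monitoring stack can reach /health and /metrics
docker run -d \
  -e N9_METRICS_PORT=9090 \
  -p 9090:9090 \
  <...other environment variables from your UI-generated invocation...> \
  nobl9/agent:latest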

List of Available Agent Metrics

The following is the list of all Agent metrics available at the /health and /metrics endpoints:

Metric name | Type | Labels | Description | Splunk Obs?
n9_sli_total | gauge | | Total count of the SLIs for which the Agent collects data | Y
n9_query_latency | histogram | | Latency histogram of all SLI query requests | N
n9_upload_latency | histogram | | Latency histogram of all data uploads by SLI to the N9 platform | N
n9_query_errors | count | HTTP code | Total count of errors encountered while executing SLI queries since last Agent start | N
n9_query_total | count | HTTP code | Total count of all queries run since last Agent start | N
n9_upload_errors | count | HTTP code | Total count of errors encountered while uploading data to the N9 platform since last Agent start | Y
n9_upload_total | count | HTTP code | Total count of all data uploaded to the N9 platform since last Agent start | Y
n9_bytes_received_total | count | | Total count of bytes received by the Agent since last start | Y
n9_bytes_sent_total | count | | Total count of bytes sent since last Agent start | Y
n9_payload_sent_total | count | | Total count of bytes sent in all requests' content since last Agent start | Y
n9_payload_received_total | count | | Total count of bytes received in all responses' content since last Agent start | Y
n9_last_query_success_time | count | | Seconds since the last successful query was sent | Y
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y
n9_uptime | count | Agent version | Seconds since Agent start | Y
n9_query_interval | gauge | | Query interval configured in seconds | N/A 1)
n9_query_delay | gauge | | Query delay configured in seconds | N/A 2)
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the N9 Agent | Y
n9_buffer_load | count | | Total count of metrics in the buffer that have not yet been successfully uploaded to the N9 platform | Y
1), 2): The Nobl9 Agent doesn't process these metrics since they're based on the WebSocket interface in Splunk Observability.
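
Once scraped, these metrics can be queried like any other Prometheus series. The PromQL below is a sketch using metric names from the table above; any additional labels (such as instance) come from your scrape configuration:

# Rate of SLI query errors per second over the last 5 minutes
rate(n9_query_errors[5m])

# Current buffer load as a fraction of total buffer capacity
n9_buffer_load / n9_buffer_capacity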

Metrics Use Cases

Metrics can help you better understand what the Nobl9 Agent is doing in your Kubernetes environment.

Peak in Agent's Memory Utilization

For instance, if you detect a peak in memory usage of an active Nobl9 Agent, metrics allow you to scrape the Agent's utilization data with Prometheus and display a graph in Grafana to see the Agent's memory usage over time.
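
For example, if the Agent runs in Kubernetes, a Grafana panel for this can be driven by a query along the following lines. This is a sketch: the memory series comes from cAdvisor/kubelet rather than from the Agent itself, and the pod label pattern depends on your deployment name:

# Working-set memory of the Agent pod over time (cAdvisor metric, not an Agent metric)
container_memory_working_set_bytes{pod=~"nobl9-agent.*"}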

Breaks in Agent's Operation

You might also encounter a situation where Kubernetes reports the pod is in a Running state, but the Agent appears to stop operating at times. By scraping metrics data with Prometheus and displaying graphs with Grafana, you can get details about the Agent's operation over time.

For instance, you might see spikes in the graph displaying n9_upload_errors while the graph displaying n9_bytes_sent_total has stopped growing. At the same time, the panel displaying n9_last_upload_success_time indicates uploads have stopped, and the graph displaying n9_buffer_load has also started growing. You can share this data with Nobl9 support and your local networking team to help troubleshoot the issue further.
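
A dashboard built from queries like the following (illustrative PromQL using the metric names listed above) makes this pattern easy to spot:

# Upload errors per second, broken down by HTTP code
rate(n9_upload_errors[5m])

# Seconds since the last successful upload; a steadily growing value means uploads have stopped
n9_last_upload_success_time

# Metrics waiting in the buffer that have not yet reached the Nobl9 platform
n9_buffer_load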

Examples of Agent Metric Visualizations

Using Prometheus, you can collect metrics and analyze them using a tool like Grafana.

  • Here's an example of a Grafana visualization of the n9_uptime metric:
    Image 1: Visualization of the n9_uptime metric
  • Here's an example of a Grafana visualization of the n9_query_total metric for several Nobl9 Agents:
    Image 2: Visualization of the n9_query_total metric
  • Here's an example of a Grafana visualization of the n9_upload_total metric:
    Image 3: Visualization of the n9_upload_total metric