Agent metrics
While the Nobl9 agent is a stable and lightweight application, you may still want data-based insight into its health: whether it is operational, how it is performing, and whether its resource allocation needs adjusting.
With the agent metrics feature, agent health and resource utilization numbers are available to scrape at the /health and /metrics endpoints.
Requirements
You can activate metrics by exposing your agent data via environment variables in your Docker container or Kubernetes cluster.
To get the Nobl9 agent metrics, you need to have a monitoring solution that can scrape a Prometheus-compliant metrics endpoint.
Metric endpoints
/health endpoint
This endpoint returns an HTTP/1.1 200 OK response to a standard GET request if the agent is "healthy." The OK response means that the agent code has completed initialization and is running.
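As a rough illustration of the check described above, the following Python sketch sends a GET request to /health and treats only a 200 response as healthy. The function name `agent_is_healthy` and the example address are assumptions made for this sketch; they are not part of the Nobl9 agent or its tooling.

```python
import http.client
import urllib.request

# Skip any system-wide HTTP proxy: the agent is normally reached directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def agent_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    `base_url` is a hypothetical address such as "http://localhost:9090",
    where 9090 is the value given to N9_METRICS_PORT.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, DNS failures, and non-2xx statuses
        # (urllib raises HTTPError, an OSError subclass) all mean "not healthy".
        return False
```

Calling `agent_is_healthy("http://localhost:9090")` would return True only while the agent answers on that port; any connection error is reported as False rather than raised.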
/metrics endpoint
This endpoint returns Prometheus-scrapable metrics data in response to a standard GET request. It is a text-based endpoint handled by the Golang Prometheus libraries.
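Because the endpoint is text-based, its output can also be inspected without a full monitoring stack. The sketch below is a deliberately minimal reader for the Prometheus text exposition format, assuming one sample per line in the form `name{labels} value`; it is not the parser used by Prometheus itself, and the function name `parse_metrics` is hypothetical. Label sets are kept as raw strings instead of being split into key/value pairs.

```python
import re
from typing import Dict, Tuple

# One sample line: metric name, optional {labels} block, then the value.
# Anything after the value (such as an optional timestamp) is ignored.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def parse_metrics(text: str) -> Dict[Tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value."""
    samples: Dict[Tuple[str, str], float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        try:
            samples[(name, labels)] = float(value)
        except ValueError:
            # Not a numeric sample; ignore it in this sketch.
            continue
    return samples
```

For example, a response body containing the line `n9_buffer_capacity 1000` would yield the entry `("n9_buffer_capacity", "")` with the value `1000.0`.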
Agent's default port
To scrape the agent's metrics, define N9_METRICS_PORT as an environment variable when deploying your agent through the Kubernetes YAML or the Docker invocation generated in the UI.
Here's a shortened example of a Kubernetes deployment YAML with the N9_METRICS_PORT variable defined:
apiVersion: v1
kind: Secret
<...>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nobl9-agent-example
  namespace: default
spec:
  <...>
    spec:
      containers:
        - name: agent-container
          image: nobl9/agent:0.82.2
          resources:
            requests:
              memory: "350Mi"
              cpu: "0.1"
          env:
            <...>
            - name: N9_METRICS_PORT
              value: "9090"
Here's an example of a Docker deployment command with the N9_METRICS_PORT variable defined:
docker run -d --restart on-failure \
  --name nobl9-agent-example \
  -e N9_CLIENT_ID="unique_client_id" \
  -e N9_CLIENT_SECRET="unique_client_secret" \
  -e N9_METRICS_PORT="9090" \
  nobl9/agent:0.82.2
- N9_METRICS_PORT is a variable specifying the TCP port on which the /metrics and /health endpoints are exposed.
- Port 9090 is the default value, and you can change it to match the port that you're using on your infrastructure.
- If you don't want the agent metrics to be exposed, comment out or delete the N9_METRICS_PORT variable.
List of available agent metrics
Metrics available in stable versions
The following is the list of all agent metrics available at the /health and /metrics endpoints in stable versions:
- Since 0.80.0
- Since 0.76.0
- Since 0.73.2
- Since 0.71.0
- Before 0.71.0
Since 0.80.0
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_accumulated_points | countervec | | Points added to accumulator since agent start | Y |
n9_all_buffer_metrics_dropped | count | | Total number of dropped metrics due to the buffer overflow, unlabeled | Y |
n9_buffer_dropped_metrics | countervec | | Number of dropped metrics due to the buffer overflow, labeled | Y |
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_emitted_points | countervec | | Points successfully emitted to the N9 platform since the agent start | Y |
n9_input_received_bytes | countervec | | Bytes received in input responses' bodies since the last agent start | Y |
n9_input_sent_bytes | countervec | | Bytes sent in input requests' bodies since the last agent start | Y |
n9_last_config_update_success_time | countervec | | Seconds since the last successful config read | Y |
n9_last_input_successful_response_time | gaugevec | | Seconds since the last successful input response | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start | Y |
n9_output_sent_bytes | counter | | Bytes sent in output requests' bodies since the last agent start | Y |
n9_query_latency | histogramvec | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gaugevec | | Total count of the SLIs the agent collects data for | Y |
n9_upload_latency | histogramvec | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Since 0.76.0
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_accumulated_points | countervec | | Points added to accumulator since agent start | Y |
n9_all_buffer_metrics_dropped | count | | Total number of dropped metrics due to the buffer overflow, unlabeled | Y |
n9_buffer_dropped_metrics | countervec | | Number of dropped metrics due to the buffer overflow, labeled | Y |
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_emitted_points | countervec | | Points successfully emitted to the N9 platform since the agent start | Y |
n9_input_received_bytes | countervec | | Bytes received in input responses' bodies since the last agent start | Y |
n9_input_sent_bytes | countervec | | Bytes sent in input requests' bodies since the last agent start | Y |
n9_last_config_update_success_time | countervec | | Seconds since the last successful config read | Y |
n9_last_input_successful_response_time | gaugevec | | Seconds since the last successful input response | Y |
n9_last_query_success_time | count | | Seconds since the last successful query was sent. Replaced with n9_last_input_successful_response_time | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start | Y |
n9_output_sent_bytes | counter | | Bytes sent in output requests' bodies since the last agent start | Y |
n9_payload_received_total | count | | Total count of bytes requested in all responses' content since the last agent start. Replaced with n9_input_received_bytes and n9_output_received_bytes | Y |
n9_payload_sent_total | count | | Total count of bytes sent in all requests' content since the last agent start. Replaced with n9_input_sent_bytes and n9_output_sent_bytes | Y |
n9_query_delay | gauge | | Query delay configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_errors | countervec | | Total count of errors encountered while executing SLI queries since the last agent start. Replaced with n9_query_total | N |
n9_query_interval | gaugevec | | Query interval configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_latency | histogramvec | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gaugevec | | Total count of the SLIs the agent collects data for | Y |
n9_upload_errors | countervec | | Total count of errors encountered while uploading data to the Nobl9 platform since the last agent start. Replaced with n9_upload_total | Y |
n9_upload_latency | histogramvec | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Since 0.73.2
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_accumulated_points | countervec | | Points added to accumulator since agent start | Y |
n9_all_buffer_metrics_dropped | count | | Total number of dropped metrics due to the buffer overflow, unlabeled | Y |
n9_buffer_dropped_metrics | countervec | | Number of dropped metrics due to the buffer overflow, labeled | Y |
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_emitted_points | countervec | | Points successfully emitted to the N9 platform since the agent start | Y |
n9_input_received_bytes | countervec | | Bytes received in input responses' bodies since the last agent start | Y |
n9_input_sent_bytes | countervec | | Bytes sent in input requests' bodies since the last agent start | Y |
n9_last_config_update_success_time | countervec | | Seconds since the last successful config read | Y |
n9_last_input_successful_response_time | gaugevec | | Seconds since the last successful input response | Y |
n9_last_query_success_time | count | | Seconds since the last successful query was sent. Replaced with n9_last_input_successful_response_time | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start | Y |
n9_output_sent_bytes | counter | | Bytes sent in output requests' bodies since the last agent start | Y |
n9_payload_received_total | count | | Total count of bytes requested in all responses' content since the last agent start. Replaced with n9_input_received_bytes and n9_output_received_bytes | Y |
n9_payload_sent_total | count | | Total count of bytes sent in all requests' content since the last agent start. Replaced with n9_input_sent_bytes and n9_output_sent_bytes | Y |
n9_query_delay | gauge | | Query delay configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_errors | countervec | | Total count of errors encountered while executing SLI queries since the last agent start. Replaced with n9_query_total | N |
n9_query_interval | gaugevec | | Query interval configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_latency | histogramvec | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gaugevec | | Total count of the SLIs the agent collects data for | Y |
n9_upload_errors | countervec | | Total count of errors encountered while uploading data to the Nobl9 platform since the last agent start. Replaced with n9_upload_total | Y |
n9_upload_latency | histogramvec | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Since 0.71.0
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_accumulated_points | countervec | | Points added to the accumulator since the agent start | Y |
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_emitted_points | counter | | Points successfully emitted to the N9 platform since the agent start | Y |
n9_input_received_bytes | countervec | | Bytes received in input responses' bodies since the last agent start | Y |
n9_input_sent_bytes | countervec | | Bytes sent in input requests' bodies since the last agent start | Y |
n9_last_config_update_success_time | counter | | Seconds since the last successful config read | Y |
n9_last_input_successful_response_time | gaugevec | | Seconds since the last successful input response | Y |
n9_last_query_success_time | count | | Seconds since the last successful query was sent. Replaced with n9_last_input_successful_response_time | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start | Y |
n9_output_sent_bytes | counter | | Bytes sent in output requests' bodies since the last agent start | Y |
n9_payload_received_total | count | | Total count of bytes requested in all responses' content since the last agent start. Replaced with n9_input_received_bytes and n9_output_received_bytes | Y |
n9_payload_sent_total | count | | Total count of bytes sent in all requests' content since the last agent start. Replaced with n9_input_sent_bytes and n9_output_sent_bytes | Y |
n9_query_delay | gauge | | Query delay configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_errors | countervec | | Total count of errors encountered while executing SLI queries since the last agent start. Replaced with n9_query_total | N |
n9_query_interval | gaugevec | | Query interval configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_latency | histogramvec | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gaugevec | | Total count of the SLIs the agent collects data for | Y |
n9_upload_errors | countervec | | Total count of errors encountered while uploading data to the Nobl9 platform since the last agent start. Replaced with n9_upload_total | Y |
n9_upload_latency | histogram | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Before 0.71.0
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | count | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_last_query_success_time | count | | Seconds since the last successful query was sent | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_payload_received_total | count | | Total count of bytes requested in all responses' content since the last agent start | Y |
n9_payload_sent_total | count | | Total count of bytes sent in all requests' content since the last agent start | Y |
n9_query_delay | gauge | | Query delay configured in seconds | N/A |
n9_query_errors | countervec | | Total count of errors encountered while executing SLI queries since the last agent start | N |
n9_query_interval | gauge | | Query interval configured in seconds | N/A |
n9_query_latency | histogram | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gauge | | Total count of the SLIs the agent collects data for | Y |
n9_upload_errors | countervec | | Total count of errors encountered while uploading data to the Nobl9 platform since the last agent start | Y |
n9_upload_latency | histogram | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Metrics available in beta versions
The agent metrics available at the /health and /metrics endpoints in beta agent versions are as follows:
- Since 0.78.0-beta
- Since 0.74.0-beta
Since 0.78.0-beta
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_accumulated_points | countervec | | Points added to accumulator since agent start | Y |
n9_all_buffer_metrics_dropped | count | | Total number of dropped metrics due to the buffer overflow, unlabeled | Y |
n9_buffer_dropped_metrics | countervec | | Number of dropped metrics due to the buffer overflow, labeled | Y |
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_emitted_points | countervec | | Points successfully emitted to the N9 platform since the agent start | Y |
n9_input_received_bytes | countervec | | Bytes received in input responses' bodies since the last agent start | Y |
n9_input_sent_bytes | countervec | | Bytes sent in input requests' bodies since the last agent start | Y |
n9_last_config_update_success_time | countervec | | Seconds since the last successful config read | Y |
n9_last_input_successful_response_time | gaugevec | | Seconds since the last successful input response | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start | Y |
n9_output_sent_bytes | counter | | Bytes sent in output requests' bodies since the last agent start | Y |
n9_query_latency | histogramvec | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gaugevec | | Total count of the SLIs the agent collects data for | Y |
n9_upload_latency | histogramvec | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Since 0.74.0-beta
Metric name | Type | Labels | Description | Splunk Obs? |
---|---|---|---|---|
n9_accumulated_points | countervec | | Points added to accumulator since agent start | Y |
n9_all_buffer_metrics_dropped | count | | Total number of dropped metrics due to the buffer overflow, unlabeled | Y |
n9_buffer_dropped_metrics | countervec | | Number of dropped metrics due to the buffer overflow, labeled | Y |
n9_buffer_capacity | gauge | | Total capacity of the metrics buffer in the Nobl9 agent | Y |
n9_buffer_load | gauge | | Total count of metrics in the buffer that have not yet been successfully uploaded to the Nobl9 platform | Y |
n9_bytes_received_total | count | | Total count of bytes received by the agent since the last start | Y |
n9_bytes_sent_total | count | | Total count of bytes sent since the last agent start | Y |
n9_emitted_points | countervec | | Points successfully emitted to the N9 platform since the agent start | Y |
n9_input_received_bytes | countervec | | Bytes received in input responses' bodies since the last agent start | Y |
n9_input_sent_bytes | countervec | | Bytes sent in input requests' bodies since the last agent start | Y |
n9_last_config_update_success_time | countervec | | Seconds since the last successful config read | Y |
n9_last_input_successful_response_time | gaugevec | | Seconds since the last successful input response | Y |
n9_last_query_success_time | count | | Seconds since the last successful query was sent. Replaced with n9_last_input_successful_response_time | Y |
n9_last_upload_success_time | count | | Seconds since the last successful data upload | Y |
n9_output_received_bytes | counter | | Bytes received in output responses' bodies since the last agent start | Y |
n9_output_sent_bytes | counter | | Bytes sent in output requests' bodies since the last agent start | Y |
n9_payload_received_total | count | | Total count of bytes requested in all responses' content since the last agent start. Replaced with n9_input_received_bytes and n9_output_received_bytes | Y |
n9_payload_sent_total | count | | Total count of bytes sent in all requests' content since the last agent start. Replaced with n9_input_sent_bytes and n9_output_sent_bytes | Y |
n9_query_delay | gauge | | Query delay configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_errors | countervec | | Total count of errors encountered while executing SLI queries since the last agent start. Replaced with n9_query_total | N |
n9_query_interval | gaugevec | | Query interval configured in seconds. Due to deprecation, look for runtime log with the appropriate info | N/A |
n9_query_latency | histogramvec | | Latency histogram of all SLI query requests | N |
n9_query_total | countervec | | Total count of all queries run since the last agent start | N |
n9_sli_total | gaugevec | | Total count of the SLIs the agent collects data for | Y |
n9_upload_errors | countervec | | Total count of errors encountered while uploading data to the Nobl9 platform since the last agent start. Replaced with n9_upload_total | Y |
n9_upload_latency | histogramvec | | Latency histogram of all data uploads by SLI to the Nobl9 platform | N |
n9_upload_total | countervec | | Total count of all data uploaded to the Nobl9 platform since the last agent start | Y |
n9_uptime | countervec | | Seconds since the agent start | Y |
Metrics use cases
Metrics can help you better understand what the Nobl9 agent is doing in your Kubernetes environment.
Peak in agent's memory utilization
For instance, if you detect a peak in the memory usage of an active Nobl9 agent, you can scrape the agent's utilization data with Prometheus and display a graph in Grafana to see the agent's memory usage over time.
Breaks in agent's operation
You might also encounter a situation where Kubernetes reports the pod is in a Running state, but the agent appears to stop operating at times. By scraping metrics data with Prometheus and displaying graphs with Grafana, you can get details about the agent's operation over time.
For instance, you might see spikes in the graph displaying n9_upload_errors (deprecated, use n9_upload_total{status_code!~"200"}) while the graph displaying n9_bytes_sent_total has stopped growing. At the same time, the panel displaying n9_last_upload_success_time indicates uploads have stopped, and the graph displaying n9_buffer_load has also started growing. You can leverage this data with Nobl9 support and your local networking team to help troubleshoot the issue further.
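The pattern described above can also be checked programmatically by comparing two scrapes taken some time apart. The following is a minimal sketch under stated assumptions: each snapshot is a plain dictionary that you have already built from scraped values, and the key `upload_failures` is a hypothetical name for the sum of all n9_upload_total samples whose status_code is not 200. Neither the function nor the snapshot layout comes from Nobl9.

```python
from typing import Dict, List


def upload_stall_symptoms(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """List which symptoms of stalled uploads appear between two snapshots.

    Expected keys (an assumption of this sketch):
      n9_bytes_sent_total  - value of the metric with the same name
      n9_buffer_load       - value of the metric with the same name
      upload_failures      - sum of n9_upload_total samples with a non-200 status_code
    """
    symptoms: List[str] = []
    # Bytes sent is cumulative, so no increase means nothing was sent in between.
    if after["n9_bytes_sent_total"] <= before["n9_bytes_sent_total"]:
        symptoms.append("n9_bytes_sent_total has stopped growing")
    # A rising buffer load means collected data is piling up in the agent.
    if after["n9_buffer_load"] > before["n9_buffer_load"]:
        symptoms.append("n9_buffer_load is growing")
    # More non-200 upload responses than before means uploads are failing.
    if after["upload_failures"] > before["upload_failures"]:
        symptoms.append("failed uploads are increasing")
    return symptoms
```

An empty result means none of the three symptoms appeared between the two snapshots. When all three are present, the situation matches the scenario above and is worth raising with Nobl9 support and your networking team.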
Examples of agent metric visualizations
Using Prometheus, you can collect metrics and analyze them using a tool like Grafana.
- Here's an example of a Grafana visualization of the n9_uptime metric:
- Here's an example of a Grafana visualization of the n9_query_total metric for several Nobl9 agents:
- Here's an example of a Grafana visualization of the n9_upload_total metric: