Grafana Loki


Grafana Loki (or Loki) is a horizontally scalable, multi-tenant log aggregation system that is extremely easy to operate. Loki does not index the contents of the logs, but rather a set of labels for each log stream. Nobl9 users can leverage Loki to query and build metrics on top of their logs.

Grafana Loki parameters and supported features in Nobl9

General support:
Release channel: Stable, Beta
Connection method: Agent
Replay and SLI Analyzer: Not supported
Event logs: Not supported
Query checker: Not supported
Query parameters retrieval: Supported
Timestamp cache persistence: Not supported

Query parameters:
Query interval: 1 min
Query delay: 1 min
Jitter: 15 sec
Timeout: 30 sec

Agent details and minimum required versions for supported features:
Plugin name: n9grafana_loki
Query delay environment variable: GRAFANA_LOKI_QUERY_DELAY
Query parameters retrieval: 0.73.2
Custom HTTP headers: 0.88.0 / 0.88.0-beta

Creating SLOs with Grafana Loki

Nobl9 Web

Follow the instructions below to create your SLOs with Grafana Loki in the UI:

Navigate to Service Level Objectives.
Click .
In step 1 of the SLO wizard, select the Service the SLO will be associated with.
In step 2, select Grafana Loki as the data source for your SLO.
Specify the Metric. You can choose either a Threshold Metric, where a single time series is evaluated against a threshold, or a Ratio Metric, where you enter two time series to compare (for example, a count of good requests and a count of total requests).
Choose the Data Count Method for your ratio metric:

Non-incremental: counts incoming metric values one by one, so the resulting SLO graph is spike-shaped.
Incremental: counts the incoming metric values incrementally, adding each new value to the previous ones. This results in a constantly increasing SLO graph.

Enter a Query, or Good Query and Total Query for the metric you selected.
Refer to the Query Examples section below for more details.

SLI values for good and total

When choosing the query for the ratio SLI (countMetrics), keep in mind that the values resulting from that query for both good and total:

Must be positive.
While we recommend using integers, fractions are also acceptable.

If you use fractions, we recommend values larger than 1e-4 = 0.0001.

Shouldn't be larger than 1e+20.
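
As a rough illustration of the value guidance above, the following Python sketch flags data points that fall outside the recommended ranges. The function name check_sli_value and its messages are hypothetical and not part of Nobl9. The sketch also assumes that zero is acceptable, which the guidance does not state explicitly.

```python
def check_sli_value(value: float) -> list[str]:
    """Return warnings for a single good or total data point."""
    warnings = []
    if value < 0:
        # Assumption: only negative values are rejected; zero is let through.
        warnings.append("value must not be negative")
    if value > 1e20:
        warnings.append("value should not be larger than 1e+20")
    if 0 < value <= 1e-4:
        # Any positive value this small is necessarily a fraction.
        warnings.append("fractional value should be larger than 1e-4")
    return warnings


print(check_sli_value(42))       # integer within range, no warnings
print(check_sli_value(0.00001))  # fraction below the recommended minimum
```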

In step 3, define a Time Window for the SLO.

Rolling time windows are better for tracking the recent user experience of a service.
Calendar-aligned windows are best suited for SLOs that are intended to map to business metrics measured on a calendar-aligned basis, such as every calendar month or every quarter.

In step 4, specify the Error Budget Calculation Method and your Objective(s).

Occurrences method counts good attempts against the count of total attempts.
Time Slices method measures how many good minutes were achieved (when a system operates within defined boundaries) during a time window.
You can define up to 12 objectives for an SLO.

See the use case example and the SLO calculations guide for more information on the error budget calculation methods.
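
The difference between the two methods can be sketched in Python as follows. This is a simplified illustration, not Nobl9's actual calculation: the function names are hypothetical, and the Time Slices sketch assumes each minute has already been classified as good or bad.

```python
def occurrences_reliability(good: float, total: float) -> float:
    """Occurrences: share of good attempts among all attempts."""
    if total <= 0:
        raise ValueError("total must be greater than zero")
    return good / total


def time_slices_reliability(minute_is_good: list[bool]) -> float:
    """Time Slices: share of good minutes in the time window."""
    if not minute_is_good:
        raise ValueError("the window must contain at least one minute")
    return sum(minute_is_good) / len(minute_is_good)


# 950 good attempts out of 1000 in total.
print(occurrences_reliability(950, 1000))

# A one-hour window with 57 good minutes and 3 bad ones.
print(time_slices_reliability([True] * 57 + [False] * 3))
```

Either result can then be compared with the objective's target, for example 0.95.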

In step 5, add the Display name, Name, and other settings for your SLO:

Set notification on no data, if this option is available for your data source.
When activated, Nobl9 notifies you if your SLO hasn't received data or received incomplete data for more than 15 minutes.
Add alert policies, labels, and links, if required.
You can add up to 20 links per SLO.

Click Create SLO.
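
To illustrate the two data count methods from step 2, the Python sketch below contrasts a series of per-interval values with its running total. The sample values and the function name running_total are invented for illustration. The sketch shows only the shape of the two series, not how Nobl9 processes them internally.

```python
from itertools import accumulate


def running_total(values: list[float]) -> list[float]:
    """Add every next value to the sum of the previous ones."""
    return list(accumulate(values))


# Non-incremental style: each value stands on its own, so the series
# rises and falls from one interval to the next.
per_interval = [5, 0, 7, 2]

# Incremental style: values accumulate, so the series never decreases.
cumulative = running_total(per_interval)

print(per_interval)
print(cumulative)
```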

sloctl

Threshold (rawMetric)
Ratio (countMetrics)

Sample Grafana Loki threshold SLO
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: api-server-slo
  displayName: API Server SLO
  project: default
  labels:
    area:
      - latency
      - slow-check
    env:
      - prod
      - dev
    region:
      - us
      - eu
    team:
      - green
      - sales
  annotations:
    area: latency
    env: prod
    region: us
    team: sales
spec:
  description: Example Grafana Loki SLO
  indicator:
    metricSource:
      name: grafana-loki
      project: default
      kind: Agent
  budgetingMethod: Occurrences
  objectives:
    - displayName: Good response (200)
      value: 200
      name: ok
      target: 0.95
      rawMetric:
        query:
          grafanaLoki:
            logql: >-
              sum(sum_over_time({topic="cdc"} |= "kafka_consumergroup_lag" |
              logfmt | line_format "{{.kafka_consumergroup_lag}}" | unwrap
              kafka_consumergroup_lag [1m]))
      op: lte
      primary: true
  service: api-server
  timeWindows:
    - unit: Month
      count: 1
      isRolling: false
      calendar:
        startTime: '2022-12-01 00:00:00'
        timeZone: UTC
  alertPolicies:
    - fast-burn-5x-for-last-10m
  attachments:
    - url: https://docs.nobl9.com
      displayName: Nobl9 Documentation
  anomalyConfig:
    noData:
      alertMethods:
        - name: slack-notification
          project: default
      alertAfter: 1h

Sample Grafana Loki ratio SLO
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: api-server-slo
  displayName: API Server SLO
  project: default
  labels:
    area:
      - latency
      - slow-check
    env:
      - prod
      - dev
    region:
      - us
      - eu
    team:
      - green
      - sales
  annotations:
    area: latency
    env: prod
    region: us
    team: sales
spec:
  description: Example Grafana Loki SLO
  indicator:
    metricSource:
      name: grafana-loki
      project: default
      kind: Agent
  budgetingMethod: Occurrences
  objectives:
    - displayName: Good response (200)
      value: 1
      name: ok
      target: 0.95
      countMetrics:
        incremental: true
        good:
          grafanaLoki:
            logql: >-
              count(count_over_time(({component="api-server"} | json |
              line_format "{{.log}}" | json | http_status_code >= 200 and
              http_status_code < 300)[1m]))
        total:
          grafanaLoki:
            logql: >-
              count(count_over_time(({component="api-server"} | json |
              line_format "{{.log}}" | json | http_status_code > 0)[1m]))
      primary: true
  service: api-server
  timeWindows:
    - unit: Month
      count: 1
      isRolling: false
      calendar:
        startTime: '2022-12-01 00:00:00'
        timeZone: UTC
  alertPolicies:
    - fast-burn-5x-for-last-10m
  attachments:
    - url: https://docs.nobl9.com
      displayName: Nobl9 Documentation
  anomalyConfig:
    noData:
      alertMethods:
        - name: slack-notification
          project: default
      alertAfter: 1h

Metrics for Grafana Loki have one mandatory field:

logql is a query written in LogQL, Grafana Loki's query language, which is modeled on PromQL (Prometheus Query Language). For more details, refer to Introduction to PromQL | Grafana documentation. You can see working examples of Grafana Loki queries in the Query examples section below.

Query examples

Ratio metric for Grafana Loki:
Good Query:
count(count_over_time(({app="nobl9", component="ingest", container="ingest-container"} | json | line_format "{{.log}}" | json | http_useragent != "ELB-HealthChecker/2.0" | http_status_code >= 200 and http_status_code < 300)[1m]))
Total Query:
count(count_over_time(({app="nobl9", component="ingest", container="ingest-container"} | json | line_format "{{.log}}" | json | http_useragent != "ELB-HealthChecker/2.0" | http_status_code > 0)[1m]))

Querying the Grafana Loki server

Nobl9 calls the Loki API every minute to retrieve the log data from the previous minute. Nobl9 aggregates the total number of points to 4 per minute.
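
For orientation, a request covering the previous minute could be built against Loki's query_range HTTP endpoint as in the Python sketch below. This is not the Nobl9 agent's implementation: the base URL, the function names, and the 15-second step are assumptions made for illustration, and the exact parameters Nobl9 sends are not documented here.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical Loki address; 3100 is Loki's default HTTP port.
LOKI_BASE_URL = "http://localhost:3100"


def previous_minute_params(logql: str, now: datetime) -> dict[str, str]:
    """Build query_range parameters covering the minute before `now`."""
    # Whole seconds expressed as Unix nanoseconds; sub-second parts are dropped.
    end_ns = int(now.timestamp()) * 10**9
    start_ns = end_ns - 60 * 10**9
    return {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "step": "15s",  # assumed resolution, not confirmed by the source
    }


def query_range_url(params: dict[str, str]) -> str:
    """Return the full request URL with URL-encoded parameters."""
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{urlencode(params)}"


params = previous_minute_params(
    'count(count_over_time({component="api-server"}[1m]))',
    datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc),
)
print(query_range_url(params))
```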

caution

Users should refrain from adding a duration to the query, because Nobl9 appends [1m] to it.

Useful links

For a more in-depth look, consult additional resources:

Add Grafana Loki as a data source | Adding data sources

Grafana HTTP API | Grafana Loki documentation

Introduction to PromQL | Grafana documentation

Creating SLOs via Terraform | Terraform
