
Composite SLOs 1.0 use cases


Example 1: Combining different SLOs for a service to get an overall health score

Our shopping service is a user-facing HTTP server that allows users to make purchases. It is business-critical that this service is available and performant. We want to observe its overall reliability across availability, latency, and error rate, and we use the Timeslices budgeting method.

For our service to be considered reliable enough in a given minute, it must be available and handle requests quickly and without errors. If any of those conditions are unmet, we want to subtract that minute from our error budget.
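
The per-minute rule above can be sketched in Python. This is a minimal illustration of the idea, not Nobl9's implementation: the function name `composite_good_minutes` and the five-minute sample data are hypothetical.

```python
from typing import Dict, List


def composite_good_minutes(per_objective: Dict[str, List[bool]]) -> List[bool]:
    """Return one flag per minute: True only if every objective was good in that minute."""
    series = list(per_objective.values())
    if not series:
        return []
    length = len(series[0])
    if any(len(s) != length for s in series):
        raise ValueError("all objectives must cover the same minutes")
    return [all(minute) for minute in zip(*series)]


# Hypothetical five-minute sample: True means the objective was met in that minute.
sample = {
    "availability": [True, True, True, False, True],
    "error-rate": [True, False, True, True, True],
    "latency": [True, True, True, True, True],
}

good = composite_good_minutes(sample)
bad_minutes = good.count(False)
print(good)         # [True, False, True, False, True]
print(bad_minutes)  # 2
```

In this sample, two minutes would be subtracted from the error budget, even though no single objective failed more than once.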

We are creating a composite SLO that represents the above reliability requirements:

apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: shop-http-server-overall-health
  project: production
spec:
  alertPolicies: []
  budgetingMethod: Timeslices
  composite:
    target: 0.9
  description: |-
    This composite SLO combines three SLOs of the same service (latency, availability, and error rate) into a single
    overall health score.
  indicator:
    metricSource:
      kind: Direct
      name: datadog-prod
      project: default
  objectives:
    # shop-http-server is deployed in a Kubernetes cluster with several replicas. For the service to be available, at least one replica must be available.
    - name: availability
      # We expect at least 1 replica to be available
      op: gte
      value: 1
      rawMetric:
        query:
          datadog:
            # We get the number of available replicas from metrics exposed by the Kubernetes cluster:
            query: max:kubernetes_state.deployment.replicas_available{kube_cluster_name:prod,service:shop-http-server}
      target: 0.9
      # A minute is good in terms of availability if the service was available throughout the entire minute
      timeSliceTarget: 1
    - name: error-rate
      # We expect the error rate to be less than 1%
      op: lt
      value: 0.01
      rawMetric:
        query:
          datadog:
            # We measure the error rate as the ratio of responses with 5xx errors to all responses
            query: >-
              sum:trace.http.request.hits.by_http_status{cluster_name:prod
              AND service:shop-http-server AND http.status_class IN (5xx)}.as_count()
              / sum:trace.http.request.hits.by_http_status{cluster_name:prod,service:shop-http-server}.as_count()
      target: 0.9
      # We consider a minute good if the error rate stayed below 1% throughout the entire minute
      timeSliceTarget: 1
    - name: latency
      # We expect requests to be served in less than 200 ms
      op: lt
      value: 0.2
      rawMetric:
        query:
          datadog:
            query: avg:trace.http.request.duration{cluster_name:prod,service:shop-http-server}
      target: 0.9
      timeSliceTarget: 0.95
  service: shop-http-server
  # Our team works in one-week iterations. We set the time window to 2 weeks because that gives us enough time to address any reliability issues. We keep the error budget tight to reflect the service's importance to our business.
  timeWindows:
    - count: 14
      unit: Day
      isRolling: true
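
To get a feel for what the composite target means in practice, the budget for this window can be worked out by hand. The sketch below assumes that, with the Timeslices method, the error budget is the share of minutes in the window that are allowed to be bad; the function names and the figure of 300 bad minutes are hypothetical.

```python
def error_budget_minutes(window_days: int, target: float) -> float:
    """Minutes in the window that may be bad before the target is missed."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1")
    total_minutes = window_days * 24 * 60
    # Round to absorb floating-point noise from (1.0 - target).
    return round(total_minutes * (1.0 - target), 6)


def remaining_budget(window_days: int, target: float, bad_minutes: int) -> float:
    """Budget left after subtracting the bad minutes observed so far."""
    return error_budget_minutes(window_days, target) - bad_minutes


budget = error_budget_minutes(14, 0.9)
left = remaining_budget(14, 0.9, 300)  # 300 bad minutes is a made-up figure
print(budget)  # 2016.0
print(left)    # 1716.0
```

With a 14-day rolling window and a 0.9 target, the composite can absorb 2016 bad minutes (about 33.6 hours) before the budget is exhausted.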

Example 2: Combining SLOs for different services to get the reliability of a complex data pipeline

We have a simple data processing pipeline that consists of three services that live in our infrastructure.

Image 1: Data processing pipeline
  1. The data producers send data as HTTP requests to the data-intake service.

  2. The data-intake service posts received data as events in an external queue.

  3. The data-processor consumes events from the queue, transforms their payload, and saves the output to a database.

  4. The data-server queries the transformed data in the database and returns results to consumers.

We want to measure the overall availability of the pipeline. We assume the data processing pipeline is available when all its constituents are available. Only the data-intake, data-processor, and data-server services live in our infrastructure, so only their reliability issues are actionable by our team.

We create a composite SLO that aggregates the availability of individual services in the pipeline:

apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: data-pipeline-availability
  project: production
spec:
  alertPolicies: []
  budgetingMethod: Occurrences
  composite:
    burnRateCondition:
      op: gt
      value: 1
    # The composite's error budget is bigger than the budgets of the individual objectives
    target: 0.9
  description: ""
  indicator:
    metricSource:
      kind: Agent
      name: my-aws-prometheus
      project: default
  objectives:
    # The data-intake service receives data sent as HTTP requests and posts it to an underlying queueing system. Its availability is measured as the ratio of successful requests to all received requests.
    - countMetrics:
        good:
          amazonPrometheus:
            promql: sum(http_requests_total{code=~"2..",env="prod",service="data-intake"})
        incremental: false
        total:
          amazonPrometheus:
            promql: sum(http_requests_total{env="prod",service="data-intake"})
      name: intake
      target: 0.95
      value: 1
    # The data-processor service consumes events from an underlying queue, transforms the data in each event's payload, and stores the results in permanent storage. Its availability is measured as the ratio of stored events to received events. It exposes custom metrics that are scraped by AWS Managed Prometheus.
    - countMetrics:
        good:
          amazonPrometheus:
            promql: sum(stored_events{env="prod",service="data-processor"})
        incremental: false
        total:
          amazonPrometheus:
            promql: sum(received_events{env="prod",service="data-processor"})
      name: processor
      target: 0.95
      value: 2
    # Our data-server runs queries on the transformed data and sends results back. Its availability is measured as the ratio of successful requests to all received requests.
    - countMetrics:
        good:
          amazonPrometheus:
            promql: sum(http_requests_total{code=~"2..",env="prod",service="data-server"})
        incremental: false
        total:
          amazonPrometheus:
            promql: sum(http_requests_total{env="prod",service="data-server"})
      name: server
      target: 0.95
      value: 3
  service: my-service
  timeWindows:
    - count: 28
      isRolling: true
      unit: Day
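
Each objective above measures availability as a ratio of good events to total events and compares it with a 0.95 target. The following sketch shows that comparison with made-up counts. The `availability` helper, the sample numbers, and the choice to treat an empty window as fully available are assumptions for illustration only.

```python
from typing import Dict, Tuple


def availability(good: float, total: float) -> float:
    """Ratio of good events to all events; an empty window counts as available (assumption)."""
    return 1.0 if total == 0 else good / total


# Hypothetical (good, total) event counts per objective over the time window.
counts: Dict[str, Tuple[int, int]] = {
    "intake": (9_800, 10_000),
    "processor": (9_400, 10_000),
    "server": (9_990, 10_000),
}

TARGET = 0.95  # per-objective target from the SLO definition

ratios = {name: availability(good, total) for name, (good, total) in counts.items()}
below_target = [name for name, ratio in ratios.items() if ratio < TARGET]
print(below_target)  # ['processor']
```

Here only the processor objective misses its target, which points the team at the data-processor service rather than at the pipeline as a whole.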