Use cases of composite SLOs
Example 1: Combining different SLOs for a service to get an overall health score
Our shopping service is a user-facing HTTP server that lets users make purchases. It is critical for our business that this service stays available and performant, so we want to observe its overall reliability: availability, latency, and error rate. We choose the Timeslices budgeting method.
For our service to be considered reliable enough in a given minute, it must be available and handle requests quickly and without errors. If any of those conditions are unmet, we want to subtract that minute from our error budget.
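The per-minute check described above can be sketched in a few lines. This is only an illustration of the time-slice idea with made-up measurements; it is not how Nobl9 evaluates slices internally:

```python
# A minute is "good" only if all three conditions hold at once:
# at least 1 replica available, error rate below 1%, latency below 200 ms.
def minute_is_good(replicas_available: int, error_rate: float, latency_s: float) -> bool:
    return replicas_available >= 1 and error_rate < 0.01 and latency_s < 0.2

# Sample per-minute measurements (invented): (replicas, error_rate, latency in seconds)
minutes = [
    (3, 0.001, 0.15),  # good
    (0, 0.000, 0.10),  # bad: no replicas available
    (2, 0.020, 0.12),  # bad: error rate >= 1%
    (2, 0.005, 0.18),  # good
]

good = sum(minute_is_good(*m) for m in minutes)
reliability = good / len(minutes)
print(f"{good}/{len(minutes)} good minutes, score {reliability:.2f}")  # 2/4 good minutes, score 0.50
```

Every bad minute is subtracted from the error budget, regardless of which condition failed.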
We are creating a composite SLO that represents the above reliability requirements:
```yaml
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: shop-http-server-overall-health
  project: production
spec:
  alertPolicies: []
  budgetingMethod: Timeslices
  composite:
    target: 0.9
  description: |-
    This composite SLO combines 3 different SLOs of the same service: latency, availability, and error rate
    into a single overall health score.
  indicator:
    metricSource:
      kind: Direct
      name: datadog-prod
      project: default
  objectives:
    # shop-http-server is deployed in a Kubernetes cluster with several replicas.
    # For this service to be available, at least one replica needs to be available.
    - name: availability
      # We expect at least 1 replica to be available
      op: gte
      value: 1
      rawMetric:
        query:
          datadog:
            # We get information about available replicas from metrics exposed by the Kubernetes cluster:
            query: max:kubernetes_state.deployment.replicas_available{kube_cluster_name:prod,service:shop-http-server}
      target: 0.9
      # A minute is good in terms of availability if the service was available throughout the entire minute
      timeSliceTarget: 1
    - name: error-rate
      # We expect the error rate to be less than 1%
      op: lt
      value: 0.01
      rawMetric:
        query:
          datadog:
            # We measure the error rate as the ratio of responses with 5xx errors to all responses
            query: sum:trace.http.request.hits.by_http_status{cluster_name:prod
              AND service:shop-http-server AND http.status_class IN (5xx)}.as_count()
              / sum:trace.http.request.hits.by_http_status{cluster_name:prod,service:shop-http-server}.as_count()
      target: 0.9
      # A minute is good if the error rate stayed below 1% throughout the entire minute
      timeSliceTarget: 1
    - name: latency
      # We expect requests to be served in less than 200 ms
      op: lt
      value: 0.2
      rawMetric:
        query:
          datadog:
            query: avg:trace.http.request.duration{cluster_name:prod,service:shop-http-server}
      target: 0.9
      timeSliceTarget: 0.95
  service: shop-http-server
  timeWindows:
    # Our team works in one-week iterations. We set the time window to 2 weeks because that gives us
    # enough time to address any reliability issues. We keep the error budget tight, reflecting the
    # service's importance to our business.
    - count: 14
      unit: Day
      isRolling: true
```
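A quick back-of-the-envelope sketch of what this time window and target imply, assuming one-minute time slices as described above:

```python
# Rolling 14-day window, Timeslices method, composite target of 0.9.
window_minutes = 14 * 24 * 60                      # 20,160 one-minute slices
target = 0.9
budget_minutes = round(window_minutes * (1 - target))
print(window_minutes, budget_minutes)              # 20160 2016
```

In other words, up to 2,016 minutes in any 14-day window may fail the composite's conditions before the SLO is breached.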
Example 2: Combining SLOs for different services to get the reliability of a complex data pipeline
We have a simple data processing pipeline that consists of three services that live in our infrastructure.
- The data producers send data as HTTP requests to the `data-intake` service.
- The `data-intake` service posts received data as events in an external queue.
- The `data-processor` consumes events from the queue, transforms their payload, and saves the output to a database.
- The `data-server` queries the transformed data in the database and returns results to consumers.
We want to measure the overall availability of the pipeline. We assume the data processing pipeline is available when all its constituents are available. Only the `data-intake`, `data-processor`, and `data-server` services live in our infrastructure, and only their reliability issues are actionable by our team.
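A minimal sketch of the good/total arithmetic behind the Occurrences method used here; the event counts are invented for illustration, and this is not how Nobl9 combines composite objectives internally:

```python
# Hypothetical event counts per service: (good events, total events).
# good/total mirrors the countMetrics queries in the SLO spec.
targets = {"data-intake": 0.95, "data-processor": 0.95, "data-server": 0.95}
counts = {
    "data-intake": (9_800, 10_000),
    "data-processor": (9_700, 10_000),
    "data-server": (9_990, 10_000),
}

availability = {svc: good / total for svc, (good, total) in counts.items()}
# The pipeline is considered available only if every constituent meets its target.
pipeline_ok = all(availability[svc] >= targets[svc] for svc in targets)
print(availability, pipeline_ok)
```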
We create a composite SLO that aggregates the availability of individual services in the pipeline:
```yaml
apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: data-pipeline-availability
  project: production
spec:
  alertPolicies: []
  budgetingMethod: Occurrences
  composite:
    burnRateCondition:
      op: gt
      value: 1
    # The composite's error budget is larger than the budgets of the individual objectives
    target: 0.9
  description: ""
  indicator:
    metricSource:
      kind: Agent
      name: my-aws-prometheus
      project: default
  objectives:
    # The data-intake service receives data sent as HTTP requests and posts them to an underlying
    # queueing system. Its availability is measured as the ratio of successful requests to all received requests.
    - countMetrics:
        good:
          amazonPrometheus:
            promql: sum(http_requests_total{code="2*",env="prod",service="data-intake"})
        incremental: false
        total:
          amazonPrometheus:
            promql: sum(http_requests_total{env="prod",service="data-intake"})
      name: intake
      target: 0.95
      value: 1
    # The data-processor service consumes events from an underlying queue, transforms the data in each
    # event's payload, and stores the results in permanent storage. Its availability is measured as the
    # ratio of stored events to received events. It exposes custom metrics that are scraped by AWS Managed Prometheus.
    - countMetrics:
        good:
          amazonPrometheus:
            promql: sum(stored_events{env="prod",service="data-processor"})
        incremental: false
        total:
          amazonPrometheus:
            promql: sum(received_events{env="prod",service="data-processor"})
      name: processor
      target: 0.95
      value: 2
    # Our data-server runs queries on the transformed data and sends results back. Its availability is
    # measured as the ratio of successful requests to all received requests.
    - countMetrics:
        good:
          amazonPrometheus:
            promql: sum(http_requests_total{code="2*",env="prod",service="data-server"})
        incremental: false
        total:
          amazonPrometheus:
            promql: sum(http_requests_total{env="prod",service="data-server"})
      name: server
      target: 0.95
      value: 3
  service: my-service
  timeWindows:
    - count: 28
      isRolling: true
      unit: Day
```
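The `burnRateCondition` in this spec compares the budget burn rate against 1. A sketch of the standard burn-rate arithmetic, with event counts invented for illustration:

```python
# Burn rate = observed error rate / error rate the target allows.
# A burn rate of 1 means the budget is spent exactly over the window;
# the condition (op: gt, value: 1) flags budgets burning faster than that.
def burn_rate(good: int, total: int, target: float) -> float:
    observed_error_rate = (total - good) / total
    allowed_error_rate = 1 - target
    return observed_error_rate / allowed_error_rate

# With the composite target of 0.9, 5% bad events burn at half speed:
print(round(burn_rate(9_500, 10_000, 0.9), 2))   # 0.5 -> condition not met
# ...while 20% bad events burn at double speed:
print(round(burn_rate(8_000, 10_000, 0.9), 2))   # 2.0 -> condition fires
```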