
Replay
Beta


Replay (currently in beta) lets you retrieve historical SLI data and recalculate your SLO error budgets. Use it when your SLI source data is missing or corrupt, or when you want to create a new SLO with historical data.

You can also leverage Replay to backfill your SLO reporting: if you have a backlog of SLI data from the last few days or even weeks, Replay will allow you to fetch that data and use it to recalculate your remaining error budget.

TIPS

With Replay, you can access your historical data minutes after creating an SLO. This allows you to draw conclusions and make adjustments to your metrics much earlier.

Replay pulls in the historical data while your SLO starts collecting new data in real time. The historical and current data are merged, producing an error budget calculated for the entire period.
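To make the merge concrete, here is a minimal sketch of an error budget computed over historical and live data together. It assumes a ratio metric where each point is a (good, total) pair and a simple occurrences-style budget; the function name and the formula are illustrative assumptions, not Nobl9's actual calculation.

```python
def remaining_error_budget(points, target):
    """Return the fraction of error budget left for (good, total) points.

    Hypothetical helper: assumes an occurrences-style budget where
    (1 - target) of all events are allowed to be bad.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    good = sum(g for g, _ in points)
    total = sum(t for _, t in points)
    if total == 0:
        return 1.0  # no events yet, so no budget has been burned
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


# Historical points fetched by Replay and live points collected meanwhile
# are concatenated, so the budget covers the entire period.
historical = [(990, 1000), (995, 1000)]
live = [(998, 1000)]
budget = remaining_error_budget(historical + live, target=0.99)
```

With these sample numbers, 17 of 3000 events are bad against an allowance of 30, so a bit more than 40% of the budget remains.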

Scope of support

Currently, the following integrations support Replay (see the requirements table):

  • Amazon CloudWatch
  • AMS Prometheus
  • AppDynamics
  • Azure Monitor (beta)
  • Azure Monitor managed service for Prometheus (beta)
  • Datadog
  • Dynatrace
  • Google Cloud Monitoring
  • Graphite
  • LogicMonitor (beta)
  • New Relic
  • Prometheus
  • ServiceNow Cloud Observability
  • Splunk

Requirements

To use Replay for specific data sources, you may need to update the version of the Nobl9 agent you use. See the table below to determine the minimum agent version required to use this feature:

Source | Replay support | Agent version | Direct support | Max period for historical data retrieval
--- | --- | --- | --- | ---
Amazon CloudWatch | | 0.60.0 | | 15 days [1]
AMS Prometheus | | 0.55.0 | | 30 days
AppDynamics | | 0.68.0 | | 30 days
Azure Monitor (beta) | | 0.69.0-beta01 | | 30 days
Azure Monitor managed service for Prometheus (beta) | | 0.78.0-beta | | 30 days
Datadog | | 0.54.2 | | 30 days
Dynatrace | | 0.66.0 | | 28 days [2]
Google Cloud Monitoring | | 0.79.0-beta | | 30 days [3]
Graphite | | 0.55.0 | | 30 days
LogicMonitor (beta) | | 0.81.0-beta | | 30 days
New Relic | | 0.56.0 | | 30 days
Prometheus | | 0.54.2 | | 30 days
ServiceNow Cloud Observability | | 0.56.0 | | 30 days
Splunk | | 0.55.0; 0.82.0-beta for single-query ratio metrics | | 30 days

[1] Replay for CloudWatch supports only configuration queries.
[2] When you run Replay for the maximum historical data retrieval period for Dynatrace (28 days), Dynatrace limitations may cause one hour of degraded resolution at the beginning of the selected time range.
[3] While the average historical data period for Google Cloud Monitoring is 28 days, it can be shorter for some metrics, depending on their data retention period. Learn more about Google data retention
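A minimum-version check against the requirements table can be sketched as follows. The helper names are hypothetical, and the comparison deliberately looks only at the numeric major.minor.patch part; ignoring the -beta suffix is a simplifying assumption, not a documented Nobl9 rule.

```python
import re

# Minimum agent versions copied from the requirements table.
MIN_AGENT_VERSION = {
    "Amazon CloudWatch": "0.60.0",
    "AMS Prometheus": "0.55.0",
    "AppDynamics": "0.68.0",
    "Azure Monitor": "0.69.0-beta01",
    "Azure Monitor managed service for Prometheus": "0.78.0-beta",
    "Datadog": "0.54.2",
    "Dynatrace": "0.66.0",
    "Google Cloud Monitoring": "0.79.0-beta",
    "Graphite": "0.55.0",
    "LogicMonitor": "0.81.0-beta",
    "New Relic": "0.56.0",
    "Prometheus": "0.54.2",
    "ServiceNow Cloud Observability": "0.56.0",
    "Splunk": "0.55.0",
}


def parse_version(version):
    """Return (major, minor, patch); any suffix such as -beta01 is ignored."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def agent_meets_minimum(source, agent_version):
    """True if agent_version is at or above the table's minimum for source."""
    return parse_version(agent_version) >= parse_version(MIN_AGENT_VERSION[source])
```

For example, `agent_meets_minimum("Datadog", "0.54.1")` is false, because the table lists 0.54.2 as the minimum for Datadog.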

Create Replay

To activate Replay for the supported data sources, define the following:

  1. Maximum Period for Historical Data Retrieval, which corresponds to the historicalDataRetrieval.maxDuration object in YAML.

    • The object defines the maximum period for which data can be retrieved:

      • value must be an integer greater than or equal to 0

      • unit must be one of Minute, Hour, or Day

      • It must be a duration that is less than or equal to 30 days

      • It must be a duration that is greater than or equal to the value for the default period (see below). Otherwise, a validation error is returned

  2. Default Period for Historical Data Retrieval, which corresponds to the historicalDataRetrieval.defaultDuration object in YAML.

    • This period will be used by default for any SLOs connected to this data source. This field has the following requirements:

      • value must be an integer greater than or equal to 0

      • unit must be one of Minute, Hour, or Day

      • It must be a duration that is less than or equal to the value for the maximum period. Otherwise, a validation error is returned

      • By default, this value is set to 0. If you set it to >0, you will create an SLO with Replay
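The rules for both fields can be summarized in a short validation sketch. The class and function names are hypothetical and the error messages are invented for illustration; only the rules themselves come from the list above.

```python
from dataclasses import dataclass

# Minutes per unit; the unit names are the accepted values listed above.
_MINUTES = {"Minute": 1, "Hour": 60, "Day": 24 * 60}
_LIMIT_MINUTES = 30 * _MINUTES["Day"]  # 30-day ceiling for maxDuration


@dataclass(frozen=True)
class Duration:
    value: int
    unit: str

    def minutes(self):
        return self.value * _MINUTES[self.unit]


def validate_historical_data_retrieval(max_duration, default_duration):
    """Return a list of rule violations; an empty list means the pair is valid."""
    errors = []
    for name, d in (("maxDuration", max_duration), ("defaultDuration", default_duration)):
        if isinstance(d.value, bool) or not isinstance(d.value, int) or d.value < 0:
            errors.append(f"{name}.value must be an integer >= 0")
        if d.unit not in _MINUTES:
            errors.append(f"{name}.unit must be one of Minute, Hour, Day")
    if errors:
        return errors  # durations cannot be compared if the fields are invalid
    if max_duration.minutes() > _LIMIT_MINUTES:
        errors.append("maxDuration must not exceed 30 days")
    if default_duration.minutes() > max_duration.minutes():
        errors.append("defaultDuration must not exceed maxDuration")
    return errors
```

Converting both durations to minutes before comparing means that a default of 1 Day is correctly rejected against a maximum of 12 Hour, even though 1 is less than 12.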

tip

You can configure these fields in the UI or in YAML when you set up the data source.

To activate Replay for an SLO, you must complete the following two steps:

  1. Step 1: Configure and create the agent/direct using the data source configuration wizard or apply the YAML via sloctl.
  2. Step 2: Configure and create Replay for the SLO in the SLO wizard.

Step 1: Create agent/direct

Replay configuration in YAML

Data sources that support Replay accept an additional object called historicalDataRetrieval in their YAML definition (see Configuring Replay above for an extended description of the field values). Use the sloctl apply --replay or sloctl replay commands to run Replay via sloctl:

- apiVersion: n9/v1alpha
  kind: Agent
  metadata:
    name: datadog
    project: datadog
  spec:
    datadog:
      site: com
    sourceOf:
      - Metrics
      - Services
    # Additional fields related to Replay
    historicalDataRetrieval:
      maxDuration:
        value: 30 # integer greater than or equal to 0
        unit: Day # accepted values: Minute, Hour, Day
      defaultDuration: # value must be less than or equal to value of maxDuration
        value: 0 # integer greater than or equal to 0; defaults to 0
        unit: Day # accepted values: Minute, Hour, Day

If the historicalDataRetrieval section is omitted when configuring a data source that supports Replay, the following values are used as defaults:

historicalDataRetrieval:
  maxDuration:
    value: 0
    unit: Day
  defaultDuration:
    value: 0
    unit: Day

These default values are also used for data sources that support Replay that were configured before the Replay feature was activated.

caution
  • historicalDataRetrieval can't be used for data sources that don't support Replay. Adding it will result in a validation error.

Configuring Replay in the UI

You can find the values for Replay in the Advanced Settings section of the data source configuration wizard (direct and agent):

Image 1: Configuring Replay on the data source level

Step 2: Create SLO

Configuring Replay in the SLO wizard

When you start the SLO wizard and pick a data source that supports Replay, an additional field is displayed in step 2 of the wizard:

Image 2: Configuring Replay on the SLO wizard level

The Period for Historical Data Retrieval field defines the period that will be used by the SLO:

  • The value displayed is the Default Period for Historical Data Retrieval that you specified when setting up the data source.

  • You can override this value, but you can't exceed the Maximum Period for Historical Data Retrieval specified for this data source. Be aware that a longer period might increase the loading time of your SLO.

  • The value must be a positive integer or 0.

warning

Beta limitation for SLOs using Replay

The Period for Historical Data Retrieval field does not have a corresponding field in the YAML used to define an SLO.

This field won't be displayed in the UI if the selected data source doesn't support Replay.

By default, the Period for Historical Data Retrieval field is set to the value of historicalDataRetrieval.defaultDuration. This parameter's default value is 0, but it can be set to any duration between 0 and historicalDataRetrieval.maxDuration.

User experience

While historical data is being retrieved, you will notice a few things in the UI:

  1. Charts for the SLO for which data is being retrieved will not be visible in the grid view until the processing of the data is complete:
Image 3: Loading historical data in the UI
  2. Charts for the SLO will also not be visible in the SLO details view during this time:
Image 4: Loading historical data on the SLO details level
tip

You will see the updated charts after historical data retrieval is finished.

Restrictions for Replay

Data downsampling

caution

It is important to understand how a given data source alters data from the past. Metric gathering systems usually downsample older data to save space using different aggregate functions like mean or sum or simply by dropping data points. This can affect the result of a query made against a time range in the further past. Consult the documentation of the specific data source for more details.

Limits per organization

Replay and SLI Analyzer share the same mechanism for fetching data. Each organization can fetch data for at most two SLOs at the same time.

So, if both allocated slots are busy, the options to replay an SLO or import data for SLI analysis are inactive until at least one ongoing process is finished.

Job Status widget

You can track the progress of ongoing replays and SLI analyses and check for free slots with the Job Status widget.

To access it, click the icon at the top right of the Nobl9 Web:

Image 5: Job Status widget

Assumptions for Job Status widget:

  • The widget displays the 3 most recent replays and analyses (the limit for concurrent replays or analyses +1).

  • When you run an analysis for a completed import job, it'll immediately disappear from the jobs list. This ensures that all recently triggered data import jobs are visible on the widget.

  • All jobs are sorted by status (the in progress status always takes precedence) and last triggered date, with the most recent date displayed at the top.

  • The list may not update as expected if you run a reimport process on an SLO listed on the widget.

    This is because reimport updates an existing record in the database and does not create a new one. For example, if you see 3 replays in the widget:

    Replay1
    Replay2
    Replay3

    If you run a reimport for Replay2, you'll see these processes displayed in the following order on the widget:
    Replay2
    Replay1
    Replay3
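The ordering rules above can be sketched as a single sort. The Job class and its field names are hypothetical; the sketch only models the described behavior, where a reimport updates the existing record's trigger time instead of adding a new record.

```python
from dataclasses import dataclass
from datetime import datetime

WIDGET_SIZE = 3  # concurrent job limit (2) + 1, as stated above


@dataclass
class Job:
    name: str
    in_progress: bool
    last_triggered: datetime


def widget_jobs(jobs):
    """In-progress jobs first, then most recently triggered; keep the top 3."""
    ordered = sorted(jobs, key=lambda j: (j.in_progress, j.last_triggered), reverse=True)
    return ordered[:WIDGET_SIZE]


jobs = [
    Job("Replay1", False, datetime(2023, 1, 1, 12, 0)),
    Job("Replay2", False, datetime(2023, 1, 1, 11, 0)),
    Job("Replay3", False, datetime(2023, 1, 1, 10, 0)),
]

# A reimport of Replay2 updates the existing record in place.
jobs[1].in_progress = True
jobs[1].last_triggered = datetime(2023, 1, 1, 13, 0)
names_after_reimport = [job.name for job in widget_jobs(jobs)]
```

After the in-place update, Replay2 moves to the top while Replay1 and Replay3 keep their relative order, which matches the example.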

Canceling a running Replay

You can't stop or cancel an ongoing Replay process. Wait until it is done. The time it takes depends on the period configured.

Editing an SLO with an ongoing Replay

Editing an SLO with an ongoing Replay can have different consequences depending on what is being edited.

Generally, editing doesn't immediately affect the background process. The initially launched Replay will be completed for a snapshot of the SLO at the time of its creation, and the results of that retrieval may or may not be shown for the edited SLO.

The result depends on what fields you edit:

  1. Adding a new objective:

    1. Replaying will be completed for the original objective(s).

    2. An error budget taking into account the historical period will not be calculated for the new objective. Error budget calculation for this objective will begin at the time of its creation.

  2. Removing an existing objective:

    1. Replay results will be abandoned for that objective.

    2. Removing an existing objective doesn't stop the background process of fetching metrics for it. The results are similar to deleting an entire SLO.

  3. Editing an existing objective's value or target:

    1. The edited objective will be treated as a new one, with the results described above.

    2. The original objective will be treated as though it has been removed, again with the results described above.

  4. Modifying the query or data source:

    1. The SLO will be replayed according to the original query and data source.

    2. Error budget calculations will be based on the new query and data source, starting from the moment of editing.

tip

For more details on editing SLOs, see the Editing SLOs guide.

Replay and composite SLOs 1.0

Creating an SLO with Replay is mutually exclusive with configuring an SLO as a composite SLO 1.0. When you create a composite SLO 1.0, you will see the following message in the UI:

Image 6: Replay and composite SLOs in the UI

Suppose you’ve created an SLO with Replay activated, and the Replay process is running. Turning this SLO into a composite SLO 1.0 results in the following:

  1. Replay will continue for the original objectives in that SLO.

  2. The composite objective won't include data from the historical period in its error budget calculation. The calculation will start from the moment of the creation of the SLO.

Replay and composite SLOs 2.0

Currently, you can't replay a composite SLO, but only its components.

Replaying components of a composite causes no retroactive changes to the composite data. The replayed component stops reporting data until the process is complete. It is treated according to your maxDelay and, if longer, whenDelayed settings. The overall composite error budget calculations depend on the duration of the Replay process, the component's maxDelay settings, and the existence of components without a delay in the composite.

Non-delayed components? | Replay vs. maxDelay | Result
--- | --- | ---
Yes | Replay < maxDelay | The composite pauses for the duration of Replay. Component's data collected after replaying is considered in calculations as usual.
Yes | Replay > maxDelay | Component's data is considered in calculations according to whenDelayed. Data delayed for the time surplus (once maxDelay ends) is calculated as usual.
No | Any ratio | The composite pauses for the duration of Replay. Upon replaying, the component's data fills the no-data gap.
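The three cases can be read as a small decision function. The enum and function names below are hypothetical. The table does not say what happens when the Replay duration equals maxDelay, so the sketch groups that boundary with the Replay > maxDelay case as an arbitrary assumption.

```python
from enum import Enum


class CompositeEffect(Enum):
    PAUSE = "composite pauses for the duration of Replay"
    WHEN_DELAYED = "component is handled according to whenDelayed"
    PAUSE_AND_BACKFILL = "composite pauses; replayed data fills the no-data gap"


def composite_effect(has_non_delayed_components, replay_minutes, max_delay_minutes):
    """Map one replayed component's situation to the outcome in the table."""
    if not has_non_delayed_components:
        return CompositeEffect.PAUSE_AND_BACKFILL  # "Any ratio" row
    if replay_minutes < max_delay_minutes:
        return CompositeEffect.PAUSE
    # Assumption: equality is treated like Replay > maxDelay.
    return CompositeEffect.WHEN_DELAYED
```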

Running Replay for existing SLOs

caution

Running Replay for existing SLOs is irreversible.

Running Replay can also affect your existing SLOs.

Running Replay: user experience

Running Replay in the UI

To run Replay for an existing SLO:

  1. Go to the SLO Details tab of the SLO for which you want to run Replay.
  2. In the More actions dropdown menu, select Run Replay:
Image 7: Reimport button in the SLO details tab
caution

Remember that if you want to run Replay, the Maximum period for historical data retrieval configured for your data source must be set to >0.

Otherwise, the Run Replay option will be inactive.

Running Replay via sloctl

You can also run Replay using the sloctl replay command. Refer to the sloctl user guide for more details.

important

Duration of the Replay process

The Replay process for a single SLO may take up to an hour depending on:

  • The length of the reimported period
  • The number of objectives in your SLO
  • The number of unique queries used in your SLO

Impact of Replay process on SLOs

Running Replay for an existing SLO has important consequences for SLI data and alerts.

Impact on SLI data

  • During the Replay process, live data is still gathered, but it is included in the SLO only after the reimport has completed.

  • Replay will query the data source again for the entire selected historical period. These results will completely replace SLI data already gathered for the same period.

  • Data resolution might be lower due to the downsampling of historical data depending on the data source you use. Because of that, the SLI chart might look different after the reimport process has been completed, even if it was run for the same query.

  • Replay won't fill gaps in the reimported data with the original data. If the reimported data has a gap, that gap replaces the original data for the same time range, as in the example below:

    • Original input SLI data:
    2023-01-01 01:20:00 = 100
    2023-01-01 01:21:00 = 230
    2023-01-01 01:22:00 = 270
    2023-01-01 01:24:00 = 220
    2023-01-01 01:25:00 = 130
    2023-01-01 01:26:00 = 280
    2023-01-01 01:27:00 = 200
    • Reimported SLI data:
    2023-01-01 01:20:00 = 100
    2023-01-01 01:21:00 = 230
    [...] # Gap in the data stream
    2023-01-01 01:28:00 = 90
    2023-01-01 01:29:00 = 220
    2023-01-01 01:30:00 = 270
    2023-01-01 01:31:00 = 190
    • SLI data after the reimport process is completed:
    2023-01-01 01:20:00 = 100
    2023-01-01 01:21:00 = 230
    [...] # Gap in the data stream
    2023-01-01 01:28:00 = 90
    2023-01-01 01:29:00 = 220
    2023-01-01 01:30:00 = 270
    2023-01-01 01:31:00 = 190
    • This can happen when the retention period of the data source is shorter than the period selected for Replay.
    • To avoid this, always set the Maximum Period for Historical Data Retrieval to a value equal to or lower than the data source's retention period.
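The replacement behavior in the example can be modeled in a few lines. The function name and the dict-based data layout are assumptions made for illustration. The point is that original values inside the replayed window are dropped even where the reimported data has a gap.

```python
from datetime import datetime


def apply_replay(original, reimported, start, end):
    """Replace all original points in [start, end] with the reimported points.

    Both inputs map datetime -> value. Original points inside the window are
    discarded, so gaps in the reimported data are not filled from the original.
    """
    outside = {ts: v for ts, v in original.items() if not (start <= ts <= end)}
    inside = {ts: v for ts, v in reimported.items() if start <= ts <= end}
    return dict(sorted({**outside, **inside}.items()))


def t(minute):
    # Helper for the sample timestamps on 2023-01-01 between 01:00 and 01:59.
    return datetime(2023, 1, 1, 1, minute)


original = {t(20): 100, t(21): 230, t(22): 270, t(24): 220,
            t(25): 130, t(26): 280, t(27): 200}
reimported = {t(20): 100, t(21): 230, t(28): 90, t(29): 220,
              t(30): 270, t(31): 190}

result = apply_replay(original, reimported, start=t(20), end=t(31))
```

Here `result` equals `reimported`: the original points from 01:22 to 01:27 are gone because they fall inside the replayed window.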

Impact on alerts

  • You won't receive any alerts from that SLO during the reimport process.
  • After Replay is done, you won't receive alerts for the reimported historical period that was recalculated.
  • After reimporting, you might receive alerts that were missed while Replay was running. These alerts will be triggered based on recalculated data.

Replay—API rate limits

Source | Historical data pulled by 1 API request (in hours)
--- | ---
Amazon CloudWatch | 24
AMS Prometheus | 24
Datadog | 4
Dynatrace | 12
New Relic | 80 (minutes)
ServiceNow Cloud Observability | 24
Prometheus | 24
Splunk | 24

These requests count toward the data source’s API rate limit, together with the requests used to fetch current SLI data (see here for details on Datadog’s rate limiting). Exceeding your rate limit will cause delays in fetching SLI data and prolong the historical data retrieval process.
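For a rough sense of scale, the table values can be turned into an estimate of how many requests one Replay run issues. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes that requests tile the period without overlap and that each unique query is fetched separately.

```python
import math

# Minutes of historical data covered by one API request, from the table above.
MINUTES_PER_REQUEST = {
    "Amazon CloudWatch": 24 * 60,
    "AMS Prometheus": 24 * 60,
    "Datadog": 4 * 60,
    "Dynatrace": 12 * 60,
    "New Relic": 80,
    "ServiceNow Cloud Observability": 24 * 60,
    "Prometheus": 24 * 60,
    "Splunk": 24 * 60,
}


def estimated_requests(source, period_days, unique_queries=1):
    """Estimate the number of API requests needed to replay period_days of data."""
    period_minutes = period_days * 24 * 60
    per_query = math.ceil(period_minutes / MINUTES_PER_REQUEST[source])
    # Assumption: each unique query needs its own series of requests.
    return per_query * unique_queries
```

Under these assumptions, replaying 30 days from Datadog takes about 180 requests per query, while the same period from Prometheus takes about 30.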

Useful links

For a more in-depth look, consult additional resources: