Replay
Beta

Replay (currently in beta) enables users to retrieve historical SLI data and recalculate their SLO error budgets. You can use this feature when your SLI source data is missing or corrupt or if you want to create a new SLO with historical data.

You can also leverage Replay to backfill your SLO reporting: if you have a backlog of SLI data from the last few days or even weeks, Replay will allow you to fetch that data and use it to recalculate your remaining error budget.

tip

With Replay, you can access your historical data minutes after creating an SLO. This allows you to draw conclusions and make adjustments to your metrics much earlier.

Replay pulls in the historical data while your SLO starts collecting new data in real time. The historical and current data are merged, producing an error budget calculated for the entire period.
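The merge described above can be pictured with a small sketch. It assumes a simplified, occurrences-style budget formula and hypothetical names (`Window`, `remaining_budget`); it is not Nobl9's actual calculation, only an illustration of how historical and live counts combine into one budget for the whole period.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Good and total event counts for one stretch of SLI data."""
    good: int
    total: int


def remaining_budget(windows: list[Window], target: float) -> float:
    """Fraction of the error budget left after merging all windows.

    Simplified formula (an assumption for illustration):
    allowed_bad = (1 - target) * total, remaining = 1 - bad / allowed_bad.
    """
    good = sum(w.good for w in windows)
    total = sum(w.total for w in windows)
    if total == 0:
        return 1.0  # no data yet, so nothing has been burned
    allowed_bad = (1 - target) * total
    bad = total - good
    return 1 - bad / allowed_bad


historical = Window(good=9_940, total=10_000)  # fetched by Replay (made-up numbers)
live = Window(good=995, total=1_000)           # collected since SLO creation (made-up numbers)
print(round(remaining_budget([historical, live], target=0.99), 3))
```

With only the live window, the budget would reflect 1,000 events; adding the historical window makes the same calculation cover all 11,000.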

Scope of Support

Currently, the following integrations support Replay:

  • Datadog (for both the Agent and Direct connection methods)
  • Prometheus (only for the Agent connection method)
  • AMS Prometheus (only for the Agent connection method)
  • Splunk (for both the Agent and Direct connection methods)
  • New Relic (for both the Agent and Direct connection methods)
  • Lightstep (for both the Agent and Direct connection methods)

Other Sources will be supported soon; see the table below for details.

note

The beta version of Replay can only be used when creating new SLOs. The ability to pull historical data for existing SLOs will be available soon.

Requirements

To use Replay for specific data sources, you may need to update the version of the Nobl9 Agent you use. See the table below to determine the minimum Agent version required to use this feature:

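| Source | Replay Support | Minimum Agent Version | Direct Support | Max Period for Historical Data Retrieval |
|---|---|---|---|---|
| AMS Prometheus | Y | 0.55.0 | N | 30 days |
| Datadog | Y | 0.54.2 | Y | 30 days |
| Graphite | Y | 0.55.0 | Y | 30 days |
| Lightstep | Y | 0.56.0 | Y | 30 days |
| New Relic | Y | 0.56.0 | Y | 30 days |
| Prometheus | Y | 0.54.2 | N | 30 days |
| Splunk | Y | 0.55.0 | Y | 30 days |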

Create Replay

There are two fields that you must define to enable Replay for data sources that support it:

  1. Maximum Period for Historical Data Retrieval, which corresponds to the historicalDataRetrieval.[n].maxDuration object in YAML.

    • The object defines the maximum period for which data can be retrieved:

      • value must be an integer greater than or equal to 0.

      • unit must be one of Minute, Hour, or Day.

      • Must be a duration that is less than or equal to 30 days.

      • Must be a duration that is greater than or equal to the value for the default period (see below). Otherwise, a validation error is returned.

  2. Default Period for Historical Data Retrieval, which corresponds to the historicalDataRetrieval.[n].defaultDuration object in YAML.

    • This period will be used by default for any SLOs connected to this data source. This field has the following requirements:

      • value must be an integer greater than or equal to 0.

      • unit must be one of Minute, Hour, or Day.

      • It must be a duration that is less than or equal to the value for the maximum period. Otherwise, a validation error is returned.
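The rules for the two fields can be checked together. The sketch below is a hypothetical client-side check (the `Duration` class, `validate` function, and error strings are illustrative assumptions, not Nobl9's validator); it only encodes the constraints listed above.

```python
from dataclasses import dataclass

# Minutes per accepted unit; the unit names come from the rules above.
UNIT_MINUTES = {"Minute": 1, "Hour": 60, "Day": 24 * 60}
MAX_ALLOWED_MINUTES = 30 * UNIT_MINUTES["Day"]  # 30-day ceiling


@dataclass(frozen=True)
class Duration:
    value: int
    unit: str

    def minutes(self) -> int:
        return self.value * UNIT_MINUTES[self.unit]


def validate(max_duration: Duration, default_duration: Duration) -> list[str]:
    """Return a list of rule violations (an empty list means the pair is valid)."""
    errors = []
    for name, d in (("maxDuration", max_duration), ("defaultDuration", default_duration)):
        if isinstance(d.value, bool) or not isinstance(d.value, int) or d.value < 0:
            errors.append(f"{name}.value must be an integer >= 0")
        if d.unit not in UNIT_MINUTES:
            errors.append(f"{name}.unit must be one of Minute, Hour, Day")
    if errors:
        return errors  # durations cannot be compared until the basics hold
    if max_duration.minutes() > MAX_ALLOWED_MINUTES:
        errors.append("maxDuration must not exceed 30 days")
    if default_duration.minutes() > max_duration.minutes():
        errors.append("defaultDuration must not exceed maxDuration")
    return errors


print(validate(Duration(30, "Day"), Duration(7, "Day")))
print(validate(Duration(1, "Day"), Duration(48, "Hour")))
```

Converting both durations to minutes before comparing them means mixed units, such as a 1 Day maximum with a 48 Hour default, are caught as well.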

tip

You can configure these fields in the UI or in YAML when you set up the data source.

To enable Replay for an SLO, you must complete the following two steps:

  1. Step 1:
    Configure and create the Agent/Direct using the data source configuration wizard or apply the YAML via sloctl.
  2. Step 2:
    Configure and create Replay for the SLO in the SLO Wizard.

Step 1: Create Agent/Direct

Replay Configuration in YAML

warning
  • Replay configuration via sloctl is not supported in beta. Use the UI to configure Replay for your Source.
  • In beta, you can only retrieve Replay configuration for your existing Sources through the sloctl get command.

Data sources that support Replay accept an additional object called historicalDataRetrieval in their YAML definition (see Configuring Replay above for an extended description of the field values):

    - apiVersion: n9/v1alpha
      kind: Agent
      metadata:
        name: datadog
        project: datadog
      spec:
        datadog:
          site: com
        sourceOf:
          - Metrics
          - Services
        # Additional fields related to Replay
        historicalDataRetrieval:
          maxDuration:
            value: 30 # integer greater than or equal to 0
            unit: Day # accepted values: Minute, Hour, Day
          defaultDuration: # value must be less than or equal to value of maxDuration
            value: 7 # integer greater than or equal to 0
            unit: Day # accepted values: Minute, Hour, Day

If the historicalDataRetrieval section is omitted when configuring a data source that supports Replay, the following values are used as defaults:

    historicalDataRetrieval:
      maxDuration:
        value: 0
        unit: Day
      defaultDuration:
        value: 0
        unit: Day

These default values are also used for data sources that support Replay that were configured before the Replay feature was enabled.
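As a rough illustration of this fallback, the sketch below treats a data source spec as a plain dictionary and returns the zero-valued defaults when the section is absent. The helper names are hypothetical, and reading a zero maximum as "no historical data can be requested" is an interpretation of the rules in this document rather than a documented behavior.

```python
import copy

# Defaults quoted from the section above.
DEFAULT_RETRIEVAL = {
    "maxDuration": {"value": 0, "unit": "Day"},
    "defaultDuration": {"value": 0, "unit": "Day"},
}


def effective_retrieval(spec: dict) -> dict:
    """Return the historicalDataRetrieval settings that apply to a data source spec."""
    configured = spec.get("historicalDataRetrieval")
    if configured is None:
        return copy.deepcopy(DEFAULT_RETRIEVAL)
    return configured


def replay_available(spec: dict) -> bool:
    """Assumption: Replay can fetch data only when the maximum period is above zero."""
    return effective_retrieval(spec)["maxDuration"]["value"] > 0


legacy_spec = {"datadog": {"site": "com"}}  # configured before Replay existed
print(effective_retrieval(legacy_spec))
print(replay_available(legacy_spec))
```

Under this reading, a data source created before Replay was enabled needs its maximum period raised before any SLO can request historical data from it.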

caution
  • historicalDataRetrieval can't be used for data sources that don't support Replay. Adding it will result in a validation error.

  • Replay can only be used for new SLOs created via the SLO Wizard. Currently, it cannot be enabled retroactively for existing SLOs.

Configuring Replay in the UI

You can find the values for Replay in the Advanced Settings section of the data source configuration wizard (Direct and Agent):

Image 1: Configuring Replay on the Data source level

Step 2: Create SLO

caution

You can only configure Replay for an SLO in the Nobl9 UI.

Configuring Replay in the SLO Wizard

When you start the SLO Wizard and pick a data source that supports Replay, an additional field is displayed in step 2 of the SLO Wizard:

Image 2: Configuring Replay on the SLO Wizard level

The Period for Historical Data Retrieval field defines the period that will be used by the SLO:

  • The value displayed is the Default Period for Historical Data Retrieval that you specified when setting up the data source.

  • You can override this value, but you can't exceed the Maximum Period for Historical Data Retrieval specified for this data source. Be aware that a longer period may increase the loading time of your SLO.

  • The value must be a positive integer or 0.

warning

Beta Limitation for SLOs Using Replay

The Period for Historical Data Retrieval field does not have a corresponding field in the YAML used to define an SLO.

This field won't be displayed in the UI if the selected data source doesn't support Replay.

By default, the Period for Historical Data Retrieval field is set to the value of historicalDataRetrieval.defaultDuration for the selected data source. It can be set to any duration between 0 and historicalDataRetrieval.maxDuration.
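The way the SLO-level period relates to the data source settings can be sketched as a small function. The name `resolve_slo_period` and its error messages are hypothetical, and all three numbers are assumed to be expressed in the same unit (for example, days).

```python
from typing import Optional


def resolve_slo_period(requested: Optional[int], default: int, maximum: int) -> int:
    """Pick the period an SLO would use, in one shared unit such as days."""
    if requested is None:
        return default  # field left untouched: the data source default applies
    if isinstance(requested, bool) or not isinstance(requested, int) or requested < 0:
        raise ValueError("period must be a positive integer or 0")
    if requested > maximum:
        raise ValueError(f"period must not exceed the data source maximum ({maximum})")
    return requested


print(resolve_slo_period(None, default=7, maximum=30))  # falls back to the default
print(resolve_slo_period(14, default=7, maximum=30))    # an override within the maximum
```

A value of 0 passes the check, which corresponds to creating the SLO without any historical data.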

User Experience

While historical data is being retrieved, you will notice a few things in the UI:

  1. Charts for the SLO for which data is being retrieved will not be visible in the Grid View until the processing of the data is complete:

    Image 3: Loading historical data in the UI

  2. Charts for the SLO will also not be visible in the SLO details view during this time:

    Image 4: Loading historical data on the SLO details level

tip

You will see the updated charts after historical data retrieval is finished.

Restrictions for Replay

Data Downsampling

caution

It is important to understand how a given data source alters data from the past. Metric-gathering systems usually downsample older data to save space, either by applying aggregate functions such as mean or sum or by simply dropping data points. This can affect the result of a query made against a time range further in the past. Consult the documentation of the specific data source for more details.
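A small made-up example shows why this matters for a threshold-based SLI. The numbers, the 200 ms threshold, and the mean-over-four-points downsampling are all assumptions chosen for illustration; real data sources apply their own retention and aggregation policies.

```python
from statistics import mean


def good_ratio(latencies_ms: list[float], threshold_ms: float) -> float:
    """Share of data points at or below the threshold."""
    good = sum(1 for v in latencies_ms if v <= threshold_ms)
    return good / len(latencies_ms)


def downsample_mean(points: list[float], bucket: int) -> list[float]:
    """Replace each run of `bucket` points with its mean."""
    return [mean(points[i:i + bucket]) for i in range(0, len(points), bucket)]


raw = [100, 120, 110, 900, 105, 115, 95, 130]  # one slow point in the first half
coarse = downsample_mean(raw, bucket=4)         # [307.5, 111.25]

print(good_ratio(raw, threshold_ms=200))     # 0.875
print(good_ratio(coarse, threshold_ms=200))  # 0.5
```

The single slow point drags the whole first bucket over the threshold, so the same period looks noticeably worse once it has been downsampled. With other aggregations or thresholds the distortion can go the other way and hide bad points instead.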

Limits per Organization

Historical data retrieval can only be performed for one SLO at a time per organization. If you attempt to create an SLO with Replay enabled while another SLO is retrieving historical data, you will see the following note in the UI:

Image 5: Limit for simultaneous Replay processes

The Period for Historical Data Retrieval field will be inactive and set to 0.

Canceling a Running Historical Retrieval Process

You can't stop or cancel a historical data retrieval process that is in progress. You must wait until it is done, which, depending on the period configured, could take around a dozen minutes.

Editing a Running Historical Retrieval Process

Editing an SLO with Replay enabled while historical data retrieval is in progress will have different consequences depending on the type of edit made (see below).

Generally, the edit action will not immediately affect the background process. The initially requested data retrieval process will be completed for a snapshot of the SLO at the time of its creation, and the results of that retrieval may or may not be shown for the edited SLO.

The result depends on what fields you edit:

  1. Adding a new Objective:

    1. Historical data retrieval will be completed for the original Objective(s).

    2. An error budget taking into account the historical period will not be calculated for the new Objective. Error budget calculation for this Objective will begin at the time of its creation.

  2. Removing an existing Objective:

    1. Historical data retrieval results will be abandoned for that Objective.

    2. Removing an existing Objective doesn't stop the background process of fetching metrics for it. The results are similar to deleting an entire SLO.

  3. Editing an existing Objective's Value or Target:

    1. The edited Objective will be treated as a new one, with the results described above.

    2. The original Objective will be treated as though it has been removed, again with the results described above.

  4. Modifying the Query or Data Source:

    1. Historical data will still be retrieved for the original Query and Data Source.

    2. Error budgets calculated from the moment of the edit will use data from the new Query and Data Source.
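The Objective-related cases (items 1 to 3) can be modeled with set operations. This sketch assumes an Objective is identified by its (value, target) pair, which follows from item 3 treating any change to either field as a removal plus an addition; the function and key names are hypothetical, and Query or Data Source changes (item 4) are not covered.

```python
Objective = tuple[float, float]  # (value, target), an identity assumed for this sketch


def replay_outcome(snapshot: set[Objective], edited: set[Objective]) -> dict[str, set[Objective]]:
    """Classify Objectives after an edit made while retrieval is running.

    `snapshot` is the set of Objectives at SLO creation time, `edited` the set after the edit.
    """
    return {
        "keeps_historical_budget": snapshot & edited,  # unchanged Objectives
        "results_abandoned": snapshot - edited,        # removed, or edited away
        "starts_from_creation": edited - snapshot,     # added, or edited into existence
    }


snapshot = {(200.0, 0.99), (500.0, 0.95)}
edited = {(200.0, 0.99), (500.0, 0.90), (1000.0, 0.999)}
print(replay_outcome(snapshot, edited))
```

Here the Objective whose target changed from 0.95 to 0.90 appears in two groups at once: the original is abandoned and the edited version starts without a historical budget. The background retrieval for abandoned Objectives still runs to completion, as noted above.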

tip

For more details on editing SLOs, see the Editing SLOs guide.

Replay and Composite SLOs

Creating an SLO with historical data retrieval is mutually exclusive with configuring an SLO as a Composite SLO. When you create a Composite SLO, you will see the following message in the UI:

Image 6: Replay and Composite SLOs in the UI

Suppose you’ve created an SLO with Replay enabled, and the historical data retrieval process is running. If you edit that SLO to make it a Composite SLO, this will result in the following consequences:

  1. Historical data retrieval will continue for the original Objectives in that SLO.

  2. The composite Objective won't include data from the historical period in its error budget calculation. The calculation will start from the moment of the creation of the SLO.

Replay - API Rate Limits

| Source | Historical data pulled by 1 API request (hours) |
|---|---|
| AMS Prometheus | 24 |
| Datadog | 4 |
| New Relic | 80 (minutes) |
| Lightstep | 24 |
| Prometheus | 24 |
| Splunk | 24 |

These requests will count toward the data source’s API rate limit along with the requests used to fetch current SLI data (see [here](https://docs.datadoghq.com/api/latest/rate-limits/) for details on Datadog’s rate limiting). Exceeding your rate limit will cause delays in fetching SLI data and prolong the historical data retrieval process.
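To get a feel for how many requests a Replay run adds, the per-request coverage listed in this section can be turned into a rough estimate. The sketch assumes one request per chunk of history for a single query and ignores retries; an SLO with several queries (for example, separate good and total queries) would need correspondingly more. The function name is hypothetical.

```python
import math

# Minutes of historical data covered by one API request, per the table in this section.
MINUTES_PER_REQUEST = {
    "AMS Prometheus": 24 * 60,
    "Datadog": 4 * 60,
    "New Relic": 80,
    "Lightstep": 24 * 60,
    "Prometheus": 24 * 60,
    "Splunk": 24 * 60,
}


def estimated_requests(source: str, period_days: int) -> int:
    """Rough number of API requests needed for one query over the period."""
    period_minutes = period_days * 24 * 60
    return math.ceil(period_minutes / MINUTES_PER_REQUEST[source])


for source in ("Datadog", "New Relic", "Prometheus"):
    print(source, estimated_requests(source, period_days=30))
```

For a 30-day period this gives 180 requests for Datadog, 540 for New Relic, and 30 for Prometheus, which is why a shorter retrieval period helps when a data source's rate limit is tight.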

Replay Troubleshooting

Datadog

If you exceed Datadog's API rate limit for SLOs with Replay, Replay will attempt to fetch data for 20 minutes. If all attempts fail, Nobl9 will create a standard SLO without historical data (that is, Nobl9 will collect data for this SLO from the time of its creation).

Lightstep - Missing Data

Lightstep does not recognize the distinction between missing data and valid data with a 0 value in the stream. In such cases, Lightstep considers these values to be equal and returns the 0 value.

New Relic

Before running Replay, check your New Relic data retention settings to ensure that all your historical data will be collected.

Prometheus

Because Prometheus is an on-premises solution, you control its limits: if Replay keeps exceeding them, increase the limits on your side.

Incorrect Source Configuration

If you configure your Source incorrectly (for example, with incorrect credentials or an incorrect query), Replay will attempt to fetch data for 20 minutes. If the last attempt is unsuccessful, the Replay job will fail. As a result, you will see the No data for this time period error in the UI:

Image 7: Failed Replay job

note

If you enter an incorrect query for Datadog, Replay will fail immediately and you will see the No data for this time period error in the UI.

Data Loading Time

In the Replay beta, loading historical data shouldn't take more than 2 hours, even for extended periods of historical data retrieval. If your Replay process takes longer, contact Nobl9 Support.