canari.data_process#

Data processing tools for time series.

class canari.data_process.DataProcess(data: DataFrame, train_start: str | None = None, train_end: str | None = None, validation_start: str | None = None, validation_end: str | None = None, test_start: str | None = None, test_end: str | None = None, train_split: float | None = None, validation_split: float | None = 0.0, test_split: float | None = 0.0, time_covariates: List[str] | None = None, output_col: list[int] = [0], standardization: bool | None = True)[source]#

Bases: object

This module provides the DataProcess class to facilitate:

  • Standardization of datasets based on training statistics

  • Splitting data into training, validation, and test sets

  • Adding time covariates (hour, day, month, etc.) to data

  • Generating lagged versions of features

  • Adding synthetic anomalies to data

Parameters:
  • data (pd.DataFrame) – Input DataFrame with a datetime or numeric index.

  • train_start (Optional[str]) – Start index for training set.

  • train_end (Optional[str]) – End index for training set.

  • validation_start (Optional[str]) – Start index for validation set.

  • validation_end (Optional[str]) – End index for validation set.

  • test_start (Optional[str]) – Start index for test set.

  • test_end (Optional[str]) – End index for test set.

  • train_split (Optional[float]) – Fraction of data for training set.

  • validation_split (Optional[float]) – Fraction for validation set.

  • test_split (Optional[float]) – Fraction for test set.

  • time_covariates (Optional[List[str]]) – Time covariates to add to the dataset (e.g., “hour_of_day”).

  • output_col (list[int]) – Column indices for the target variables.

  • standardization (Optional[bool]) – Whether to apply data standardization (zero mean, unit standard deviation).

data#

The full dataset, including any added time covariates.

Type:

pd.DataFrame

train_start#

Start index for training set.

Type:

Optional[str]

train_end#

End index for training set.

Type:

Optional[str]

validation_start#

Start index for validation set.

Type:

Optional[str]

validation_end#

End index for validation set.

Type:

Optional[str]

test_start#

Start index for test set.

Type:

Optional[str]

test_end#

End index for test set.

Type:

Optional[str]

standardization#

Whether standardization is applied. Defaults to True.

Type:

Optional[bool]

train_split#

Fraction of data used for training.

Type:

Optional[float]

validation_split#

Fraction of data used for validation.

Type:

Optional[float]

test_split#

Fraction of data used for testing.

Type:

Optional[float]

time_covariates#

Time covariates to add (e.g., “hour_of_day”).

Type:

Optional[List[str]]

output_col#

Indices of columns used as output/target variables.

Type:

list[int]

scale_const_mean#

Mean values used for standardization.

Type:

Optional[list]

scale_const_std#

Std dev values used for standardization.

Type:

Optional[list]

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from canari import DataProcess
>>> dt_index = pd.date_range(start="2025-01-01", periods=11, freq="H")
>>> data = pd.DataFrame({'value': np.linspace(0.1, 1.0, 11)},
...                     index=dt_index)
>>> dp = DataProcess(data,
...                  train_split=0.7,
...                  validation_split=0.2,
...                  test_split=0.1,
...                  time_covariates=["hour_of_day"],
...                  standardization=True)
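
Once constructed, values for a given split can be pulled from the instance, for example (a usage sketch building on the construction above; the variable name is illustrative):

>>> train_values = dp.get_data(split="train", standardization=True)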
static add_lagged_columns(data: DataFrame, lags_per_column: list[int]) DataFrame[source]#

Add lagged versions of each column and append them to the dataset as new columns.

Parameters:
  • data (pd.DataFrame) – Input DataFrame with datetime index.

  • lags_per_column (list[int]) – Number of lags per column.

Returns:

New DataFrame with lagged columns.

Return type:

pd.DataFrame

Examples

>>> data_lag = DataProcess.add_lagged_columns(data, [2])
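
For intuition, a minimal pandas sketch of the lagging idea (the helper and the generated column names are illustrative assumptions, not the library's exact implementation):

import pandas as pd

def add_lags_sketch(df: pd.DataFrame, lags_per_column: list[int]) -> pd.DataFrame:
    out = df.copy()
    for col, n_lags in zip(df.columns, lags_per_column):
        for lag in range(1, n_lags + 1):
            # .shift(lag) places the value from time t-lag on row t
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return out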
static add_synthetic_anomaly(data: Dict[str, ndarray], num_samples: int, slope: List[float], anomaly_start: float | None = 0.33, anomaly_end: float | None = 0.66) List[Dict[str, ndarray]][source]#

Add randomly generated synthetic anomalies to the original data. Within the original data, a window is defined between anomaly_start and anomaly_end (expressed as ratios of the series length, 0-1). The anomaly onset is drawn uniformly at random within this window. From the onset onward, the data is linearly shifted at a rate of change defined by slope.

Parameters:
  • data (dict) – Data dict with “x” and “y”.

  • num_samples (int) – Number of anomalies to generate.

  • slope (list[float]) – Slope values for the anomalies.

  • anomaly_start (float, optional) – Start of the anomaly window (0-1). Defaults to 0.33.

  • anomaly_end (float, optional) – End of the anomaly window (0-1). Defaults to 0.66.

Returns:

Data dicts with anomalies injected.

Return type:

list

Examples

>>> train_set, val_set, test_set, all_data = dp.get_splits()
>>> train_set_with_anomaly = DataProcess.add_synthetic_anomaly(
...     train_set,
...     num_samples=2,
...     slope=[0.01, 0.1],
... )
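
A minimal sketch of the injection logic described above, for a single series and a single slope (the function name and per-sample handling are illustrative assumptions, not the library's exact code):

import numpy as np

def inject_anomaly_sketch(y: np.ndarray, slope: float,
                          anomaly_start: float = 0.33,
                          anomaly_end: float = 0.66,
                          seed: int | None = None) -> np.ndarray:
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float).flatten()
    n = len(y)
    # Draw the anomaly onset uniformly within the window (ratios of series length)
    onset = int(rng.integers(int(anomaly_start * n), int(anomaly_end * n)))
    out = y.copy()
    # From the onset onward, shift the data linearly at the given rate
    out[onset:] += slope * np.arange(n - onset)
    return out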
static decompose_data(data) Tuple[ndarray, float, ndarray, ndarray][source]#

Decompose a time series into a linear trend, seasonality, and residual as follows:

  • Use a Fourier transform to estimate seasonality.

  • Compute deseasonalized_data = data - seasonality.

  • Estimate a linear trend by fitting deseasonalized_data with a first-order polynomial.

  • Compute residual = data - trend - seasonality.

Parameters:

data (np.ndarray) – 1D array.

Returns:

(trend, slope_of_trend, seasonality, residual)

Return type:

tuple

Examples

>>> train_set, val_set, test_set, all_data = dp.get_splits()
>>> trend, slope_of_trend, seasonality, residual = DataProcess.decompose_data(train_set["y"].flatten())
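
For intuition, a minimal numpy sketch of these steps (the single-dominant-frequency Fourier estimate and the function name are illustrative assumptions, not the library's exact implementation):

import numpy as np

def decompose_sketch(data: np.ndarray):
    n = len(data)
    t = np.arange(n)
    # Seasonality: keep only the dominant non-zero Fourier frequency
    fft = np.fft.rfft(data - data.mean())
    k = int(np.argmax(np.abs(fft[1:]))) + 1
    season_fft = np.zeros_like(fft)
    season_fft[k] = fft[k]
    seasonality = np.fft.irfft(season_fft, n)
    # Trend: first-order polynomial fit on the deseasonalized series
    slope, intercept = np.polyfit(t, data - seasonality, 1)
    trend = slope * t + intercept
    residual = data - trend - seasonality
    return trend, slope, seasonality, residual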
get_data(split: str, standardization: bool | None = False, column: int | None = None) ndarray[source]#

Return a specific column’s values for a given data split.

Parameters:
  • split (str) – One of [‘train’, ‘validation’, ‘test’, ‘all’].

  • standardization (Optional[bool]) – Whether to standardize the output. Defaults to False.

  • column (Optional[int]) – Column index.

Returns:

The extracted values.

Return type:

np.ndarray

Examples

>>> values = dp.get_data(split="train", standardization=True, column=0)
get_split_indices() Tuple[array, array, array][source]#

Get the index ranges for the train, validation, and test splits.

Returns:

Train, validation, and test indices.

Return type:

Tuple[np.array, np.array, np.array]

Examples

>>> train_index, val_index, test_index = dp.get_split_indices()
get_splits() Tuple[Dict[str, ndarray], Dict[str, ndarray], Dict[str, ndarray]][source]#

Return the training, validation, and test splits.
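
Examples

>>> train_set, val_set, test_set, all_data = dp.get_splits()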

get_time(split: str) ndarray[source]#

Get datetime indices corresponding to a given split.

Parameters:

split (str) – One of [‘train’, ‘validation’, ‘test’, ‘all’].

Returns:

Array of timestamps.

Return type:

np.ndarray

Examples

>>> time = dp.get_time(split="train")
standardize_data() ndarray[source]#

Standardize the data using training statistics.

Returns:

Standardized dataset.

Return type:

np.ndarray
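
The standardization uses the training statistics stored in scale_const_mean and scale_const_std, giving zero mean and unit standard deviation per column. A minimal usage sketch, assuming the dp instance from the class example above:

Examples

>>> standardized_data = dp.standardize_data()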