canari.data_process#
Data processing tools for time series.
- class canari.data_process.DataProcess(data: DataFrame, train_start: str | None = None, train_end: str | None = None, validation_start: str | None = None, validation_end: str | None = None, test_start: str | None = None, test_end: str | None = None, train_split: float | None = None, validation_split: float | None = 0.0, test_split: float | None = 0.0, time_covariates: List[str] | None = None, output_col: list[int] = [0], standardization: bool | None = True)[source]#
Bases:
object
This module provides the DataProcess class to facilitate:
Standardization of datasets based on training statistics
Splitting data into training, validation, and test sets
Adding time covariates (hour, day, month, etc.) to data
Generating lagged versions of features
Adding synthetic anomalies to data
- Parameters:
data (pd.DataFrame) – Input DataFrame with a datetime or numeric index.
train_start (Optional[str]) – Start index for training set.
train_end (Optional[str]) – End index for training set.
validation_start (Optional[str]) – Start index for validation set.
validation_end (Optional[str]) – End index for validation set.
test_start (Optional[str]) – Start index for test set.
test_end (Optional[str]) – End index for test set.
train_split (Optional[float]) – Fraction of data for training set.
validation_split (Optional[float]) – Fraction for validation set.
test_split (Optional[float]) – Fraction for test set.
time_covariates (Optional[List[str]]) – Time covariates to add to the dataset (e.g., “hour_of_day”).
output_col (list[int]) – Indices of columns used as output/target variables.
standardization (Optional[bool]) – Whether to apply data standardization (zero mean, unit standard deviation).
- data#
The full dataset, including any added time covariates.
- Type:
pd.DataFrame
- train_start#
Start index for training set.
- Type:
Optional[str]
- train_end#
End index for training set.
- Type:
Optional[str]
- validation_start#
Start index for validation set.
- Type:
Optional[str]
- validation_end#
End index for validation set.
- Type:
Optional[str]
- test_start#
Start index for test set.
- Type:
Optional[str]
- test_end#
End index for test set.
- Type:
Optional[str]
- standardization#
Whether standardization is applied. Defaults to True.
- Type:
Optional[bool]
- train_split#
Fraction of data used for training.
- Type:
Optional[float]
- validation_split#
Fraction of data used for validation.
- Type:
Optional[float]
- test_split#
Fraction of data used for testing.
- Type:
Optional[float]
- time_covariates#
Time covariates to add (e.g., “hour_of_day”).
- Type:
Optional[List[str]]
- output_col#
Indices of columns used as output/target variables.
- Type:
list[int]
- scale_const_mean#
Mean values used for standardization.
- Type:
Optional[list]
- scale_const_std#
Std dev values used for standardization.
- Type:
Optional[list]
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from canari import DataProcess
>>> dt_index = pd.date_range(start="2025-01-01", periods=11, freq="H")
>>> data = pd.DataFrame({'value': np.linspace(0.1, 1.0, 11)}, index=dt_index)
>>> dp = DataProcess(
...     data,
...     train_split=0.7,
...     validation_split=0.2,
...     test_split=0.1,
...     time_covariates=["hour_of_day"],
...     standardization=True,
... )
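Once configured, the splits can be retrieved directly (a short usage sketch built from the methods documented below):
>>> train_set, val_set, test_set, all_data = dp.get_splits()
>>> train_index, val_index, test_index = dp.get_split_indices()
>>> y_train = dp.get_data(split="train", standardization=True, column=0)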
- static add_lagged_columns(data: DataFrame, lags_per_column: list[int]) DataFrame [source]#
Add lagged versions of each column to the dataset as new columns.
- Parameters:
data (pd.DataFrame) – Input DataFrame with datetime index.
lags_per_column (list[int]) – Number of lags per column.
- Returns:
New DataFrame with lagged columns.
- Return type:
pd.DataFrame
Examples
>>> data_lag = DataProcess.add_lagged_columns(data, [2])
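Conceptually, a lag of k shifts a column down by k time steps. A minimal pandas sketch of the idea (the lag column names below are illustrative, not the method’s own naming):
>>> import pandas as pd
>>> df = pd.DataFrame({"value": [0.1, 0.2, 0.3, 0.4]})
>>> df["value_lag1"] = df["value"].shift(1)  # value at the previous step
>>> df["value_lag2"] = df["value"].shift(2)  # value two steps back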
- static add_synthetic_anomaly(data: Dict[str, ndarray], num_samples: int, slope: List[float], anomaly_start: float | None = 0.33, anomaly_end: float | None = 0.66) List[Dict[str, ndarray]] [source]#
Add randomly generated synthetic anomalies to the original data. A window is defined between anomaly_start and anomaly_end (as ratios of the series length, 0-1), and the anomaly onset is drawn uniformly at random within that window. From the onset onward, the data is linearly shifted at a rate of change defined by slope.
- Parameters:
data (dict) – Data dict with “x” and “y”.
num_samples (int) – Number of anomalies to generate.
slope (list[float]) – Slope of the linear shift for each anomaly.
anomaly_start (float, optional) – Start of the anomaly window (0-1). Defaults to 0.33.
anomaly_end (float, optional) – End of the anomaly window (0-1). Defaults to 0.66.
- Returns:
Data dicts with anomalies injected.
- Return type:
list
Examples
>>> train_set, val_set, test_set, all_data = dp.get_splits()
>>> train_set_with_anomaly = DataProcess.add_synthetic_anomaly(
...     train_set,
...     num_samples=2,
...     slope=[0.01, 0.1],
... )
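The injection logic itself can be sketched in plain numpy, reusing train_set from the example above (illustrative only, not the library’s implementation):
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> y = train_set["y"].flatten().copy()
>>> onset = int(rng.uniform(0.33, 0.66) * len(y))  # onset drawn uniformly within the window
>>> y[onset:] += 0.01 * np.arange(len(y) - onset)  # linear shift at the given slope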
- static decompose_data(data) Tuple[ndarray, float, ndarray, ndarray] [source]#
Decompose a time series into a linear trend, seasonality, and residual, as follows:
Use a Fourier transform to estimate the seasonality.
Compute deseasonalized_data = data - seasonality.
Estimate a linear trend by fitting deseasonalized_data with a first-order polynomial.
Compute residual = data - trend - seasonality.
- Parameters:
data (np.ndarray) – 1D array.
- Returns:
(trend, slope_of_trend, seasonality, residual)
- Return type:
tuple
Examples
>>> train_set, val_set, test_set, all_data = dp.get_splits()
>>> trend, slope_of_trend, seasonality, residual = DataProcess.decompose_data(
...     train_set["y"].flatten()
... )
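These steps can be sketched in plain numpy, reusing train_set from the example above (illustrative only; the library may select the seasonal components differently):
>>> import numpy as np
>>> y = train_set["y"].flatten()
>>> t = np.arange(len(y))
>>> spec = np.fft.rfft(y - y.mean())
>>> k = np.argmax(np.abs(spec[1:])) + 1  # dominant non-zero frequency
>>> keep = np.zeros_like(spec)
>>> keep[k] = spec[k]
>>> seasonality = np.fft.irfft(keep, n=len(y))  # Fourier estimate of seasonality
>>> slope, intercept = np.polyfit(t, y - seasonality, 1)  # first-order fit of deseasonalized data
>>> trend = slope * t + intercept
>>> residual = y - trend - seasonality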
- get_data(split: str, standardization: bool | None = False, column: int | None = None) ndarray [source]#
Return a specific column’s values for a given data split.
- Parameters:
split (str) – One of [‘train’, ‘validation’, ‘test’, ‘all’].
standardization (Optional[bool]) – Whether to standardize the output. Defaults to False.
column (Optional[int]) – Column index.
- Returns:
The extracted values.
- Return type:
np.ndarray
Examples
>>> values = dp.get_data(split="train", standardization=True, column=0)
- get_split_indices() Tuple[array, array, array] [source]#
Get the index ranges for the train, validation, and test splits.
- Returns:
Train, validation, and test indices.
- Return type:
Tuple[np.array, np.array, np.array]
Examples
>>> train_index, val_index, test_index = dp.get_split_indices()
- get_splits() Tuple[Dict[str, ndarray], Dict[str, ndarray], Dict[str, ndarray], Dict[str, ndarray]] [source]#
Return the training, validation, and test splits, plus the full dataset.
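- Returns:
Training, validation, and test splits, plus the full dataset.
- Return type:
Tuple[dict, dict, dict, dict]
Examples
>>> train_set, val_set, test_set, all_data = dp.get_splits()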