4.10 Time Series Forecast#

1. Introduction: The Importance of Time Series Forecasting in Geoscientific Research#

Time series forecasting has been a cornerstone of geoscientific research, providing essential insights into Earth system processes and enabling proactive decision-making in diverse applications. From predicting seasonal climate patterns and river discharge to monitoring seismic activity and volcanic eruptions, time series analysis offers a powerful framework to understand temporal trends, cycles, and anomalies in geophysical data.

Why Time Series Forecasting Matters in Geosciences

  1. Climate and Weather Forecasting:
    Seasonal and interannual climate variations, such as the El Niño-Southern Oscillation (ENSO), have profound impacts on agriculture, water resources, and disaster preparedness. Accurate forecasting helps manage risks associated with droughts, floods, and heatwaves.

  2. Hydrology and Water Resource Management:
    Streamflow and reservoir inflow forecasting are critical for managing water supplies, especially in regions prone to water scarcity or flooding.

  3. Seismic and Volcanic Monitoring:
    While precise prediction of earthquakes remains elusive, forecasting the likelihood of seismic events based on past activity patterns can guide hazard mitigation strategies. Similarly, time series data from volcanic monitoring systems provide early warning for eruptions.

  4. Oceanography and Marine Ecosystems:
    Forecasting changes in sea surface temperatures and ocean currents informs fisheries management and helps monitor phenomena like coral bleaching and ocean acidification.

  5. Renewable Energy:
    Forecasting solar radiation and wind patterns enables efficient integration of renewable energy sources into power grids.

Revival of Interest with AI and Deep Learning In recent years, the advent of AI and deep learning models such as recurrent neural networks (RNNs), transformers, and even large language models (LLMs) has revolutionized time series forecasting. These models:

  • Capture complex temporal dependencies and nonlinear patterns in ways that traditional statistical models often cannot.

  • Enable integration of diverse geoscientific datasets, including time series with spatial and multivariate dimensions.

  • Open new frontiers in tackling previously intractable problems, such as long-term climate forecasting and predicting rare geophysical events.

Goals of the Lecture In this lecture, we will:

  1. Explore the role of time series forecasting in geoscientific research.

  2. Develop and analyze a geoscientifically inspired toy dataset.

  3. Compare approaches:

    • Baseline statistical models for their simplicity and robustness.

    • Classic machine learning models to understand feature engineering and generalization.

    • Deep learning models to harness the power of neural networks for temporal prediction.

  4. Evaluate these methods in terms of their predictive performance, interpretability, and scalability to real-world geoscientific problems.


2. Important Concepts in Time series forecasting#

  1. Context In statistical learning, the context in a time series can refer to the historical data or the portion of the past observations used to predict future values. This historical context is essential for capturing temporal dependencies, trends, seasonality, and other repeating patterns. The amount of context needed depends on the nature of the dataset—for instance, long-term climatic forecasts often require years of past data, while seismic activity predictions may focus on shorter, event-driven contexts. In scientific inquiry, the context might be the scientific knowledge or additional “covariate” (e.g., temperature and precipitation) that might help inform the forecast.

  2. Look-Back Window defines the specific number of past time steps used as input to the forecasting model, equivalent to the context. For example:

    • A short look-back window might capture local or short-term dependencies (e.g., hourly temperature changes).

    • A long look-back window can capture more extended trends and seasonal patterns (e.g., annual temperature cycles). Choosing an appropriate look-back window is a balance: too short and the model may miss important patterns, too long and it may include irrelevant or noisy data.

  3. Horizon refers to how far into the future the model predicts. This can range from:

    • Short-term horizons, such as the next hour or day, where temporal dependencies are strong and relatively predictable.

    • Long-term horizons, such as monthly or annual predictions, where uncertainty grows due to external influences, stochastic variability, and system non-linearity.

    • Prediction Length defines the number of future time steps forecasted in a single prediction. Models can be designed to:

  4. Increasing Uncertainty with Longer Horizons Forecast accuracy generally decreases as the prediction horizon lengthens. This is because:

    • Error propagation: In multi-step predictions, small errors in early predictions compound as they are fed into the model for subsequent predictions.

    • Complexity of future influences: Longer horizons are more affected by external and often unpredictable factors, such as sudden climatic events or anomalous geophysical activity.

    • Loss of signal strength: Temporal correlations weaken over time, making it harder for models to extract meaningful patterns for far-future predictions.


3. Toy Dataset#

We’ll generate synthetic geoscientific time series data simulating monthly surface temperature anomalies for a fictional region over 30 years. The data will incorporate:

  1. Seasonality: A sinusoidal signal with annual cycles.

  2. Trend: A small upward trend over time (e.g., global warming).

  3. Noise: Gaussian noise to simulate variability.

  4. Anomalies: Sudden deviations simulating geoscientific events like volcanic eruptions or ENSO effects.

import numpy as np
import pandas as pd

# Generate time series
np.random.seed(42)
time = pd.date_range(start='1990-01-01', end='2019-12-31', freq='M')
n = len(time)

# Components
trend = np.linspace(0, 2, n)  # Linear trend
seasonality = 2 * np.sin(2 * np.pi * (np.arange(n) % 12) / 12)  # Annual cycle
noise = np.random.normal(0, 0.5, n)  # Random noise
anomalies = np.zeros(n)
anomalies[np.random.choice(n, 5, replace=False)] = np.random.uniform(-5, 5, 5)  # Sudden anomalies

# Final series
temperature_anomalies = trend + seasonality + noise + anomalies

# Create DataFrame
data = pd.DataFrame({'Date': time, 'Temperature_Anomaly': temperature_anomalies})
data.set_index('Date', inplace=True)
data.to_csv('geoscientific_time_series.csv')
data.head()
Temperature_Anomaly
Date
1990-01-31 0.248357
1990-02-28 0.936439
1990-03-31 2.067037
1990-04-30 2.778228
1990-05-31 1.637258
# plot the data
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Temperature_Anomaly'], label='Temperature Anomaly')
plt.title('Simulated Temperature Anomalies')
plt.xlabel('Date')
plt.ylabel('Anomaly')
plt.grid()
plt.legend()
plt.show()
../_images/mlgeo_4.10_timeseriesforecast_2_0.png

Define Context and Horizon

Given this long time series, we will focus on making a rolling forecast by defining a context and a horizon window.

# define a context and a horizon window
context_window = 12
horizon_window = 6
# Split data into train/test, which is basically the context (train) and horizon (test)
train = data['Temperature_Anomaly'][:-horizon_window] # context window here is the entire thing
test = data['Temperature_Anomaly'][-horizon_window:]

3. Forecasting Methods#

3.1 Baseline: Statistical Forecasts#

  • Naive Forecast: Use the last observed value as the next prediction.

  • Seasonal Naive Forecast: Use the value from the same season in the previous year.

  • ARIMA: Fit an AutoRegressive Integrated Moving Average model.

from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Naive Forecast
naive_forecast = np.array([train.iloc[-1]] * horizon_window)
naive_rmse = np.sqrt(mean_squared_error(test, naive_forecast))
naive_mae = np.mean(np.abs(test - naive_forecast))

# seasonal naive forecast
seasonal_naive_forecast = np.array([train.iloc[-12]] * horizon_window)
seasonal_naive_rmse = np.sqrt(mean_squared_error(test, seasonal_naive_forecast))
seasonal_naive_mae = np.mean(np.abs(test - seasonal_naive_forecast))

# ARIMA Model
arima_model = ARIMA(train, order=(2, 1, 2)).fit()
arima_forecast = arima_model.forecast(steps=horizon_window)
arima_rmse = np.sqrt(mean_squared_error(test, arima_forecast))
arima_mae = np.mean(np.abs(test - arima_forecast))

# report all evaluation metrics
print(f"Naive RMSE: {naive_rmse}, Naive MAE: {naive_mae}")
print(f"Seasonal Naive RMSE: {seasonal_naive_rmse}, Seasonal Naive MAE: {seasonal_naive_mae}")
print(f"ARIMA RMSE: {arima_rmse}, ARIMA MAE: {arima_mae}")
/Users/marinedenolle/opt/miniconda3/envs/mlgeo/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency M will be used.
  self._init_dates(dates, freq)
/Users/marinedenolle/opt/miniconda3/envs/mlgeo/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency M will be used.
  self._init_dates(dates, freq)
/Users/marinedenolle/opt/miniconda3/envs/mlgeo/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency M will be used.
  self._init_dates(dates, freq)
Naive RMSE: 2.389823903102479, Naive MAE: 2.2973182078545107
Seasonal Naive RMSE: 1.5820363949721554, Seasonal Naive MAE: 1.4384894213432322
ARIMA RMSE: 0.5550037051446414, ARIMA MAE: 0.4465414221142508
/Users/marinedenolle/opt/miniconda3/envs/mlgeo/lib/python3.9/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
# plot the three predictions from the base models
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Temperature_Anomaly'], label='Actual')
plt.plot(test.index, arima_forecast, label='ARIMA Forecast', color='red')
plt.plot(test.index, naive_forecast, label='ARIMA Forecast', color='orange')
plt.plot(test.index, seasonal_naive_forecast, label='ARIMA Forecast', color='yellow')
plt.title('ARIMA Forecast vs Actual')
plt.xlabel('Date')
plt.ylabel('Anomaly')
plt.grid()
plt.legend()
plt.show()
../_images/mlgeo_4.10_timeseriesforecast_8_0.png

3.2. Classic Machine Learning#

  • Use Lag Features and Seasonality Features as input.

  • Algorithms: Linear Regression, Random Forest, Gradient Boosting.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Create context features
for lag in range(1, context_window + 1):
    data[f'Lag_{lag}'] = data['Temperature_Anomaly'].shift(lag) # shift the data
data.dropna(inplace=True)


# Create training data as the past, test data as the future 
X = data.iloc[:, 1:].values
y = data['Temperature_Anomaly'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Gradient Boosting Regressor
model = GradientBoostingRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate
gbr_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
gbr_mae = np.mean(np.abs(y_test - y_pred))
print(f"Gradient Boosting RMSE: {gbr_rmse} , Gradient Boosting MAE: {gbr_mae}")

# plot the predictions
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Temperature_Anomaly'], label='Actual')
plt.plot(data.index[-len(y_test):], y_pred, label='GBR Forecast', color='green')
plt.title('Gradient Boosting Forecast vs Actual')
plt.xlabel('Date')
plt.ylabel('Anomaly')
plt.grid()
plt.legend()
Gradient Boosting RMSE: 0.6784327786162885 , Gradient Boosting MAE: 0.5225129530636484
<matplotlib.legend.Legend at 0x30ca181c0>
../_images/mlgeo_4.10_timeseriesforecast_10_2.png

3.3 Deep Learning#

  • Models: RNNs, LSTMs, GRUs, or Transformers.

  • Framework: PyTorch.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Custom dataset for PyTorch
class TimeSeriesDataset(Dataset):
    def __init__(self, data, context_window, horizon_window):
        self.X = [data[i:i + context_window] for i in range(len(data) - (context_window+horizon_window))]
        self.y = [data[i+context_window:i+context_window+horizon_window] for i in range(len(data) - (context_window+horizon_window))]
    
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return torch.tensor(self.X[idx], dtype=torch.float32), torch.tensor(self.y[idx], dtype=torch.float32)

# RNN Model
class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        _, h_n = self.rnn(x)
        out = self.fc(h_n[-1])
        return out
# Prepare data
ts_data = TimeSeriesDataset(data['Temperature_Anomaly'].values, context_window, horizon_window)
dataloader = DataLoader(ts_data, batch_size=16, shuffle=True)
# Model, loss, and optimizer
print(context_window, horizon_window)
model = RNNModel(input_size=context_window, hidden_size=50, num_layers=4, output_size=horizon_window)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
12 6
# Training loop
for epoch in range(50):
    epoch_loss = 0
    for X_batch, y_batch in dataloader:
        # X_batch = X_batch.unsqueeze(-1)  # Add feature dimension
        y_pred = model(X_batch).squeeze()
        loss = criterion(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch: {epoch}, Loss: {epoch_loss/len(dataloader)}')
Epoch: 0, Loss: 3.00634267216637
Epoch: 1, Loss: 2.6486164728800454
Epoch: 2, Loss: 2.626796557789757
Epoch: 3, Loss: 2.6268500714074996
Epoch: 4, Loss: 2.6511067549387612
Epoch: 5, Loss: 2.6555847894577753
Epoch: 6, Loss: 2.6372431346348355
Epoch: 7, Loss: 2.627526181084769
Epoch: 8, Loss: 2.7000722998664495
Epoch: 9, Loss: 2.6251020034154258
Epoch: 10, Loss: 2.608908255894979
Epoch: 11, Loss: 2.613211399032956
Epoch: 12, Loss: 2.6141854581378756
Epoch: 13, Loss: 2.620620619683039
Epoch: 14, Loss: 2.624014150528681
Epoch: 15, Loss: 2.605562698273432
Epoch: 16, Loss: 2.6024255752563477
Epoch: 17, Loss: 2.585343973977225
Epoch: 18, Loss: 2.5852765242258706
Epoch: 19, Loss: 2.665037166504633
Epoch: 20, Loss: 2.6180632625307356
Epoch: 21, Loss: 2.6397100516727994
Epoch: 22, Loss: 2.572316567103068
Epoch: 23, Loss: 2.571771576291039
Epoch: 24, Loss: 2.616808993475778
Epoch: 25, Loss: 2.5992123512994674
Epoch: 26, Loss: 2.5893679005759105
Epoch: 27, Loss: 2.5868957894189015
Epoch: 28, Loss: 2.5527749913079396
Epoch: 29, Loss: 2.5988870348249162
Epoch: 30, Loss: 2.5735259453455606
Epoch: 31, Loss: 2.615531495639256
Epoch: 32, Loss: 2.623429911477225
Epoch: 33, Loss: 2.567932753335862
Epoch: 34, Loss: 2.588108857472738
Epoch: 35, Loss: 2.569101481210618
Epoch: 36, Loss: 2.5912431478500366
Epoch: 37, Loss: 2.5857595716203963
Epoch: 38, Loss: 2.5676723207746233
Epoch: 39, Loss: 2.5795442263285318
Epoch: 40, Loss: 2.622153815769014
Epoch: 41, Loss: 2.598675188564119
Epoch: 42, Loss: 2.568843750726609
Epoch: 43, Loss: 2.5866817860376266
Epoch: 44, Loss: 2.631495998019264
Epoch: 45, Loss: 2.6336796397254583
Epoch: 46, Loss: 2.586955535979498
Epoch: 47, Loss: 2.6301064434505643
Epoch: 48, Loss: 2.585471754982358
Epoch: 49, Loss: 2.547369820731027
# Evaluate
test_input = torch.tensor(data['Temperature_Anomaly'].values[-(context_window+horizon_window):-horizon_window].reshape(1,-1), dtype=torch.float32)
rnn_forecast = model(test_input).detach().numpy()
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Assuming test contains the ground truth values for the last 12 months
ground_truth = data['Temperature_Anomaly'][-horizon_window:].values

# Calculate RMSE
rnn_rmse = np.sqrt(mean_squared_error(ground_truth, rnn_forecast))
print(f"RNN RMSE: {rnn_rmse}")

# Calculate MAE
rnn_mae = mean_absolute_error(ground_truth, rnn_forecast)
print(f"RNN MAE: {rnn_mae}")
RNN RMSE: 0.64634883661374
RNN MAE: 0.47775794587681136

4. Comparison and Discussion#

  1. Evaluation Metrics:

    1. Mean Squared Error (MSE)

      • Use: Measures the average squared difference between predicted and actual values. It penalizes larger errors more heavily.

      • Considerations: While useful, MSE doesn’t account for trends or seasonal components, making it potentially less informative for complex time series (e.g., those with strong seasonality).

      • Best for: General forecasts when trends and magnitudes of errors are important.

    2. Mean Absolute Error (MAE)

      • Use: Measures the average of the absolute errors. Unlike MSE, it doesn’t heavily penalize outliers.

      • Considerations: MAE can be more robust when there are outliers, which is common in power-law distributed data. However, it also doesn’t account for trends or periodicity.

      • Best for: Situations where the model needs to be evaluated without over-penalizing large errors (e.g., for power-law distributions).

    3. Root Mean Squared Error (RMSE)

      • Use: RMSE is the square root of MSE, offering a more interpretable metric in terms of the units of the original data.

      • Considerations: Like MSE, it is sensitive to outliers. For time series with strong trends, RMSE can be large if forecasts consistently lag or lead trends.

      • Best for: General forecasting, but less ideal for highly seasonal or trending data unless seasonality is detrended.

    4. Mean Absolute Percentage Error (MAPE)

      • Use: Measures the percentage error between the forecast and the actual value. It normalizes the error by the magnitude of the actual values.

      • Considerations: MAPE can be problematic when actual values are close to zero (leading to high percentages). It is also less effective for power-law distributions where low-frequency, high-magnitude events can skew the results.

      • Best for: Forecasts where relative error matters more than absolute error, especially for non-power-law distributed data.

    5. Symmetric Mean Absolute Percentage Error (sMAPE)

      • Use: A variation of MAPE that adjusts for issues with division by zero and symmetry in errors.

      • Considerations: Like MAPE, it works better for data without extreme outliers or heavy-tailed distributions.

      • Best for: Relative error measurement, especially for periodic or seasonal data where symmetry in error matters.

    6. Mean Squared Logarithmic Error (MSLE)

      • Use: Useful when the forecast error should be penalized more when underestimating than when overestimating, especially for datasets with a wide dynamic range.

      • Considerations: Works well for power-law distributions or data with large variations in scale, as it compresses the impact of large errors.

      • Best for: Power-law distributed data where large-magnitude events need reduced influence.

    7. Autocorrelation of Residuals

      • Use: Measures whether forecast errors (residuals) are correlated over time, indicating if the model has captured the structure of the data.

      • Considerations: Important for time series with periodicity, as autocorrelated residuals suggest that the model has failed to account for recurring patterns or cycles.

      • Best for: Time series with strong periodicity or seasonality.

    8. Trend-Corrected Metrics:

      • Use: Metrics like Mean Absolute Scaled Error (MASE) help correct for trend or seasonality. MASE normalizes the error by a benchmark model (e.g., a naïve forecast).

      • Considerations: It adjusts for trend and seasonality, making it more reliable for data that has both components.

      • Best for: Data with both trends and periodicity.

    9. Spectral Score (for Periodic Data)

      • Use: Measures the alignment between the frequency components of the forecast and the actual data, useful for evaluating models on data with clear periodicity.

      • Considerations: This metric focuses on the frequency domain, complementing time-domain error metrics for highly periodic or cyclical data.

      • Best for: Time series dominated by periodicity or cyclical patterns.

    10. Quantile Loss/Pinball Loss

    • Use: Evaluates the forecast’s ability to predict different quantiles, especially useful when forecasting intervals or uncertainty bounds.

    • Considerations: Works well with power-law distributions where you care about upper or lower tail behavior (e.g., rare but extreme events).

    • Best for: Time series with significant outliers or fat-tailed distributions.

    1. Directional Accuracy

    • Use: Measures whether the forecast correctly predicts the direction of change (up or down).

    • Considerations: In trending time series, this metric helps assess whether the forecast captures the general movement.

    • Best for: Data with strong trends, where capturing the direction of change is more important than minimizing the magnitude of error.

Metric Selection Based on Data Characteristics

  • Trends: Use trend-corrected metrics (MASE) and directional accuracy. Combine with RMSE or MAE to evaluate overall fit.

  • Periodicity: Autocorrelation of residuals, spectral score, or sMAPE.

  • Power-Law Distributions: Quantile loss, MSLE, or robust metrics like MAE that handle outliers better than squared-error metrics.

import pandas as pd
import numpy as np

# Assuming you have the RMSE and MAE values for each model
# Replace these values with the actual values from your models
results = {
    'Model': ['Baseline', 'CML', 'DL'],
    'RMSE': [arima_rmse, gbr_rmse, rnn_mae],  # Replace with actual RMSE values
    'MAE': [arima_rmse, gbr_rmse, rnn_rmse]    # Replace with actual MAE values
}

# Create a DataFrame to store the results
results_df = pd.DataFrame(results)

# Print the results table
print(results_df)

# Save the results to a CSV file
results_df.to_csv('model_performance_comparison.csv', index=False)
      Model      RMSE       MAE
0  Baseline  0.555004  0.555004
1       CML  0.678433  0.678433
2        DL  0.477758  0.646349

5. Testing Foundation Models#

In this example, we will test Chronos on canonical geoscientific data sets chosen to be representative of geoscience/geophysics:

  1. earthquake catalogs from SoCal

  2. earthquake ground motion waveforms

  3. near surface properties (seismic velocities, temperature, soil moisture)

  4. ice velocity from greenland

  5. GPS positions along a plate boundary that captures seasons and tectonic loading. The data files are gathered on this repository: https://github.com/UW-MLGEO/MLGeo-dataset

The data has been prepared as CSV files with times series. The forescasts are added to the CSV files under key attributes “Chronos-zero-shot”.

Limitations by Chronos. Chronos uses large language models at its core for forecasting time series. The time series is “tokenized” to convert from an array of floats into contect tokens.

By Marine Denolle (mdenolle@uw.edu)

! pip install chronos-forecasting
import torch
torch.cuda.empty_cache()
import gc
gc.collect()
0
import matplotlib.pyplot as plt
import torch
from chronos import ChronosPipeline, ChronosBoltPipeline
import pandas as pd
import numpy as np


# Set the font style and size
plt.rcParams.update({
    'font.size': 12,
    'font.family': 'sans-serif',
    'font.serif': ['DejaVu Sans'],
    'axes.titlesize': 12,
    'axes.labelsize': 10,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10
})
# benchmarking with ARIMA
!pip install statsmodels
from statsmodels.tsa.arima.model import ARIMA

# Assuming df is your DataFrame and 'datetime' is your datetime column
# df["datetime"] = pd.to_datetime(df["datetime"])
# df.set_index("datetime", inplace=True)

# # Assuming 'value' is the column you want to forecast
# data = df['value']

# Fit ARIMA model
# model = ARIMA(data, order=(5, 1, 0))  # (p, d, q) order, adjust as needed
# model_fit = model.fit()

# # Make predictions
# forecast_steps = 10  # Number of steps to forecast
# forecast = model_fit.forecast(steps=forecast_steps)

# # Plot the results
# plt.figure(figsize=(10, 6))
# plt.plot(data, label='Actual')
# plt.plot(forecast, label='Forecast', color='red')
# plt.legend()
# plt.show()

# # Print forecasted values
# print(forecast)
n_timeseries = 20
pipeline = ChronosBoltPipeline.from_pretrained(
    "amazon/chronos-bolt-base",
    device_map="cuda",  # use "cpu" for CPU inference and "mps" for Apple Silicon
    torch_dtype=torch.bfloat16,
)
from chronos import BaseChronosPipeline, ChronosPipeline, ChronosBoltPipeline

print(BaseChronosPipeline.predict.__doc__)  # for Chronos models
print(ChronosPipeline.predict.__doc__)  # for Chronos models
        Get forecasts for the given time series. Predictions will be
        returned in fp32 on the cpu.

        Parameters
        ----------
        context
            Input series. This is either a 1D tensor, or a list
            of 1D tensors, or a 2D tensor whose first dimension
            is batch. In the latter case, use left-padding with
            ``torch.nan`` to align series of different lengths.
        prediction_length
            Time steps to predict. Defaults to a model-dependent
            value if not given.

        Returns
        -------
        forecasts
            Tensor containing forecasts. The layout and meaning
            of the forecasts values depends on ``self.forecast_type``.
        

        Get forecasts for the given time series.

        Refer to the base method (``BaseChronosPipeline.predict``)
        for details on shared parameters.

        Additional parameters
        ---------------------
        num_samples
            Number of sample paths to predict. Defaults to what
            specified in ``self.model.config``.
        temperature
            Temperature to use for generating sample tokens.
            Defaults to what specified in ``self.model.config``.
        top_k
            Top-k parameter to use for generating sample tokens.
            Defaults to what specified in ``self.model.config``.
        top_p
            Top-p parameter to use for generating sample tokens.
            Defaults to what specified in ``self.model.config``.
        limit_prediction_length
            Force prediction length smaller or equal than the
            built-in prediction length from the model. False by
            default. When true, fail loudly if longer predictions
            are requested, otherwise longer predictions are allowed.

        Returns
        -------
        samples
            Tensor of sample forecasts, of shape
            (batch_size, num_samples, prediction_length).
        
print(ChronosBoltPipeline.predict.__doc__)  # for Chronos-Bolt models
        Get forecasts for the given time series.

        Refer to the base method (``BaseChronosPipeline.predict``)
        for details on shared parameters.
        Additional parameters
        ---------------------
        limit_prediction_length
            Force prediction length smaller or equal than the
            built-in prediction length from the model. False by
            default. When true, fail loudly if longer predictions
            are requested, otherwise longer predictions are allowed.

        Returns
        -------
        torch.Tensor
            Forecasts of shape (batch_size, num_quantiles, prediction_length)
            where num_quantiles is the number of quantiles the model has been
            trained to output. For official Chronos-Bolt models, the value of
            num_quantiles is 9 for [0.1, 0.2, ..., 0.9]-quantiles.

        Raises
        ------
        ValueError
            When limit_prediction_length is True and the prediction_length is
            greater than model's trainig prediction_length.
        

Some functions

def reshape_time_series(df, name_of_target="count" ,n_timeseries=20, duration_years=2):
    """
    Generate a list of time series from the given DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing a 'datetime' column.
    n_timeseries (int): Number of time series to generate.
    duration_years (int): Duration of each time series in years.
    resample_period (str): Resampling period (e.g., 'D' for daily).

    Returns:
    list: A list of DataFrames, each containing a time series.
    pd.DataFrame: A wide-format DataFrame containing the time series.
    """
    df_list = [pd.DataFrame() for _ in range(n_timeseries)]
    kk = 0
    while kk < n_timeseries:
        start_date = df["datetime"].sample().values[0]
        start_date = pd.to_datetime(start_date)
        # Create a time series for the specified duration
        end_date = start_date + pd.DateOffset(years=duration_years)
        if end_date > df["datetime"].max():
            continue    # Skip if the end date exceeds the maximum date in the catalog
        # Create a time series from the catalog and select only the date time and target columns
        time_series = df[(df["datetime"] >= start_date) & (df["datetime"] <= end_date)][["datetime", name_of_target]]
        time_series = time_series.set_index("datetime")
        # time_series = time_series.resample(resample_period).mean().interpolate() # resample and interpolate
        time_series = time_series.ffill()#(method="ffill") # forward fill
        time_series = time_series.bfill()#(method="bfill") # backward fill
        time_series = time_series.reset_index()# reset index to keep datetime as a column
        # remove the "datetime" column to the time_series
        
        df_list[kk] = time_series
        kk += 1 
    df_list_count = [pd.DataFrame() for _ in range(n_timeseries)]
    for ik in range(n_timeseries):
        df_list_count[ik] = df_list[ik][name_of_target]
        df_list_count[ik] = df_list_count[ik].rename(f"target_{ik}")
    df_wide = pd.concat(df_list_count, axis=1)

    # find the sampling rate dt as the difference between the first two dates
    dt = df["datetime"].diff().dt.total_seconds().fillna(0).mean()
    # convert dt to days
    dt = dt / (24 * 3600)
    print("sampling rate {:.2f} days".format(dt))

    # create a time array that is the index of the time series and convert the dae
    df_wide["time_index"] = np.arange(len(df_wide)) #* pd.Timedelta(days=dt)
    # move the last column to the first position
    cols = df_wide.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    df_wide = df_wide[cols]

    # rename columns to target_ID except for the first column that is a datetime
    df_wide.columns = [f"target_{i}" if i != 0 else "time_index" for i in range(len(df_wide.columns))]
    df_wide = df_wide.dropna() 
    
    return df_list, df_wide

# Example usage:
# df_list,df_wide = reshape_time_series(df, n_timeseries=20, duration_years=2, resample_period='D')
def predict_chronos(df, predict_length=64, n_timeseries=20):
    """
    Make predictions for the given time series data.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing the time series data.

    Returns:
    forecast_mean (np.array): The mean forecasted values for the time series.
    lower_bound (np.array): The 5% lower bound of the forecasted values.
    upper_bound (np.array): The 95% upper bound of the forecasted values.
    mean_mae (float): The mean absolute error of the forecasted values.
    no_var_mae (float): The mean absolute error assuming no change in the time series.


    """
    # Ensure df['Date'] is in datetime format
    # df['datetime'] = pd.to_datetime(df['datetime'])

    # Select the first n_timeseries columns for forecasting
    columns_to_forecast = df.columns[1:n_timeseries+1]  # +1 to skip the 'Date' column

    # Calculate the split index for training
    split_index = int(len(df) - predict_length)

    # Split the data into training and evaluation sets for all selected columns
    train_data = df[columns_to_forecast].iloc[:split_index]
    eval_data = df[columns_to_forecast].iloc[split_index:]

    # Convert the training data to a higher-dimensional tensor
    train_tensor = torch.tensor(train_data.values, dtype=torch.float32).T
    # Perform the forecasting using the training data
    # forecast = pipeline.predict(
    #     context=train_tensor,
    #     prediction_length=len(eval_data),  # Predict the same length as the evaluation set
    #     num_samples=50,
    # )
    forecast = pipeline.predict(
        context=train_tensor,
        prediction_length=len(eval_data))  # Predict the same length as the evaluation set
    # )

    # Take the mean across the samples (axis=1) for each time series
    forecast_mean = forecast.mean(dim=1).squeeze().numpy()
    # lower_bound = forecast.quantile(0.05, dim=1).squeeze().numpy()
    # upper_bound = forecast.quantile(0.95, dim=1).squeeze().numpy()


    mae=[]
    # Calculate and print the MAE for each time series
    for i, column_name in enumerate(columns_to_forecast):
        # Calculate MAE for the current time series
        mae.append (np.mean(np.abs(forecast_mean[i]- eval_data[column_name].values)))
        
        # Print the MAE
        print(f'Mean Absolute Error (MAE) for {column_name}: {mae[-1]}')

    mean_mae = np.array(mae).mean()
    print(f'Mean of forecast MAEs = {mean_mae}')

    # Calculate and print the MASE for each time series
    mase=[]
    for i, column_name in enumerate(columns_to_forecast):
        # Calculate MASE for the current time series
        mase.append (np.mean(np.abs(forecast_mean[i] - eval_data[column_name].values)) / np.mean(np.abs(eval_data[column_name].values[1:] - eval_data[column_name].values[:-1])))
        
        # Print the MASE
        print(f'Mean Absolute Scaled Error (MASE) for {column_name}: {mase[-1]}')

    mean_mase = np.array(mase).mean()
    print(f'Mean of forecast MAEs = {mean_mase}')


    # Baseline model: ARIMA
    mae_arima=[]
    # Calculate and print the MAE for each time series
    for i, column_name in enumerate(columns_to_forecast):
        # Calculate MAE for the current time series
        mae_arima.append (np.mean(np.abs(eval_data[column_name].values[-1] - eval_data[column_name].values)))
        
        # Print the MAE
        # print(f'MAE assuming d/dt=0 for {column_name}: {mae_nochangemodel[-1]}')

    mae_arima_mean = np.array(mae_arima).mean()
    print(f'Mean of ARIMA MAEs = {mae_arima_mean}')

    return forecast,mean_mae, mean_mase, mae_arima_mean, split_index
import matplotlib.dates as mdates
def plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="count",filename="geo-forecast.png"):
    """
    Plot the forecasted values along with the confidence intervals.

    Parameters:
    n_timeseries (int): Number of time series to plot.
    df_wide (pd.DataFrame): The wide-format DataFrame containing the time series.
    split_index (int): The index at which the training data ends (or context data) and evaluation (or forecast) data begins.
    forecast (np.array): The forecasted values for the time series.
    filename (str): The filename to save the plot.


    Returns:
    None


    """
    
    # Determine the number of rows and columns for the subplots
    n_timeseries = min(n_timeseries, 12)  # Cap the number of time series to 12
    nrows = (n_timeseries - 1) // 3 + 1
    ncols = min(n_timeseries, 3)

    # Select the first n_timeseries columns for forecasting
    columns_to_forecast = df_wide.columns[1:n_timeseries+1]  # +1 to skip the 'Date' column
    print(forecast.shape)
    forecast_mean = forecast.mean(dim=1).squeeze()
    print(forecast_mean.shape)
    lower_bound = np.percentile(forecast, 5, axis=0)
    upper_bound = np.percentile(forecast, 95, axis=0)

    # Layout the subplots
    fig, axes = plt.subplots(nrows, ncols, figsize=(20, 16), sharex=True)
    # Flatten the 2D array of axes for easy indexing
    if n_timeseries>1:
        axes = axes.flatten()
    else:
        axes = [axes]
        


    # Split the data into training and evaluation sets for all selected columns
    # train_data = df_wide[columns_to_forecast].iloc[:split_index]
    eval_data = df_wide[columns_to_forecast].iloc[split_index:]


    # Calculate and print the MASE for each time series
    mase=[]
    for i, column_name in enumerate(columns_to_forecast):
        # Calculate MASE for the current time series
        mase.append (np.mean(np.abs(forecast_mean[i].numpy()  - eval_data[column_name].values)) / np.mean(np.abs(eval_data[column_name].values[1:] - eval_data[column_name].values[:-1])))
        
    mean_mase = np.array(mase).mean()
    print(f'Mean of forecast MASEs = {mean_mase}')
    print(n_timeseries)
    # Iterate over the first n_timeseries and plot
    if n_timeseries>1:
        for i, column_name in enumerate(columns_to_forecast[0:n_timeseries]):
            # Plot the original data
            axes[i].plot(df_wide['time_index'], df_wide[column_name], label='Original Data')
            
            # Calculate the 5th and 95th percentiles for the confidence interval
            lower_bound = np.percentile(forecast[i, :, :], 5, axis=0)
            upper_bound = np.percentile(forecast[i, :, :], 95, axis=0)
            
            # Plot the forecast
            axes[i].plot(df_wide['time_index'].iloc[split_index:], forecast_mean[i], label='Forecast')
            
            # Plot the confidence intervals
            axes[i].fill_between(df_wide['time_index'].iloc[split_index:], lower_bound, upper_bound, 
                                color='r', alpha=0.2, label='95% CI')


            # set the x-axis labels as the number of days
            # axes[i].set_xticks(np.arange(0,len(df_wide['datetime']), step=30))
            axes[i].grid(True, which='both', axis='both', linestyle='--', linewidth=0.5, color='gray')

            # Plot the evaluation data for reference
            # axes[i].plot(df_wide['datetime'].iloc[split_index:], eval_data[column_name].values, '--', label='Evaluation Data')

            # Format the subplot
            axes[i].set_title(f'{field} {i} , MASE={mase[i]:.2f}')
            axes[i].set_ylabel(field)
            axes[i].set_xlabel('Time Index')
            if i==0: axes[i].legend()

            # Apply x-tick rotation
            # axes[i].xaxis.set_major_locator(mdates.AutoDateLocator())
            # axes[i].xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
            axes[i].tick_params(axis='x', rotation=45)
            # axes[i].xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
            # convert the x-labels to days every month
            # axes[i].xaxis.set_major_locator(plt.MaxNLocator(6))
    else:
        # Plot the original data
        axes[i].plot(df_wide['time_index'], df_wide[column_name], label='Original Data')
        
        # Calculate the 5th and 95th percentiles for the confidence interval
        lower_bound = np.percentile(forecast[i, :, :], 5, axis=0)
        upper_bound = np.percentile(forecast[ i, :, :], 95, axis=0)
        print(forecast_mean.shape)
        # Plot the forecast
        axes[i].plot(df_wide['time_index'].iloc[split_index:], forecast_mean, label='Forecast')
        
        # Plot the confidence intervals
        axes[i].fill_between(df_wide['time_index'].iloc[split_index:], lower_bound, upper_bound, 
                            color='r', alpha=0.2, label='95% CI')


        # set the x-axis labels as the number of days
        # axes[i].set_xticks(np.arange(0,len(df_wide['datetime']), step=30))
        axes[i].grid(True, which='both', axis='both', linestyle='--', linewidth=0.5, color='gray')

        # Plot the evaluation data for reference
        # axes[i].plot(df_wide['datetime'].iloc[split_index:], eval_data[column_name].values, '--', label='Evaluation Data')

        # Format the subplot
        axes[i].set_title(f'{field} {i} , MASE={mase[i]:.2f}')
        axes[i].set_ylabel(field)
        axes[i].set_xlabel('Time Index')
        if i==0: axes[i].legend()

        # Apply x-tick rotation
        # axes[i].xaxis.set_major_locator(mdates.AutoDateLocator())
        # axes[i].xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
        axes[i].tick_params(axis='x', rotation=45)




    # Remove empty subplots if n_timeseries < 8
    for j in range(i + 1, nrows * ncols):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.savefig(filename)
    # also save a SVG 
    plt.savefig(filename.replace('.png','.svg'))
    plt.show()


    # add a column to df_wide with the predictions for each target_{i} labels at the rows of the evaluation data
    for i, column_name in enumerate(columns_to_forecast):
        df_wide.loc[df_wide.index[split_index:], f'forecast_{i}'] = forecast_mean[i].numpy()

    # store the new df_wide with the precition into a new CSV files that takes the filename of the original file
    df_wide.to_csv(filename.replace('.png','.csv'), index=False)

    return
# Quick plot
def quick_plot(df_wide,n_timeseries,field="count",dt=1):
    """
    Quick plot of the time series data.

    Parameters:
    df_wide (pd.DataFrame): The wide-format DataFrame containing the time series.
    n_timeseries (int): Number of time series to plot.
    field (str): The field to plot.


    Returns:
    None
    """
    # Determine the number of rows and columns for the subplots
    n_timeseries2 = min(n_timeseries, 6)  # Cap the number of time series to 12
    nrows = (n_timeseries2 - 1) // 3 + 1
    ncols = min(n_timeseries2, 3)

    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(11, 8))  # define the figure and subplots
    axes = axes.ravel()
    for i in range(n_timeseries2):
        # create a time axis that multiplies time_data by dt
        # Calculate the time index as the difference from the first datetime
        # df_wide['time_index'] = (df_wide['time_index'] - df_wide['time_index'].iloc[0]).dt.total_seconds() / (24 * 3600)

        # Convert the time_index to a float array
        time_index_days_float = df_wide['time_index'].values


      
        ax = axes[i]
        ax.plot(time_index_days_float, df_wide[f"target_{i+1}"], label=f"{field}_{i+1}")
        ax.set_title(f"{field} {i}")
        ax.set_xlabel("Days")
        ax.set_ylabel(field)
        plt.tight_layout()
        ax.grid()

    plt.show()

5.1 Earthquake Data#

# first we load data/data_qtm_catalog.csv
df = pd.read_csv("../data/data_qtm_catalog.csv")
df.head()
datetime count
0 2008-01-01 120
1 2008-01-02 89
2 2008-01-03 146
3 2008-01-04 166
4 2008-01-05 94
# calculate the time difference between each datetime to set up the dt
# convert the datetime column to datetime object
df["datetime"] = pd.to_datetime(df["datetime"])
dt = df["datetime"].diff().dt.total_seconds().fillna(0).mean()
# convert seconds to days
dt = np.ceil(dt / (60 * 60 * 24))
print(dt)
1.0

Reshape for Prediction#

# randomly cut the long time series into n_timeseries, stored in data frames.
df_list,df_wide = reshape_time_series(df, name_of_target="count", n_timeseries=n_timeseries, duration_years=2)
sampling rate 1.00 days
quick_plot(df_wide,n_timeseries,field="QTM count",dt=1)
../_images/mlgeo_4.10_timeseriesforecast_39_0.png

Predict with Chronos & Plot#

forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,n_timeseries=n_timeseries)
# plot forecast
plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="QTM event counts",filename="./plots/quake_qtm_forecast.png")
# save a vectorized plot
Mean Absolute Error (MAE) for target_1: 48.48504137992859
Mean Absolute Error (MAE) for target_2: 50.47730255126953
Mean Absolute Error (MAE) for target_3: 73.13016510009766
Mean Absolute Error (MAE) for target_4: 29.68954849243164
Mean Absolute Error (MAE) for target_5: 65.26891136169434
Mean Absolute Error (MAE) for target_6: 68.76318144798279
Mean Absolute Error (MAE) for target_7: 35.08878254890442
Mean Absolute Error (MAE) for target_8: 57.32973098754883
Mean Absolute Error (MAE) for target_9: 45.120017528533936
Mean Absolute Error (MAE) for target_10: 53.568915605545044
Mean Absolute Error (MAE) for target_11: 37.61477780342102
Mean Absolute Error (MAE) for target_12: 110.07398819923401
Mean Absolute Error (MAE) for target_13: 39.093090534210205
Mean Absolute Error (MAE) for target_14: 80.03271508216858
Mean Absolute Error (MAE) for target_15: 67.37908172607422
Mean Absolute Error (MAE) for target_16: 43.82308888435364
Mean Absolute Error (MAE) for target_17: 192.06181573867798
Mean Absolute Error (MAE) for target_18: 32.19101405143738
Mean Absolute Error (MAE) for target_19: 41.79396343231201
Mean Absolute Error (MAE) for target_20: 69.41318273544312
Mean of forecast MAEs = 62.019915759563446
Mean Absolute Scaled Error (MASE) for target_1: 1.0329920889196824
Mean Absolute Scaled Error (MASE) for target_2: 0.7644399184447067
Mean Absolute Scaled Error (MASE) for target_3: 1.615994528693845
Mean Absolute Scaled Error (MASE) for target_4: 0.7835951215011284
Mean Absolute Scaled Error (MASE) for target_5: 0.9174344970519285
Mean Absolute Scaled Error (MASE) for target_6: 1.675205116482179
Mean Absolute Scaled Error (MASE) for target_7: 0.9149806707702726
Mean Absolute Scaled Error (MASE) for target_8: 1.3712122445769082
Mean Absolute Scaled Error (MASE) for target_9: 1.037809822671646
Mean Absolute Scaled Error (MASE) for target_10: 0.9770821317745623
Mean Absolute Scaled Error (MASE) for target_11: 0.9411163628338063
Mean Absolute Scaled Error (MASE) for target_12: 1.2275909464598587
Mean Absolute Scaled Error (MASE) for target_13: 0.7725422533422971
Mean Absolute Scaled Error (MASE) for target_14: 0.9836248634757355
Mean Absolute Scaled Error (MASE) for target_15: 1.2271992335191315
Mean Absolute Scaled Error (MASE) for target_16: 0.8978388942160257
Mean Absolute Scaled Error (MASE) for target_17: 1.1899974814650582
Mean Absolute Scaled Error (MASE) for target_18: 1.251872768667009
Mean Absolute Scaled Error (MASE) for target_19: 1.2067001357633624
Mean Absolute Scaled Error (MASE) for target_20: 1.5479754025957226
Mean of forecast MAEs = 1.1168602241612433
Mean of ARIMA MAEs = 97.278125
torch.Size([20, 9, 64])
torch.Size([20, 64])
Mean of forecast MASEs = 1.1049544541817105
12
../_images/mlgeo_4.10_timeseriesforecast_41_1.png
df_wide.head()
time_index target_1 target_2 target_3 target_4 target_5 target_6 target_7 target_8 target_9 ... forecast_2 forecast_3 forecast_4 forecast_5 forecast_6 forecast_7 forecast_8 forecast_9 forecast_10 forecast_11
0 0 193 191.0 149.0 216.0 185.0 168.0 223.0 220.0 175.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 212 180.0 220.0 173.0 176.0 190.0 188.0 246.0 150.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 250 183.0 220.0 218.0 142.0 171.0 368.0 205.0 181.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 312 155.0 123.0 199.0 172.0 173.0 588.0 159.0 176.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 242 184.0 163.0 177.0 146.0 128.0 202.0 153.0 207.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 33 columns

!pip install statsmodels
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
Requirement already satisfied: statsmodels in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (0.14.2)
Requirement already satisfied: numpy>=1.22.3 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from statsmodels) (1.26.4)
Requirement already satisfied: scipy!=1.9.2,>=1.8 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from statsmodels) (1.11.4)
Requirement already satisfied: pandas!=2.1.0,>=1.4 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from statsmodels) (2.1.4)
Requirement already satisfied: patsy>=0.5.6 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from statsmodels) (0.5.6)
Requirement already satisfied: packaging>=21.3 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from statsmodels) (24.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2024.1)
Requirement already satisfied: tzdata>=2022.1 in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from pandas!=2.1.0,>=1.4->statsmodels) (2024.1)
Requirement already satisfied: six in /home/mdenolle/miniconda/envs/quake_chronos_moirai/lib/python3.11/site-packages (from patsy>=0.5.6->statsmodels) (1.16.0)
# benchmarking with ARIMA
# Fit ARIMA model
data = df_wide['target_4'][:split_index]
model = ARIMA(data, order=(25, 2, 0))  # (p, d, q) order, adjust as needed
model_fit = model.fit()

# Make predictions
forecast_steps = 64  # Number of steps to forecast
forecast = model_fit.forecast(steps=forecast_steps)

split_index = len(df_wide['target_3'].values) - 64
print(forecast) 
# calculate the MASE between the forecast and the evaluation data
mase=[]
eval_data = df_wide['target_1'][split_index:]
print(len(df_wide['target_1'].values[split_index:]))
# Calculate and print the MASE for each time series
mase.append(np.mean(np.abs(forecast - df_wide['target_1'].values[split_index:])) / np.mean(np.abs(df_wide['target_1'].values[1:] - df_wide['target_1'].values[:-1])))
# print mase
print(f'Mean Absolute Scaled Error (MASE) for ARIMA: {mase[-1]}')

                 

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(data, label='Actual')
plt.plot(forecast, label='Forecast', color='red')
667    170.875002
668    109.156006
669    134.097091
670    115.263988
671    123.108176
          ...    
726    -36.171929
727    -35.465038
728    -41.488969
729    -41.760654
730    -47.384863
Name: predicted_mean, Length: 64, dtype: float64
64
Mean Absolute Scaled Error (MASE) for ARIMA: 4.420838227473031
[<matplotlib.lines.Line2D at 0x7fdd04707e90>]
../_images/mlgeo_4.10_timeseriesforecast_44_2.png

5.2 Icesheet velocities#

Now we test the time series of icesheet velocities in Greenland. This is work done by Brad Lipovsky.

'''
Read the data
'''
df = pd.read_csv('../data/data_ice_jakobshavn.csv',na_values=[-1])
# calculate the time difference between each datetime to set up the dt
# convert the datetime column to datetime object
df["datetime"] = pd.to_datetime(df["Date"])
dt = df["datetime"].diff().dt.total_seconds().fillna(0).mean()
# convert seconds to days
dt = np.ceil(dt / (60 * 60 * 24))
print("sampling rate {} days".format(dt))
sampling rate 8.0 days

Reshape for prediction#

df_wide = df
df_wide.dropna()
Date Pixel Value (x=2500, y=8200) Pixel Value (x=2500, y=8201) Pixel Value (x=2500, y=8202) Pixel Value (x=2500, y=8203) Pixel Value (x=2500, y=8204) Pixel Value (x=2500, y=8205) Pixel Value (x=2500, y=8206) Pixel Value (x=2500, y=8207) Pixel Value (x=2500, y=8208) ... Pixel Value (x=2509, y=8201) Pixel Value (x=2509, y=8202) Pixel Value (x=2509, y=8203) Pixel Value (x=2509, y=8204) Pixel Value (x=2509, y=8205) Pixel Value (x=2509, y=8206) Pixel Value (x=2509, y=8207) Pixel Value (x=2509, y=8208) Pixel Value (x=2509, y=8209) datetime
0 2015-01-01 2690.5570 2682.3616 2658.6174 2606.9355 2530.3400 2432.4604 2337.5603 2238.3130 2131.6187 ... 1860.1473 1811.2555 1772.2811 1739.5938 1691.1827 1651.6400 1610.4110 1563.1417 1513.4183 2015-01-01
1 2015-01-13 2714.7678 2694.3774 2657.5322 2604.6094 2541.0916 2492.7390 2451.3862 2415.7693 2235.5337 ... 1866.2997 1801.4703 1747.1519 1679.1748 1619.7880 1580.2765 1548.2367 1506.4203 1471.4894 2015-01-13
2 2015-01-25 2712.4210 2724.5880 2735.8308 2645.3513 2518.3074 2460.4731 2262.7056 2141.2844 2034.8848 ... 1807.4071 1738.1703 1679.8862 1627.8777 1575.1526 1554.8282 1534.6147 1520.7195 1515.0082 2015-01-25
4 2015-02-18 2641.4170 2613.6772 2577.1538 2528.2330 2444.6372 2356.7375 2280.1716 2216.3354 2187.7751 ... 1803.6461 1748.5105 1696.6669 1643.1813 1591.3317 1547.9862 1504.9419 1459.6843 1411.1000 2015-02-18
5 2015-03-02 2695.7231 2654.4783 2596.1223 2527.4956 2450.6010 2383.8125 2317.7700 2231.7078 2069.8728 ... 1768.4241 1721.1057 1695.4995 1661.2021 1618.2247 1578.6671 1531.1387 1487.0217 1451.7004 2015-03-02
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
437 2024-02-25 2355.7673 2317.1812 2267.6526 2218.5762 2174.4893 2119.7183 2036.1709 1886.2402 1742.2327 ... 1555.8588 1496.1707 1448.4729 1410.3757 1372.7169 1332.9342 1291.7050 1252.8741 1215.6035 2024-02-25
438 2024-03-08 2356.8071 2318.0103 2271.4500 2208.3280 2131.2380 2038.9028 1941.8640 1839.1848 1730.7375 ... 1584.8710 1545.4580 1487.4086 1430.3809 1381.1884 1338.1537 1301.4970 1264.5914 1226.1100 2024-03-08
439 2024-03-20 2369.9165 2318.2954 2239.7700 2166.4363 2106.1460 2046.7198 1997.2346 1934.0216 1812.1205 ... 1542.8604 1510.4088 1461.9852 1407.3877 1358.6483 1318.0781 1282.2297 1241.4624 1196.6234 2024-03-20
440 2024-04-01 2304.3186 2255.6506 2209.5908 2154.6484 2079.5513 1998.6061 1931.5712 1849.2580 1743.1146 ... 1568.9454 1520.7106 1462.4568 1407.1897 1359.1794 1317.0726 1278.7156 1242.9205 1201.6306 2024-04-01
441 2024-04-13 2300.0273 2244.5710 2165.4233 2103.4187 2059.3809 1990.8022 1887.8052 1750.1135 1652.9683 ... 1529.3740 1496.3535 1455.9960 1412.7965 1370.8213 1326.9905 1279.7404 1225.2361 1173.2060 2024-04-13

418 rows × 102 columns

# remove the first column and place the last at first position
df_wide = df_wide.drop(columns=["Date"])
cols = df_wide.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_wide = df_wide[cols]
# now rename all Pixel columns to indexes except for the first one
df_wide.columns = ["time_index"] + ["target_"+str(i) for i in range(len(df_wide.columns)-1)]   
quick_plot(df_wide,n_timeseries,field="Glacier Vel",dt=8)
../_images/mlgeo_4.10_timeseriesforecast_52_0.png
torch.cuda.empty_cache()
import gc
gc.collect()
notebook controller is DISPOSED. 

View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details.
n_predict = 5
time_forecast = n_predict * dt 
print("Forecasting for {} days".format(time_forecast))
Forecasting for 40.0 days

Predict & Plot forecasts with Chronos#

forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,predict_length=n_predict,n_timeseries=n_timeseries)
# plot
plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="Velocity (m/a)",filename="./plots/ice_jakobshavn_forecast.png")
Mean Absolute Error (MAE) for target_0: 42.24474937499999
Mean Absolute Error (MAE) for target_1: 51.846288281250054
Mean Absolute Error (MAE) for target_2: 63.66113656250009
Mean Absolute Error (MAE) for target_3: 68.27394875
Mean Absolute Error (MAE) for target_4: 59.332813359374995
Mean Absolute Error (MAE) for target_5: 54.84119093749996
Mean Absolute Error (MAE) for target_6: 51.02073972656249
Mean Absolute Error (MAE) for target_7: 56.604517070312525
Mean Absolute Error (MAE) for target_8: 60.905154804687484
Mean Absolute Error (MAE) for target_9: 46.845470039062505
Mean Absolute Error (MAE) for target_10: 45.993564921874984
Mean Absolute Error (MAE) for target_11: 51.39334015625
Mean Absolute Error (MAE) for target_12: 63.96544187500003
Mean Absolute Error (MAE) for target_13: 71.486827890625
Mean Absolute Error (MAE) for target_14: 67.60769390625
Mean Absolute Error (MAE) for target_15: 68.74598835937499
Mean Absolute Error (MAE) for target_16: 62.17083863281255
Mean Absolute Error (MAE) for target_17: 65.56442820312505
Mean Absolute Error (MAE) for target_18: 66.02327011718754
Mean Absolute Error (MAE) for target_19: 64.03354175781246
Mean of forecast MAEs = 59.12804723632813
Mean Absolute Scaled Error (MASE) for target_0: 2.010735538753725
Mean Absolute Scaled Error (MASE) for target_1: 2.7710987795736437
Mean Absolute Scaled Error (MASE) for target_2: 2.3186581656485306
Mean Absolute Scaled Error (MASE) for target_3: 2.3714981221370772
Mean Absolute Scaled Error (MASE) for target_4: 2.0618065531055927
Mean Absolute Scaled Error (MASE) for target_5: 1.5175690902323826
Mean Absolute Scaled Error (MASE) for target_6: 0.7876400007342528
Mean Absolute Scaled Error (MASE) for target_7: 0.6949596678740014
Mean Absolute Scaled Error (MASE) for target_8: 0.966631879403238
Mean Absolute Scaled Error (MASE) for target_9: 1.3144366326702754
Mean Absolute Scaled Error (MASE) for target_10: 1.0350373126193724
Mean Absolute Scaled Error (MASE) for target_11: 1.7442638916464657
Mean Absolute Scaled Error (MASE) for target_12: 3.2202381431841833
Mean Absolute Scaled Error (MASE) for target_13: 2.6931039274055864
Mean Absolute Scaled Error (MASE) for target_14: 2.29931356317334
Mean Absolute Scaled Error (MASE) for target_15: 2.0899756897892243
Mean Absolute Scaled Error (MASE) for target_16: 1.4075847948663025
Mean Absolute Scaled Error (MASE) for target_17: 1.3198358208382497
Mean Absolute Scaled Error (MASE) for target_18: 1.4886333326686638
Mean Absolute Scaled Error (MASE) for target_19: 2.2221349867240554
Mean of forecast MAEs = 1.8167577946524083
Mean of ARIMA MAEs = 55.618072
torch.Size([20, 9, 5])
torch.Size([20, 5])
Mean of forecast MASEs = 1.6328613028665464
12
../_images/mlgeo_4.10_timeseriesforecast_56_1.png
# benchmarking with ARIMA
# Fit ARIMA model
data = df_wide['target_3'][:split_index]
model = ARIMA(data, order=(5, 2, 1))  # (p, d, q) order, adjust as needed
model_fit = model.fit()

# Make predictions
forecast_steps = 64  # Number of steps to forecast
forecast = model_fit.forecast(steps=forecast_steps)

split_index = len(df_wide['target_3'].values) - 64
print(forecast) 
# calculate the MASE between the forecast and the evaluation data
mase=[]
eval_data = df_wide['target_1'][split_index:]
print(len(df_wide['target_1'].values[split_index:]))
# Calculate and print the MASE for each time series
mase.append(np.mean(np.abs(forecast - df_wide['target_1'].values[split_index:])) / np.mean(np.abs(df_wide['target_1'].values[1:] - df_wide['target_1'].values[:-1])))
# print mase
print(f'Mean Absolute Scaled Error (MASE) for ARIMA: {mase[-1]}')

                 

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(data, label='Actual')
plt.plot(forecast, label='Forecast', color='red')

5.3 Forecasting GPS velocities#

In this case, we are going to test Chrono’s performance on GPS time series. Vertical components tend to exhibit seasonal loading from precipitation, horizontal components tend to exhibit tectonic processes, especially at plate boundaries.

# read data from data_gps_P395_relateive_position.csv
fname = "../data/data_gps_P395_relative_position.csv"
df = pd.read_csv(fname)
# convert dacimal year column to floats
df["decimal year"] = df["decimal year"].astype(float)
df.head()

# the date format is in decimal years, convert it to datetime
# df["datetime"] = pd.to_datetime(df["decimal year"], format="%Y.%j")
# df.head()


# perform a running mean average to smooth the data
for ikey in df.keys()[1:]:
    df[ikey]=df[ikey].rolling(window=20).mean()
sta_name = fname.split("/")[-1].split("_")[2]
print(sta_name)
# take the first column "decimal year" and convert it to a datetime by taking the year before the comma, then multuply by 365.25 to get the days
df["datetime"] = pd.to_datetime((df["decimal year"] - 1970) * 365.25, unit='D', origin='1970-01-01')
# move the last to the first position
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
df.head()
df_list,df_wide = reshape_time_series(df,name_of_target="new delta v (m)", n_timeseries=n_timeseries, duration_years=4)
# now predict with chronos
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,n_timeseries=n_timeseries)
# plot the forecast
plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="GPS relative position (m)",filename="./plots/gps_"+sta_name+"_v_forecast.png")

Horizontal components#

df_list,df_wide = reshape_time_series(df,name_of_target="new delta e (m)", n_timeseries=n_timeseries, duration_years=2)
# now predict with chronos
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,n_timeseries=n_timeseries)
# plot the forecast
plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="GPS relative position (m)",filename="./plots/gps_"+sta_name+"_e_forecast.png")
torch.cuda.empty_cache()
import gc
gc.collect()

5.4 Forecast dv/v#

# read one dv/v file
fname = "../data/DVV_data_withMean/Data_BGU.csv"
df = pd.read_csv(fname)
df.head()
sta_name = fname.split("/")[-1].split("_")[1]
# convert the date into a timestamp
df["datetime"] = pd.to_datetime(df["date"])
# move datetime to the first position
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]

5.5 Soil moisture#

df_list,df_wide = reshape_time_series(df,name_of_target="sm_ewt", n_timeseries=n_timeseries, duration_years=4)
# now predict with chronos
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,n_timeseries=n_timeseries)
# plot the forecast
plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="Soil Moisture",filename="./plots/"+sta_name+"_SM_forecast.png")
df.keys()
for ikey in df.keys()[2:]:
    if ikey == "lake":
        # remove mean of the column
        df[ikey] = df[ikey] - df[ikey].mean()
    df_list,df_wide = reshape_time_series(df,name_of_target=ikey, n_timeseries=n_timeseries, duration_years=4)
    # now predict with chronos
    forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,n_timeseries=n_timeseries)
    # plot the forecast
    plot_forecasts(df_wide,split_index,forecast,n_timeseries,field=ikey,filename="./plots/"+sta_name+"_"+ikey+"_forecast.png")
torch.cuda.empty_cache()
import gc
gc.collect()

5.6 Seismic Wave#

# read CSV file with waveforms in them
df = pd.read_csv("../data/data_waveforms.csv")
df.head()
# trim data between 100 and 200 rows
df = df.iloc[300:500]
plt.plot(df['time'],df['Z'])
#make a datetime column
df["datetime"] = pd.to_datetime(df["time"], unit='s')
df.head()
df_wide.head()
df_wide = df
# rename the keys from Z, N, E, to target_0, target_1, target_2
df_wide.columns = ["time","target_0","target_1","target_2","datetime"]

# reset df_wide index:
df_wide = df_wide.reset_index(drop=True)

# create a time_index column that is the index of the time series
df_wide["time_index"] = np.arange(len(df_wide))

# move datetime to the first position
cols = df_wide.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_wide = df_wide[cols]

# drop columns time and datetime
df_wide = df_wide.drop(columns=["time","datetime"])

# # now predict with chronos
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,predict_length=64, n_timeseries=len(df.keys())-1)
# # plot the forecast
plot_forecasts(df_wide,split_index,forecast,len(df.keys())-1,field="wave",filename="./plots/waveforms_forecast.png")
# normalize the target data in df_wide to see if this is the problem

df_wide_norm = df_wide.copy()
for ikey in df_wide.keys():
    if ikey != "time_index":
        df_wide_norm[ikey] = (df_wide[ikey] - df_wide[ikey].mean()) / df_wide[ikey].std()


# # now predict with chronos
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide_norm,predict_length=64, n_timeseries=len(df.keys())-1)
# # plot the forecast
plot_forecasts(df_wide_norm,split_index,forecast,len(df.keys())-1,field="wave",filename="./plots/waveforms_forecast_norm.png")
torch.cuda.empty_cache()
import gc
gc.collect()

5.7 CO2#

Here we plan to forecast CO2.

from datetime import datetime, timedelta
fname = "../data/cleaned_data_co2.csv"
df_co2 = pd.read_csv(fname)


# Function to convert decimal year to datetime
def decimal_year_to_datetime(decimal_year):
    year = int(decimal_year)
    remainder = decimal_year - year
    start_of_year = datetime(year, 1, 1)
    days_in_year = (datetime(year + 1, 1, 1) - start_of_year).days
    return start_of_year + timedelta(days=remainder * days_in_year)

# Apply the function to the 'decimale-date' column
df_co2["datetime"] = df_co2["decimale-date"].apply(decimal_year_to_datetime)


# df_co2["datetime"] = pd.to_datetime((df_co2["decimale-date"] - 1970) * 365.25, origin='1970-01-01', unit='D')
df_co2.head()
# # move datetime to the first position
cols = df_co2.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_co2 = df_co2[cols]
df_co2.head()
df_list,df_wide = reshape_time_series(df_co2,name_of_target="monthly-average", n_timeseries=n_timeseries, duration_years=20)
# now predict with chronos
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_wide,n_timeseries=n_timeseries)
# plot the forecast
plot_forecasts(df_wide,split_index,forecast,n_timeseries,field="C02 Moana",filename="./plots/CO2_forecast.png")

5.8 Paleo Climate Benthic D018 time series#

This time series is phenomenal because it shows how the system switches from 100ka cycle to 20-40ka cycles. https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2004PA001071

df_benthic = pd.read_csv("../data/cleaned_lr04.csv")
df_benthic.head()
fig=plt.figure(figsize=(15, 6))
plt.plot(df_benthic["Time-(ka)"],df_benthic["Benthic-d18O-(per-mil)"])   
plt.xlabel("Time (ka)")
plt.ylabel("Benthic d18O (per mil)")
plt.title("Benthic d18O vs Time")
plt.grid() 
# add a time_index column to the dataframe
df_benthic["time_index"] = np.arange(len(df_benthic))
# remove Time-(ka) column
df_benthic = df_benthic.drop(columns=["Time-(ka)"])
# move the last to the first position
cols = df_benthic.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_benthic = df_benthic[cols]
df_benthic.head()
df_benthic_copy.describe()
# now predict with chronos
df_benthic_copy = df_benthic.iloc[:len(df_benthic)//3]
# rename benthic-d18O-(per-mil) to target_1
df_benthic_copy = df_benthic_copy.rename(columns={"Benthic-d18O-(per-mil)":"target_1"})
forecast, mean_mae, mean_mase, no_var_mae, split_index = predict_chronos(df_benthic_copy,predict_length=64,n_timeseries=1)
# plot the forecast
plot_forecasts(df_benthic_copy,split_index,forecast,1,field="D018",filename="./plots/d018_forecast.png")