Evergreen · April 7, 2026

Walk-Forward Cross-Validation in Commodity ML Models: Why Backtesting Alone Fails

Copper · Nickel · Cobalt · Lithium
K-fold leakage inflates commodity model AUC by 15–30% vs walk-forward

The Information Leakage Problem in Commodity Models

Most machine learning tutorials default to k-fold cross-validation. For tabular classification or NLP tasks with i.i.d. samples, this is fine. For time series problems in commodity markets, it is catastrophically wrong.

The issue is temporal leakage. When a k-fold split randomly assigns observations from March 2023 to the training set and observations from January 2023 to the validation set, the model trains on future information. In commodity volatility forecasting, where autocorrelation in realized vol, inventory cycles, and geopolitical shock persistence define the data-generating process, this leakage inflates performance metrics by 15–30% relative to true out-of-sample accuracy.

A model evaluated this way will report an AUC of 0.90 in development and deliver 0.70 in production. The gap is not a bug in the model. It is a bug in the evaluation protocol.

Any systematic desk running ML-derived signals for volatility probability forecasts needs validation that mirrors the actual deployment scenario: the model sees only past data and predicts forward.
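The inflation is easy to reproduce on synthetic data. The sketch below (illustrative, not the Volterra model: a k-nearest-neighbors classifier is used deliberately because it exposes neighbor leakage most directly, and all sizes and seeds are arbitrary) builds features and labels that are each temporally smooth but genuinely unrelated to one another. Shuffled k-fold scores well anyway, because each validation point's temporal neighbors sit in the training set; a time-ordered split does not.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Features: five independent random walks -- temporally smooth,
# but carrying no genuine information about the label.
X = np.cumsum(rng.normal(size=(n, 5)), axis=0)

# Label: sign of an independent AR(1) process (persistent regimes,
# mimicking volatility-regime autocorrelation).
h = np.zeros(n)
for t in range(1, n):
    h[t] = 0.98 * h[t - 1] + rng.normal()
y = (h > 0).astype(int)

def mean_auc(splitter):
    """Average validation AUC across the splitter's folds."""
    aucs = []
    for tr, va in splitter.split(X):
        clf = KNeighborsClassifier(n_neighbors=5).fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[va], clf.predict_proba(X[va])[:, 1]))
    return float(np.mean(aucs))

# Shuffled k-fold: validation points are surrounded by training points
# from the same regime, so the score is inflated despite zero true signal.
kfold_auc = mean_auc(KFold(n_splits=5, shuffle=True, random_state=0))

# Walk-forward: validation is always strictly in the future -> near 0.5.
wf_auc = mean_auc(TimeSeriesSplit(n_splits=5))
```

The feature and label processes are independent by construction, so the honest out-of-sample AUC is 0.5; the entire k-fold score is leakage.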

How Walk-Forward Validation Works

Walk-forward cross-validation (also called time series split or expanding window validation) enforces strict temporal ordering. The procedure:

  1. Define an initial training window (e.g., 365 days of features and labels).
  2. Train the model on this window.
  3. Evaluate on the next N days (the forward validation window).
  4. Expand the training window to include those N days.
  5. Repeat until the dataset is exhausted.

Each fold's validation set contains only observations strictly after the training set. No future data contaminates the training process. The resulting performance estimate reflects what the model would have delivered if deployed at each point in time.
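The five steps above can be sketched as a small split generator (an illustrative stand-in for, e.g., scikit-learn's `TimeSeriesSplit`; the window sizes are placeholders, not the Volterra configuration):

```python
from typing import Iterator, Tuple

def expanding_window_splits(
    n_obs: int, initial_train: int, horizon: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, validation) index ranges in strict temporal order.

    Training always covers [0, split); validation covers the next
    `horizon` observations; the training window then absorbs them.
    """
    split = initial_train
    while split < n_obs:
        val_end = min(split + horizon, n_obs)
        yield range(0, split), range(split, val_end)   # steps 2-3
        split = val_end                                # step 4: expand

# e.g. ~1000 daily observations, 365-day initial window, 30-day forward window
splits = list(expanding_window_splits(1000, 365, 30))
```

Every validation range starts exactly where the training range ends, so no fold can see the future.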

The Volterra model uses this exact procedure. The XGBoost classifier producing 7-day, 14-day, and 30-day volatility probability forecasts across 12 exchange-traded minerals is walk-forward cross-validated, yielding a mean AUC of 0.815 across all minerals and horizons. That number reflects genuine out-of-sample discriminative power, not an artifact of temporal leakage.

A variant worth considering is the sliding window approach, where the training window has a fixed length rather than expanding. This handles non-stationarity better when the data-generating process shifts over time. In commodity markets, where volatility regimes change with supply concentration dynamics and structural demand shifts, both expanding and sliding windows have trade-offs. Expanding windows give the model more data; sliding windows prevent stale regimes from diluting recent patterns.
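The sliding variant differs from the expanding one in a single line: the training range keeps a fixed length and drops the oldest observations as it advances (again an illustrative sketch with placeholder window lengths):

```python
from typing import Iterator, Tuple

def sliding_window_splits(
    n_obs: int, train_len: int, horizon: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, validation) ranges with a fixed-length training window."""
    split = train_len
    while split < n_obs:
        val_end = min(split + horizon, n_obs)
        # Training covers only the most recent `train_len` observations,
        # so stale regimes age out of the window.
        yield range(split - train_len, split), range(split, val_end)
        split = val_end

splits = list(sliding_window_splits(1000, 365, 30))
```

Both generators enforce the same temporal ordering; they differ only in whether old observations stay in scope, which is exactly the expanding-vs-sliding trade-off described above.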

Feature Engineering Hazards Specific to Commodities

Walk-forward validation catches model-level leakage. But feature-level leakage requires separate attention.

Common pitfalls in commodity ML pipelines:

Lagged feature alignment. If your feature matrix includes "7-day realized volatility" as of date T, and your label is "did volatility exceed the 80th percentile in the next 7 days starting from T," the two windows overlap: the feature covers days T−6 through T, the label covers days T through T+6, and both include day T's price data. Proper alignment requires features computed strictly before the label window opens, for example a label window that starts at T+1.
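A minimal pandas sketch of disjoint alignment (variable names illustrative): the feature uses returns through day T, the label uses returns from T+1 through T+7, so the windows never share a day.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ret = pd.Series(rng.normal(0, 0.02, 500))  # synthetic daily returns

# Feature at T: 7-day realized vol over returns T-6 .. T (backward window).
rv7 = ret.rolling(7).std()

# Label basis at T: realized vol over returns T+1 .. T+7.
# rolling(7).std() at index T+7 covers T+1..T+7; shift(-7) aligns it to T.
fwd_rv7 = ret.rolling(7).std().shift(-7)

# Binary label: does forward vol exceed its 80th percentile?
# NOTE: a full-sample quantile is itself look-ahead; in production the
# threshold must be computed point-in-time (e.g. on the training window).
label = (fwd_rv7 > fwd_rv7.quantile(0.80)).astype(int)
```

The backward window ends at T and the forward window starts at T+1, which is the "strictly before the label window opens" condition stated above.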

Alternative data timestamps. News-derived features from sources like the GDELT Global Knowledge Graph arrive with publication timestamps, but processing latency means a story published at 23:50 UTC may not appear in a structured dataset until the following day. The Volterra pipeline processes 96 GDELT GKG files daily and enforces strict as-of-date alignment to prevent look-ahead bias from publication-to-availability delays.
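One simple way to enforce availability-time alignment (a sketch, not the actual Volterra pipeline; the 4-hour worst-case latency is a hypothetical figure) is to stamp each record with an availability timestamp and filter against the prediction snapshot:

```python
import pandas as pd

LATENCY = pd.Timedelta(hours=4)  # hypothetical worst-case ingestion delay

news = pd.DataFrame({
    "published_at": pd.to_datetime(
        ["2023-01-05 09:00", "2023-01-05 23:50"], utc=True),
    "sentiment": [0.3, -0.8],
})
# A record is usable only once it has actually landed in the dataset.
news["available_at"] = news["published_at"] + LATENCY

# Daily prediction snapshot at 00:00 UTC the next day.
as_of = pd.Timestamp("2023-01-06 00:00", tz="UTC")
usable = news[news["available_at"] <= as_of]
```

The 23:50 UTC story becomes available at 03:50 the next day, so it is correctly excluded from the January 6 snapshot even though its publication timestamp precedes it.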

Macro indicator revisions. GDP, PMI, and trade flow data are frequently revised. Using revised values that were unavailable at prediction time is a subtle form of leakage that walk-forward validation alone will not catch unless the feature store preserves point-in-time snapshots.
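A point-in-time feature store can be approximated with a vintage table and an as-of join, sketched here with `pandas.merge_asof` (values and dates are invented for illustration):

```python
import pandas as pd

# Vintage table: one row per release of the same reference period.
vintages = pd.DataFrame({
    "release_date": pd.to_datetime(["2023-02-15", "2023-03-15"]),
    "ref_period": ["2023-01", "2023-01"],
    "pmi": [50.2, 49.1],          # first print, later revised down
}).sort_values("release_date")

asof_dates = pd.DataFrame(
    {"date": pd.to_datetime(["2023-02-20", "2023-04-01"])})

# For each prediction date, take the latest value released on or
# before that date -- never the subsequently revised figure.
pit = pd.merge_asof(asof_dates, vintages,
                    left_on="date", right_on="release_date")
```

A February 20 prediction sees the first print (50.2); only an April 1 prediction sees the revised 49.1. Joining on reference period instead of release date would silently leak the revision backward in time.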

Implications for Signal Consumers

For desks consuming third-party volatility signals, the validation methodology behind those signals directly affects portfolio-level outcomes. An overfit signal introduces systematic bias into VaR estimates, option hedging ratios, and procurement timing decisions.

Questions to ask any signal provider: Is validation walk-forward or k-fold? What is the gap between in-sample and out-of-sample AUC? How are features aligned relative to the label window?

The Volterra dataset reports out-of-sample metrics from walk-forward validation and is available for independent backtesting. Full methodology details are documented in the model methodology section, and the full historical backfill is available on AWS Data Exchange. All figures in this post come from the Volterra daily pipeline.

For teams building internal models or evaluating external signals, the validation protocol is not a technical footnote. It is the difference between a signal that works in production and one that worked only in a research notebook.

Get daily volatility predictions

12 minerals. 3 horizons. Delivered before market open.