Akshat Vasisht

Risk Pricing Weather with Quantile Regression

4 min read
Machine Learning · XGBoost · Production ML · Quantile Regression

I built the ML pipeline for a weather prediction market as part of a team project (CS 506). Teammates handled the Java backend and React frontend; I built the Python ML components: models that generate risk-adjusted odds for weather outcomes.

The work required recognizing a fundamental constraint: optimizing for prediction accuracy is the wrong goal when pricing risk. A point prediction of "77°F" cannot set odds. The system needed confidence intervals. Betting platforms price uncertainty, not outcomes.

That meant two changes: moving from tree-based ensembles to quantile regression, and replacing randomized search with Bayesian optimization — modeling probability distributions instead of minimizing MAE.

Ensembles require uncorrelated errors

I started with an averaging ensemble: XGBoost, RandomForest, and LightGBM. The initial implementation hit a feature-count mismatch at inference (82 features expected vs. 73 supplied) because feature lists were defined locally in each training script rather than serialized alongside the models.
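A common guard against this class of bug is to serialize the exact training-time feature list next to the model artifact and validate it at inference. A minimal sketch — the file name and helper names are illustrative, not the project's actual code:

```python
import json

def save_feature_list(features, path):
    """Persist the exact training-time feature order next to the model."""
    with open(path, "w") as f:
        json.dump(list(features), f)

def validate_features(expected_path, incoming_columns):
    """Fail fast if inference-time columns diverge from training."""
    with open(expected_path) as f:
        expected = json.load(f)
    missing = [c for c in expected if c not in incoming_columns]
    extra = [c for c in incoming_columns if c not in expected]
    if missing or extra:
        raise ValueError(f"Feature mismatch: {len(missing)} missing, {len(extra)} extra")
    return expected  # canonical column order to feed the model

save_feature_list(["temp_lag_1", "humidity", "doy_sin"], "features.json")
cols = validate_features("features.json", ["temp_lag_1", "humidity", "doy_sin"])
```

Failing fast at load time turns a silent 82-vs-73 shape error into an explicit, debuggable exception.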

After centralizing feature definitions, the ensemble achieved an MAE of 5.00°F, identical to the single best model. All three architectures were tree-based, yielding highly correlated errors. Averaging correlated predictions provides no benefit.
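The effect is easy to demonstrate on synthetic data: averaging predictors whose errors share a common component barely moves MAE, while averaging independent errors cuts it by roughly √3. A sketch (the noise scales are arbitrary, chosen only to illustrate the mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
truth = rng.normal(75.0, 8.0, n)

# Three "models" whose errors share a large common component (correlated),
# versus three whose errors are fully independent.
shared = rng.normal(0, 5.0, n)
correlated = [truth + shared + rng.normal(0, 1.0, n) for _ in range(3)]
independent = [truth + rng.normal(0, 5.0, n) for _ in range(3)]

def mae(pred):
    return float(np.mean(np.abs(pred - truth)))

single = mae(correlated[0])
corr_avg = mae(np.mean(correlated, axis=0))
indep_avg = mae(np.mean(independent, axis=0))
print(f"single: {single:.2f}  correlated avg: {corr_avg:.2f}  independent avg: {indep_avg:.2f}")
```

The shared error component survives the average untouched, which is exactly what three tree ensembles trained on the same features produce.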

Stacking with a Ridge meta-model yielded the same result. The coefficients told the story:

XGBoost:      0.55
LightGBM:     0.42
RandomForest: 0.03

The meta-model discarded RandomForest and created an expensive weighted average of XGBoost and LightGBM.

I replaced the tree models with Ridge, SVR, and MLP to introduce algorithmic diversity. The meta-model assigned 80% weight to XGBoost anyway. For this dataset, a single tuned XGBoost model beat ensemble complexity.
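The stacking step itself is simple: fit a regularized linear model on the base models' out-of-fold predictions. A numpy sketch of the closed-form ridge fit on synthetic stand-in predictions (not the project's actual models or data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
y = rng.normal(75.0, 8.0, n)

# Stand-ins for out-of-fold predictions from three base models,
# with increasing noise (the third is the weakest, like RandomForest above).
P = np.column_stack([y + rng.normal(0, s, n) for s in (3.0, 3.5, 6.0)])

lam = 1.0  # ridge penalty
k = P.shape[1]
# Closed-form ridge: w = (P'P + lam*I)^-1 P'y
w = np.linalg.solve(P.T @ P + lam * np.eye(k), P.T @ y)

def mse(pred):
    return float(np.mean((pred - y) ** 2))

stacked = mse(P @ w)
best_single = min(mse(P[:, i]) for i in range(k))
print("weights:", np.round(w, 3), " stacked MSE:", round(stacked, 2))
```

As in the coefficients above, the meta-model pushes weight toward the strongest base models and near-zeroes the weakest — but when the bases are redundant, the stacked error barely improves on the best single model.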

Quantile regression for risk pricing

At MAE ~5.0°F, the application requirements became clear: this is risk management, not forecasting. Point predictions lack confidence intervals. Without quantifying variance, the system cannot distinguish a stable day from a volatile weather event.

I switched to quantile regression, training three models per target:

  • P10: 10th percentile (lower bound)
  • P50: 50th percentile (median)
  • P90: 90th percentile (upper bound)

XGBoost supports this natively via reg:quantileerror:

import xgboost as xgb

base_model = xgb.XGBRegressor(
    objective='reg:quantileerror',
    quantile_alpha=quantile,  # 0.1, 0.5, or 0.9
    random_state=RANDOM_SEED,
    n_jobs=-1,
    tree_method='hist'
)

The P90-P10 spread functions as a risk score. A 4°F spread signals high confidence (tighter odds); a 9°F spread signals volatility (wider odds to protect the house).
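Because the downstream pricing step assumes a Normal distribution, the P10/P90 pair pins that distribution down directly: for a Normal, P90 − P10 = 2·z₀.₉·σ ≈ 2.563σ. A stdlib sketch (the helper name is mine, not the project's):

```python
from statistics import NormalDist

Z90 = NormalDist().inv_cdf(0.90)  # ≈ 1.2816

def implied_sigma(p10, p90):
    """Back out the Normal sigma implied by a P10/P90 quantile pair."""
    return (p90 - p10) / (2 * Z90)

# A tight 4°F spread vs. a volatile 9°F spread:
print(round(implied_sigma(75.0, 79.0), 2))  # ~1.56
print(round(implied_sigma(72.5, 81.5), 2))  # ~3.51
```

The wider spread more than doubles the implied volatility, which is what justifies widening the odds on those days.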

Feature selection and hyperparameter optimization

Randomized search proved inefficient for the parameter space. I replaced it with Bayesian optimization (Optuna) and expanded feature engineering to 100+ features: cyclical temporal encodings, rolling window statistics (3/7/14/30-day), lag features, and meteorological interactions.

To prevent overfitting, I used Recursive Feature Elimination with Cross-Validation (RFECV) to select the optimal feature subset per target. Hyperparameter tuning used TimeSeriesSplit to prevent temporal data leakage:

param_distributions = {
    'n_estimators': optuna.distributions.IntDistribution(100, 1000),
    'learning_rate': optuna.distributions.FloatDistribution(0.01, 0.3, log=True),
    'max_depth': optuna.distributions.IntDistribution(3, 10),
    'subsample': optuna.distributions.FloatDistribution(0.6, 1.0),
    'colsample_bytree': optuna.distributions.FloatDistribution(0.6, 1.0),
    'min_child_weight': optuna.distributions.IntDistribution(1, 10),
}
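The leakage-prevention property of TimeSeriesSplit is that every validation fold sits strictly after its training fold in time, unlike a shuffled K-fold. A quick check (sklearn assumed available; the 365-row array stands in for a year of daily weather):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(365).reshape(-1, 1)  # one year of daily rows, in time order
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Training indices always precede validation indices: no future leaks in.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train through day {train_idx.max()}, "
          f"validate days {test_idx.min()}-{test_idx.max()}")
```

Each Optuna trial scores its sampled hyperparameters against these forward-only folds, so the tuner never rewards a configuration for memorizing the future.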

The final pipeline iterates through targets (high temp, precipitation, wind speed), executes RFECV, trains P10/P50/P90 models via Optuna, and serializes twelve model artifacts.

Deployment and validation

The production system runs nightly jobs that poll NOAA CDO and NWS Grid APIs, pushing historical actuals and forecasts through unit conversion and pricing pipelines.

Odds pricing uses the Normal distribution CDF to estimate bucket probabilities, applies house edge calculations, and adjusts for jackpot bonuses based on prediction accuracy. I validated profitability with an automated backtesting simulator:
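The bucket-pricing step can be sketched with the stdlib alone: the Normal CDF (mean from P50, sigma from the quantile spread) assigns each temperature bucket a probability, and the house edge shades the fair odds. The function name, bucket edges, and the 10% edge here are illustrative, not the production values:

```python
from statistics import NormalDist

def price_buckets(p50, sigma, edges, house_edge=0.10):
    """Return (probability, decimal odds) per bucket under a Normal model."""
    dist = NormalDist(p50, sigma)
    priced = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        p = dist.cdf(hi) - dist.cdf(lo)
        odds = (1 - house_edge) / p  # fair odds shaded by the house edge
        priced.append((round(p, 3), round(odds, 2)))
    return priced

edges = [70, 74, 78, 82, 86]
for (p, odds), (lo, hi) in zip(price_buckets(77.0, 2.0, edges),
                               zip(edges[:-1], edges[1:])):
    print(f"{lo}-{hi}°F: p={p}, odds={odds}")
```

Feeding in a larger sigma flattens the bucket probabilities, which mechanically widens the odds — the quantile spread flows straight through to the prices.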

Simulation Profile   Bets   Win Rate   Margin
Dumb Bot               90     24.4%     12.5%
Sharp Value            12     33.3%     -5.8%
Random Gambler         90     11.1%     10.0%

The quantile bounds handle volatility effectively, converging to the 10% target margin for standard bettors.

The core insight: for risk-based systems, modeling distributions provides operational utility that point predictions cannot. MAE optimization is the wrong objective when the application prices uncertainty.

I later ported this pipeline to CardinalCast, a standalone Python/FastAPI implementation.


© 2026 Akshat Vasisht. All rights reserved.