CALIFORNIA PM2.5 PREDICTION MODEL

Our flagship model demonstrates what we can build for your organization. It produces daily PM2.5 predictions across 129 California monitoring stations, powered by NASA/NOAA satellite data and LightGBM machine learning.

129
Monitoring Stations
EPA sites across California
5.13
RMSE (µg/m³)
LightGBM cross-validated error
56.9K
Training Samples
Daily observations after cleaning
18
Input Features
Satellite & meteorological variables

What This Model Does

The California PM2.5 model predicts fine particulate matter concentrations and breaks them down into their chemical components. This enables understanding of pollution sources and health impacts beyond simple mass measurements.

  • LightGBM gradient boosting with Bayesian optimization
  • Purged K-fold cross-validation with 30-day gap
  • 16.7% improvement over Elastic Net baseline
  • Robust to temporal autocorrelation
  • Validated against EPA ground monitoring
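The error and improvement figures above can be related with a little arithmetic. The sketch below assumes the 16.7% figure is a relative RMSE reduction, (baseline - model) / baseline; the source does not state the formula or the baseline RMSE, so the implied baseline of roughly 6.16 µg/m³ is derived here and is not a reported result. Function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def relative_improvement(baseline_rmse, model_rmse):
    """Fractional RMSE reduction versus a baseline (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


def implied_baseline(model_rmse, improvement):
    """Baseline RMSE implied by a model RMSE and a fractional improvement."""
    return model_rmse / (1.0 - improvement)


# With the reported 5.13 µg/m³ and 16.7%, the implied baseline is about 6.16.
baseline = implied_baseline(5.13, 0.167)
```

Under this definition, a lower RMSE always gives a positive improvement, which makes the number easy to compare across baselines.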
129
EPA CSN Stations
Real-time Predictions

From Raw Data to Predictions

A rigorous data science pipeline ensuring quality and reproducibility at every step.

01

Data Collection

Ingested 81,197 daily observations from 147 EPA stations across California, merged with HRRR meteorological data, GEOS-CF atmospheric composition, and MAIAC satellite AOD.

81,197 raw samples · 27 initial variables · 2-year timespan (2023-2024)
02

Data Cleaning

Dropped EPA speciated columns (94% missing), removed MAIAC water vapor (38% missing), and filtered incomplete rows. Retained complete cases only to maintain data quality.

56,931 clean samples · 21 final columns · 30% data reduction
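The cleaning step can be sketched in a few lines: drop columns that are mostly missing, then keep complete cases only. This is a minimal illustration in plain Python, not the production pipeline; the 30% missing-value threshold and the column names are assumptions chosen so that columns like the ones described (94% and 38% missing) would be dropped.

```python
def clean(rows, max_missing_frac=0.30):
    """Drop sparse columns, then keep only rows with no missing values.

    rows: list of dicts mapping column name -> value (None means missing).
    max_missing_frac: hypothetical cutoff; columns above it are dropped.
    Returns (cleaned_rows, kept_columns).
    """
    if not rows:
        return [], []
    columns = list(rows[0].keys())
    n = len(rows)
    # Fraction of missing values per column.
    missing = {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}
    kept = [c for c in columns if missing[c] <= max_missing_frac]
    # Complete cases only, restricted to the surviving columns.
    cleaned = [
        {c: r[c] for c in kept}
        for r in rows
        if all(r.get(c) is not None for c in kept)
    ]
    return cleaned, kept


# Toy data with made-up values: "so4" is 75% missing, "aod" is 25% missing.
toy_rows = [
    {"pm25": 8.1, "aod": 0.12, "so4": None},
    {"pm25": 6.4, "aod": None, "so4": None},
    {"pm25": 12.0, "aod": 0.30, "so4": None},
    {"pm25": 9.5, "aod": 0.21, "so4": 1.1},
]
toy_clean, toy_columns = clean(toy_rows)
```

Dropping sparse columns before filtering rows matters: filtering first would discard nearly every row because of the mostly empty speciation columns.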
03

Exploratory Analysis

Analyzed feature distributions, temporal patterns, and correlations. HRRR variables show expected meteorological patterns; MAIAC AOD captures aerosol loading variability.

18 predictor features · 129 active stations · No remaining nulls
04

Model Training

Trained LightGBM with Bayesian hyperparameter optimization. Used Purged K-Fold CV with 30-day temporal gap to prevent data leakage from autocorrelation.

RMSE: 5.13 µg/m³ · R²: 0.35 · 16.7% better than baseline
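To make the purging idea concrete, here is one possible way to generate purged K-fold splits. It is a sketch under assumptions, not the exact procedure used for this model: it assumes each sample carries an integer day number, that folds are contiguous blocks of days, and that any training sample within `gap_days` of the test block on either side is discarded.

```python
def purged_kfold(days, n_splits=5, gap_days=30):
    """Yield (train_idx, test_idx) pairs with a temporal gap around each test block.

    days: sequence of integer day numbers, one per sample (stations may share days).
    Training samples must be strictly more than gap_days away from the test block.
    """
    unique_days = sorted(set(days))
    if n_splits < 2 or n_splits > len(unique_days):
        raise ValueError("n_splits must be between 2 and the number of distinct days")
    fold_size, remainder = divmod(len(unique_days), n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread any leftover days over the first folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_start, test_end = unique_days[start], unique_days[stop - 1]
        test_idx = [i for i, d in enumerate(days) if test_start <= d <= test_end]
        train_idx = [
            i for i, d in enumerate(days)
            if d < test_start - gap_days or d > test_end + gap_days
        ]
        yield train_idx, test_idx
        start = stop
```

Splitting on days rather than on rows keeps all stations observed on the same day in the same fold, so same-day observations from different sites never straddle the train/test boundary.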

Exploratory Data Analysis

Before training any model, we conducted rigorous statistical analysis to understand data distributions, feature relationships, and spatio-temporal patterns. Here are the key findings.

Target Variable Distribution (PM2.5)

8.47
Mean (µg/m³)
6.8
Median (µg/m³)
7.82
Std Dev (µg/m³)
2.14
Skewness

Key Finding: PM2.5 distribution is right-skewed (skewness = 2.14), with most days showing good air quality but occasional high-pollution events reaching 98.2 µg/m³. This suggests the model must handle both typical conditions and extreme events.
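For readers who want to reproduce a skewness figure like the one above, the sketch below computes the moment-based (Fisher-Pearson) coefficient. Which estimator produced the reported 2.14 is not stated in the source, so treat the choice here as an assumption; bias-corrected variants give slightly different values on small samples.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5
```

A positive value means a long right tail, which is consistent with the reported mean (8.47) sitting above the median (6.8).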

EPA AQI Category Distribution

Good (0-12): 71.2%
Moderate (12-35): 24.8%
USG (35-55): 3.2%
Unhealthy (>55): 0.8%
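The category shares above come from binning daily PM2.5 values by concentration. A minimal sketch of that binning follows, using the breakpoints listed in this section (12, 35, and 55 µg/m³). The ranges as written share their endpoints, so the sketch assumes each upper bound is inclusive; the function names are illustrative.

```python
from collections import Counter

CATEGORIES = ["Good", "Moderate", "USG", "Unhealthy"]


def pm25_category(value):
    """Map a daily PM2.5 value (µg/m³) to the category used in this section."""
    if value < 0:
        raise ValueError("PM2.5 concentration cannot be negative")
    if value <= 12:
        return "Good"
    if value <= 35:
        return "Moderate"
    if value <= 55:
        return "USG"  # Unhealthy for Sensitive Groups
    return "Unhealthy"


def category_shares(values):
    """Percentage of observations falling in each category."""
    if not values:
        raise ValueError("no observations")
    counts = Counter(pm25_category(v) for v in values)
    return {name: 100.0 * counts[name] / len(values) for name in CATEGORIES}
```

For example, the reported mean of 8.47 µg/m³ falls in "Good", while the 98.2 µg/m³ peak falls in "Unhealthy".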

Feature-Target Correlations

Top Predictive Features

geos_pm25_rh35_gcc: r = +0.423
GEOS-CF PM2.5 estimate
maiac_AOD_055: r = +0.312
MAIAC Aerosol Optical Depth
maiac_AOD_047: r = +0.298
MAIAC AOD (blue band)
geos_no2: r = +0.187
GEOS-CF Nitrogen Dioxide
hrrr_t2m: r = -0.142
HRRR 2m Temperature
hrrr_sp: r = -0.089
HRRR Surface Pressure

Correlation Insights

GEOS-CF PM2.5 shows the strongest correlation (r=0.42), validating that the NASA atmospheric model captures real PM2.5 variability.

MAIAC AOD features (r≈0.30) confirm satellite-derived aerosol optical depth as a useful, if moderate, PM2.5 proxy, consistent with peer-reviewed literature.

Temperature shows negative correlation (r=-0.14), indicating higher PM2.5 in cooler conditions—likely due to winter inversions and heating emissions.

Multicollinearity Check

maiac_AOD_047 ↔ maiac_AOD_055: r = 0.97. Expected, since both bands come from the same instrument. Handled via the tree-based model's implicit feature selection.
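The correlation and multicollinearity figures above are Pearson coefficients. The sketch below shows how such a check might be run; the 0.9 flagging threshold and the toy feature values are assumptions for illustration, not numbers from the actual dataset.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(features, 2):
        r = pearson_r(features[a], features[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Toy values only: two nearly identical AOD bands and an unrelated temperature.
toy_features = {
    "aod_047": [0.10, 0.20, 0.30, 0.40],
    "aod_055": [0.11, 0.19, 0.32, 0.41],
    "t2m": [290.0, 285.0, 295.0, 280.0],
}
toy_flagged = collinear_pairs(toy_features)
```

On the toy data only the two AOD bands are flagged, mirroring the single high-correlation pair reported above.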

Seasonal Patterns

Winter
10.2 µg/m³
Fall
9.1 µg/m³
Summer
7.8 µg/m³
Spring
6.7 µg/m³

Winter shows 52% higher PM2.5 than Spring, driven by atmospheric inversions and residential heating.
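The 52% figure follows directly from the seasonal means: (10.2 - 6.7) / 6.7 ≈ 0.52. The sketch below reproduces that arithmetic and shows one way to aggregate by season. The month-to-season mapping (December through February as Winter, and so on) is an assumption; the source does not say how seasons were defined.

```python
from collections import defaultdict

# Assumed meteorological seasons for the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def seasonal_means(records):
    """Average PM2.5 per season from (month, pm25) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, pm25 in records:
        season = SEASON_BY_MONTH[month]
        sums[season] += pm25
        counts[season] += 1
    return {season: sums[season] / counts[season] for season in sums}


def percent_higher(a, b):
    """How much higher a is than b, as a percentage of b."""
    return 100.0 * (a - b) / b


# Reported seasonal means: Winter 10.2 and Spring 6.7 µg/m³.
winter_vs_spring = percent_higher(10.2, 6.7)
```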

Spatial Variation

Highest PM2.5 Stations

18.4 µg/m³
Fresno - Central Valley
16.2 µg/m³
Kern County - Bakersfield
14.8 µg/m³
Tulare - San Joaquin Valley

Lowest PM2.5 Stations

4.2 µg/m³
San Mateo - Coastal
4.5 µg/m³
Santa Cruz - Coastal
4.8 µg/m³
Marin County - Coastal

Central Valley stations show roughly 3 to 4 times the PM2.5 of coastal sites; topographic basin trapping is a key factor.

EDA-Driven Model Design

Our exploratory analysis directly informed model architecture: the right-skewed target distribution guided our choice of tree-based methods over linear models; high multicollinearity between AOD bands validated LightGBM's implicit feature selection; and strong seasonal patterns justified temporal cross-validation with 30-day purging gaps to prevent data leakage.

7
EDA Dimensions
18
Features Analyzed

Key Input Features

The model combines satellite observations with meteorological data to predict ground-level PM2.5 concentrations.

AOD

Aerosol Optical Depth

MAIAC satellite aerosol loading

T2M

Temperature

HRRR 2m air temperature

RH

Relative Humidity

HRRR atmospheric moisture

WIND

Wind Speed

HRRR u/v wind components

PM25

GEOS PM2.5

NASA GEOS-CF model estimate

NO2

Nitrogen Dioxide

GEOS-CF trace gas
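Since the wind feature is listed as u/v components, a scalar wind speed has to be derived from them. The sketch below shows the standard conversion, plus the meteorological direction (the bearing the wind blows from). Whether the model uses the raw components or a derived speed is not stated in the source, and the function names are illustrative.

```python
import math


def wind_speed(u, v):
    """Wind speed from eastward (u) and northward (v) components, same units."""
    return math.hypot(u, v)


def wind_direction(u, v):
    """Meteorological direction in degrees: where the wind blows FROM.

    0 = from the north, 90 = from the east, 270 = from the west.
    """
    return math.degrees(math.atan2(-u, -v)) % 360.0
```

For instance, u = 3 and v = 4 m/s give a speed of 5 m/s, and a purely eastward flow (u > 0, v = 0) is reported as a wind from 270 degrees, that is, a westerly.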

Data Sources

The model fuses data from multiple satellite and ground-based sources to create comprehensive feature sets for prediction.

NASA

GEOS-CF

Global atmospheric composition forecasts providing chemical concentrations and meteorological variables.

O3 · NO2 · CO · PM2.5 · Temperature · Humidity
NOAA

HRRR

High-Resolution Rapid Refresh model for detailed meteorological conditions.

Wind Speed · Wind Direction · Precipitation · Pressure · Cloud Cover
NASA

MAIAC

Multi-Angle Implementation of Atmospheric Correction for satellite aerosol optical depth.

AOD 047 · AOD 055 · Column Water Vapor · Surface Reflectance
EPA

CSN

Chemical Speciation Network ground truth measurements for model training and validation.

PM2.5 Mass · SO4 · NO3 · OC · EC · DUST · SS

Want Something Similar?

We can build a custom environmental prediction model for your region, variables, and use case. Let's discuss your requirements.