Pipeline Architecture
From 600+ GB of NOAA weather observations to a deployable live wind analytics platform.
The system separates large-scale historical processing from lightweight deployment. Spark and Airflow build trusted analytical and ML artifacts; the website and FastAPI backend consume those artifacts for dashboards, forecasting diagnostics, and live wind outlooks.
Processed stations
2,419
Verified live stations
1,981
Forecast evaluation rows
535,961
Historical window
1995–2025
State coverage
48 states
Final model
Spark MLlib GBT
End-to-end flow
Data lake → ML pipeline → preserved artifacts → live product
Each stage produces a concrete output that feeds the next layer. The final website remains usable even when the distributed cloud environment is unavailable.
Raw NOAA ISD
AWS Open Data
Hourly station-year CSV files are discovered in the NOAA ISD archive. The raw dataset is global, wide, and sparse, and many of its fields are stored as encoded strings.
Output
station-year CSV inputs
Bronze
Raw ingestion
Spark reads many NOAA files in parallel and writes normalized raw Parquet outputs to reduce small-file overhead.
Output
bronze/isd
Silver
Parsing + cleaning
Encoded fields like WND, TMP, DEW, VIS, CIG, and SLP are parsed, quality-controlled, standardized, and enriched with station metadata.
Output
silver/weather
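As an illustration of the Silver parsing step, here is a minimal pure-Python sketch for two of the encoded fields. It assumes the comma-separated layout of the ISD CSV files (WND as direction, direction quality, type code, speed in tenths of m/s, speed quality; TMP as temperature in tenths of °C plus a quality code) with 999, 9999, and +9999 as missing sentinels. The function names and the set of rejected quality codes are assumptions for this example, not the project's actual Spark code.

```python
from typing import Optional, Tuple

# Assumption: quality codes treated as suspect or erroneous and rejected.
REJECTED_QUALITY = {"2", "3", "6", "7"}


def parse_wnd(raw: str) -> Tuple[Optional[int], Optional[float]]:
    """Parse 'direction,dir_qc,type,speed,speed_qc' into (degrees, m/s)."""
    parts = raw.split(",")
    if len(parts) != 5:
        return None, None
    direction, dir_qc, _type_code, speed, speed_qc = parts

    degrees = None
    if direction != "999" and dir_qc not in REJECTED_QUALITY:
        degrees = int(direction)

    speed_mps = None
    if speed != "9999" and speed_qc not in REJECTED_QUALITY:
        speed_mps = int(speed) / 10.0  # stored value is scaled by 10
    return degrees, speed_mps


def parse_tmp(raw: str) -> Optional[float]:
    """Parse 'value,qc' into degrees Celsius, or None if missing/rejected."""
    parts = raw.split(",")
    if len(parts) != 2:
        return None
    value, qc = parts
    if value == "+9999" or qc in REJECTED_QUALITY:
        return None
    return int(value) / 10.0  # stored value is scaled by 10


print(parse_wnd("160,1,N,0046,1"))
print(parse_tmp("+0128,1"))
```

In a Spark job the same logic would more likely be expressed with column functions such as split and cast rather than Python functions, so that parsing stays inside the engine.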
Gold
Wind analytics
Clean observations are converted into turbine-inspired wind potential metrics and aggregated into station, state, and regional tables.
Output
gold/wind
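The page does not define the turbine-inspired metric, so the following is only a sketch of one common way to turn a wind speed into a capacity factor: a simplified power curve with a cubic ramp between cut-in and rated speed. The cut-in, rated, and cut-out speeds here are illustrative assumptions, as are the function names.

```python
from statistics import mean
from typing import Iterable, Optional


def capacity_factor(
    speed_mps: float,
    cut_in: float = 3.0,    # assumed: no output below this speed
    rated: float = 12.0,    # assumed: full output from this speed
    cut_out: float = 25.0,  # assumed: turbine shuts down at this speed
) -> float:
    """Return a value in [0, 1] from a simplified turbine power curve."""
    if speed_mps < cut_in or speed_mps >= cut_out:
        return 0.0
    if speed_mps >= rated:
        return 1.0
    # Power scales with the cube of wind speed between cut-in and rated.
    return (speed_mps**3 - cut_in**3) / (rated**3 - cut_in**3)


def mean_capacity_factor(speeds: Iterable[Optional[float]]) -> Optional[float]:
    """Average capacity factor over valid observations, skipping missing ones."""
    factors = [capacity_factor(s) for s in speeds if s is not None]
    return mean(factors) if factors else None


print(mean_capacity_factor([2.0, 7.5, None, 13.0]))
```

Averaging per-observation capacity factors, rather than applying the curve to an average speed, matters because the curve is nonlinear.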
Feature Engineering
Forecast table
Lag, rolling, temporal, regional, and long-run state features are assembled into an ML-ready forecasting table.
Output
gold/wind/ml/base
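To make the lag and rolling features concrete, here is a small pure-Python sketch for a single region's daily series. It assumes the series is sorted by date with no gaps; the feature names, lag choices, and window length are hypothetical. The rolling mean uses only days before the current one so the feature does not leak the value being described.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List, Sequence


def build_features(
    dates: Sequence[date],
    values: Sequence[float],
    lags: Sequence[int] = (1, 7),  # assumed lag choices
    window: int = 7,               # assumed rolling window length
) -> List[Dict]:
    """Build one feature row per day with lags, a rolling mean, and a next-day target."""
    rows = []
    start = max(max(lags), window)  # first index with full history
    # Stop one day early so every row has a next-day target.
    for i in range(start, len(values) - 1):
        row = {"date": dates[i], "value": values[i]}
        for lag in lags:
            row[f"lag_{lag}"] = values[i - lag]
        row[f"roll_mean_{window}"] = mean(values[i - window:i])
        row["day_of_year"] = dates[i].timetuple().tm_yday
        row["target_next_day"] = values[i + 1]
        rows.append(row)
    return rows


days = [date(2024, 1, 1) + timedelta(days=k) for k in range(10)]
series = [float(k) for k in range(10)]
for feature_row in build_features(days, series):
    print(feature_row)
```

At Spark scale the equivalent would typically be window functions partitioned by region and ordered by date.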
Spark ML
Historical forecasting
Spark MLlib models are trained and evaluated using time-based splits. The final tuned GBT predicts next-day regional capacity factor.
Output
final_tuned_gbt
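A time-based split can be sketched as follows. The cutoff dates and the three-way train/validation/test layout are assumptions for illustration; the page only states that splits are time-based. The key property is that every evaluation row is strictly later than every training row.

```python
from datetime import date
from typing import Dict, List, Tuple


def time_split(
    rows: List[Dict],
    train_end: date,  # hypothetical cutoff: last day used for training
    valid_end: date,  # hypothetical cutoff: last day used for validation
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split rows chronologically on their 'date' key, with no shuffling."""
    train = [r for r in rows if r["date"] <= train_end]
    valid = [r for r in rows if train_end < r["date"] <= valid_end]
    test = [r for r in rows if r["date"] > valid_end]
    return train, valid, test


sample = [{"date": date(2020 + k, 6, 1), "value": k} for k in range(6)]
train, valid, test = time_split(sample, date(2022, 12, 31), date(2023, 12, 31))
print(len(train), len(valid), len(test))
```

A random split would let the model see days adjacent to the ones it is scored on, which tends to overstate accuracy for forecasting tasks.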
Artifacts
Website preservation
Forecasts, metrics, station lists, trends, benchmark results, and figures are exported as lightweight CSV/JSON/image files.
Output
website/public/data
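The export step could look roughly like the sketch below, which serializes already-collected rows and metrics to CSV and JSON text using only the standard library. The column names and metric keys are hypothetical; the real artifact schemas are not shown on this page.

```python
import csv
import io
import json
from typing import Dict, List, Sequence


def rows_to_csv(rows: List[Dict], fieldnames: Sequence[str]) -> str:
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def metrics_to_json(metrics: Dict) -> str:
    """Serialize a metrics dict to stable, human-readable JSON."""
    return json.dumps(metrics, indent=2, sort_keys=True)


# Hypothetical forecast-vs-actual rows and metrics.
forecast_rows = [{"date": "2024-01-01", "actual": 0.31, "predicted": 0.29}]
print(rows_to_csv(forecast_rows, ["date", "actual", "predicted"]))
print(metrics_to_json({"rmse": 0.1}))
```

Keeping the artifacts as small text files is what lets the website render them without any Spark dependency.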
Website + API
Portable product
Next.js dashboards and FastAPI live analysis consume preserved artifacts and live NOAA observations without needing Spark at runtime.
Output
/live, /results, /forecasting
Heavy processing layer
- PySpark ETL
- NOAA ISD parsing
- quality control
- gold table generation
- feature engineering
- Spark MLlib training
Portable product layer
- Next.js website
- FastAPI live analysis service
- CSV/JSON artifacts
- static figures
- verified station lists
- NOAA/NWS live observations
What survives without EC2/S3
- historical dashboards
- forecasting diagnostics
- benchmark dashboards
- live station explorer
- live wind outlook
- model metrics and interpretation
Historical ML path
Spark model training and evaluation
The forecasting model is trained offline using Spark MLlib on a time-based split. Its predictions are exported into forecast-vs-actual artifacts so the website can show honest holdout evaluation without requiring a live Spark cluster.
Live product path
NOAA-powered live wind outlook
The live service does not pretend to serve the Spark model. It fetches current NOAA observations, computes a turbine-inspired capacity factor, compares the result against preserved historical state summaries, and returns a next-24-hour outlook that can be served without Spark.
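The comparison step of the live path might be sketched like this. It covers only the "compare against historical state summaries" part, takes already-computed capacity factors as input, and does no network access. The tolerance, status labels, and result keys are assumptions, not the service's actual response format.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def summarize_outlook(
    live_cfs: Sequence[Optional[float]],
    historical_mean_cf: float,
    tolerance: float = 0.05,  # assumed band for "near normal"
) -> Dict:
    """Compare live station capacity factors with a state's historical mean."""
    valid = [cf for cf in live_cfs if cf is not None]
    if not valid:
        # No usable live observations for this state.
        return {"status": "no_data", "live_mean_cf": None, "delta": None}

    live_mean = mean(valid)
    delta = live_mean - historical_mean_cf
    if delta > tolerance:
        status = "above_normal"
    elif delta < -tolerance:
        status = "below_normal"
    else:
        status = "near_normal"
    return {
        "status": status,
        "live_mean_cf": round(live_mean, 3),
        "delta": round(delta, 3),
    }


print(summarize_outlook([0.5, None, 0.7], historical_mean_cf=0.4))
```

Returning an explicit no_data status keeps the endpoint usable when live stations are temporarily silent, rather than failing the whole request.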