Pipeline Architecture

From 600+ GB of NOAA weather observations to a deployable live wind analytics platform.

The system separates large-scale historical processing from lightweight deployment. Spark and Airflow build trusted analytical and ML artifacts; the website and FastAPI backend consume those artifacts for dashboards, forecasting diagnostics, and live wind outlooks.

Processed stations

2,419

Verified live stations

1,981

Forecast evaluation rows

535,961

Historical window

1995–2025

State coverage

48 states

Final model

Spark MLlib GBT

End-to-end flow

Data lake → ML pipeline → preserved artifacts → live product

Each stage produces a concrete output that feeds the next layer. The final website remains usable even when the distributed cloud environment is unavailable.

Raw NOAA ISD

AWS Open Data

Hourly station-year CSV files are discovered in the NOAA Integrated Surface Database (ISD) archive. The raw dataset is global, wide, and sparse, and many of its fields are stored as encoded strings.

Output

station-year CSV inputs

Bronze

Raw ingestion

Spark reads many NOAA files in parallel and writes normalized raw Parquet outputs to reduce small-file overhead.

Output

bronze/isd

Silver

Parsing + cleaning

Encoded fields like WND, TMP, DEW, VIS, CIG, and SLP are parsed, quality-controlled, standardized, and enriched with station metadata.

Output

silver/weather
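To make the Silver parsing step concrete, here is a minimal pure-Python sketch of decoding two of the encoded fields. It assumes the standard ISD CSV layout, where WND is `direction,direction_quality,type,speed,speed_quality` with speed in tenths of m/s, and TMP is `value,quality` in tenths of a degree Celsius. The function names `parse_wnd` and `parse_tmp` are hypothetical, and the project's actual Spark parsing and quality-control rules may differ.

```python
from typing import Optional, TypedDict


class WindObs(TypedDict):
    direction_deg: Optional[int]
    speed_ms: Optional[float]
    speed_quality: str


def parse_wnd(raw: str) -> WindObs:
    """Parse an ISD WND string such as '220,1,N,0046,1'."""
    # Field order: direction, direction quality, type code,
    # speed (m/s scaled by 10), speed quality.
    direction, _dir_quality, _type_code, speed, speed_quality = raw.split(",")
    return {
        # 999 and 9999 are the ISD missing-value sentinels for these fields.
        "direction_deg": None if direction == "999" else int(direction),
        "speed_ms": None if speed == "9999" else int(speed) / 10.0,
        # The quality code is passed through; which codes to accept is a
        # project decision and is not assumed here.
        "speed_quality": speed_quality,
    }


def parse_tmp(raw: str) -> Optional[float]:
    """Parse an ISD TMP string such as '+0123,1' into degrees Celsius."""
    value, _quality = raw.split(",")
    # +9999 marks a missing temperature.
    return None if value == "+9999" else int(value) / 10.0
```

In a Spark job the same logic would more likely be expressed with column functions such as `split` and `when`, so that it runs without Python UDF overhead.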

Gold

Wind analytics

Clean observations are converted into turbine-inspired wind potential metrics and aggregated into station, state, and regional tables.

Output

gold/wind
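One way a turbine-inspired metric can be derived from wind speed is an idealized power curve. This is only a sketch. The cut-in, rated, and cut-out speeds below are illustrative assumptions, not values taken from this project, and the cubic ramp is a common simplification rather than the pipeline's confirmed formula.

```python
from statistics import mean
from typing import Iterable


def capacity_factor(
    speed_ms: float,
    cut_in: float = 3.0,    # assumed: below this the turbine produces nothing
    rated: float = 12.0,    # assumed: at or above this output is at its maximum
    cut_out: float = 25.0,  # assumed: at or above this the turbine shuts down
) -> float:
    """Map a wind speed in m/s to a capacity factor in [0, 1]."""
    if speed_ms < cut_in or speed_ms >= cut_out:
        return 0.0
    if speed_ms >= rated:
        return 1.0
    # Power in the wind scales with the cube of speed, so ramp cubically
    # between cut-in and rated speed.
    return (speed_ms**3 - cut_in**3) / (rated**3 - cut_in**3)


def mean_capacity_factor(speeds_ms: Iterable[float]) -> float:
    """Aggregate hourly speeds into one mean capacity factor."""
    return mean(capacity_factor(s) for s in speeds_ms)
```

Aggregating the per-observation values by station, state, or region would then give tables of the kind the Gold layer describes.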

Feature Engineering

Forecast table

Lag, rolling, temporal, regional, and long-run state features are assembled into an ML-ready forecasting table.

Output

gold/wind/ml/base
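The lag and rolling features mentioned above can be illustrated on a single daily series. This is a pure-Python sketch, and the helper name, column names, and window sizes are hypothetical. In Spark the equivalent would normally use window functions partitioned by region and ordered by date.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence


def add_lag_and_rolling(
    series: Sequence[float], lag: int = 1, window: int = 3
) -> List[Dict[str, Optional[float]]]:
    """Build lag and rolling-mean features for a time-ordered series."""
    rows: List[Dict[str, Optional[float]]] = []
    for i, value in enumerate(series):
        # Lag feature: the value `lag` steps earlier, if it exists.
        lagged = series[i - lag] if i >= lag else None
        # Rolling mean over the previous `window` values only, excluding
        # the current one, so the feature never leaks the target.
        past = series[max(0, i - window):i]
        rolling = mean(past) if len(past) == window else None
        rows.append(
            {"value": value, f"lag_{lag}": lagged, f"roll_mean_{window}": rolling}
        )
    return rows
```

Leaving the first rows as `None` instead of filling them keeps the incomplete-history cases explicit, so they can be dropped or imputed deliberately before training.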

Spark ML

Historical forecasting

Spark MLlib models are trained and evaluated using time-based splits. The final tuned GBT predicts next-day regional capacity factor.

Output

final_tuned_gbt
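A time-based split keeps every training row strictly earlier than the evaluation rows, which is what allows the holdout results to stand in for real forecasting. The sketch below shows the idea on plain dictionaries. The `date` key and the cutoff dates are illustrative assumptions, not the project's actual boundaries.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Dict[str, object]


def time_based_split(
    rows: List[Row], train_end: date, valid_end: date
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows chronologically into train, validation, and test sets."""
    if train_end >= valid_end:
        raise ValueError("train_end must be earlier than valid_end")
    # Each row is assigned by its date alone, so no future observation
    # can end up in an earlier partition.
    train = [r for r in rows if r["date"] <= train_end]
    valid = [r for r in rows if train_end < r["date"] <= valid_end]
    test = [r for r in rows if r["date"] > valid_end]
    return train, valid, test
```

With a Spark DataFrame the same split reduces to three `filter` calls on the date column.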

Artifacts

Website preservation

Forecasts, metrics, station lists, trends, benchmark results, and figures are exported as lightweight CSV/JSON/image files.

Output

website/public/data
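Preserving artifacts amounts to writing small, self-describing files that a static site can load directly. Here is a minimal sketch using only the standard library. The file names `forecast_vs_actual.csv` and `metrics.json` and the column names are hypothetical and chosen for illustration only.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List, Tuple


def export_artifacts(
    forecast_rows: List[Dict[str, object]],
    metrics: Dict[str, float],
    out_dir: str,
) -> Tuple[Path, Path]:
    """Write forecast rows as CSV and model metrics as JSON."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # CSV keeps tabular forecast-vs-actual data easy to chart in the browser.
    forecast_path = out / "forecast_vs_actual.csv"
    with forecast_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["date", "region", "actual", "predicted"]
        )
        writer.writeheader()
        writer.writerows(forecast_rows)

    # JSON suits small nested summaries such as evaluation metrics.
    metrics_path = out / "metrics.json"
    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return forecast_path, metrics_path
```

Because the outputs are plain files, the website can read them without any connection to the cluster that produced them.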

Website + API

Portable product

Next.js dashboards and FastAPI live analysis consume preserved artifacts and live NOAA observations without needing Spark at runtime.

Output

/live, /results, /forecasting

Heavy processing layer

PySpark ETL
NOAA ISD parsing
quality control
gold table generation
feature engineering
Spark MLlib training

Portable product layer

Next.js website
FastAPI live analysis service
CSV/JSON artifacts
static figures
verified station lists
NOAA/NWS live observations

What survives without EC2/S3

historical dashboards
forecasting diagnostics
benchmark dashboards
live station explorer
live wind outlook
model metrics and interpretation

Historical ML path

Spark model training and evaluation

The forecasting model is trained offline using Spark MLlib on a time-based split. Its predictions are exported into forecast-vs-actual artifacts so the website can show honest holdout evaluation without requiring a live Spark cluster.

NOAA history → Spark features → final_tuned_gbt → forecast outputs → website diagnostics

Live product path

NOAA-powered live wind outlook

The live service does not pretend to serve the Spark model. It fetches current NOAA observations, computes a turbine-inspired capacity factor, compares against preserved historical state summaries, and returns a deployable next-24-hour outlook.

NOAA live API → FastAPI service → power curve → historical context → live outlook
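The historical-context step of the live path can be sketched as a comparison between current capacity factors and a preserved state average. This illustrates only that comparison, not the full outlook logic. The function name, the `band` threshold, and the labels are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def compare_to_history(
    live_cfs: Sequence[float],
    historical_mean_cf: float,
    band: float = 0.10,  # assumed tolerance for calling conditions "typical"
) -> Dict[str, Union[float, str]]:
    """Compare current capacity factors with a historical state average."""
    if not live_cfs:
        raise ValueError("at least one live capacity factor is required")
    current = mean(live_cfs)
    delta = current - historical_mean_cf
    if delta > band:
        label = "above typical"
    elif delta < -band:
        label = "below typical"
    else:
        label = "near typical"
    return {
        "current_cf": current,
        "historical_cf": historical_mean_cf,
        "delta": delta,
        "label": label,
    }
```

A FastAPI route could return this dictionary directly as its JSON response, with the historical average loaded from the preserved CSV/JSON artifacts at startup.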