Benchmarking Results

DuckDB vs Spark Benchmarking Dashboard

This page compares single-node DuckDB execution with Spark execution on equivalent analytical workloads. It explains where lightweight local analytics are enough and where distributed processing becomes justified.

Engines Compared

DuckDB vs Spark

Single-node analytical engine compared with distributed Spark.

Benchmark Rows

3

Exported benchmark observations used in this dashboard.

Fastest Observed Run

0.127 sec

Fastest runtime across DuckDB and Spark benchmark columns.
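How the "fastest observed run" figure could be derived from the exported rows — a minimal sketch. The row structure and all runtimes except the 0.127 sec value reported above are illustrative placeholders, not the project's actual measurements.

```python
# Hypothetical exported benchmark rows (3 observations, as on this dashboard).
# Runtimes are illustrative placeholders, except the 0.127 sec figure
# reported above; they are not the project's actual measurements.
rows = [
    {"task": "filter_station", "duckdb_sec": 0.127, "spark_sec": 2.1},
    {"task": "yearly_rollup",  "duckdb_sec": 0.340, "spark_sec": 1.9},
    {"task": "join_stations",  "duckdb_sec": 0.512, "spark_sec": 2.6},
]

# Fastest observed run = minimum across both engine runtime columns.
fastest = min(min(r["duckdb_sec"], r["spark_sec"]) for r in rows)
print(f"Fastest observed run: {fastest:.3f} sec")  # 0.127 sec with these rows
```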

Benchmark Summary

This benchmark compares DuckDB and Spark on equivalent analytical tasks. The goal is not to prove that either engine is always faster; it is to show the tradeoff: DuckDB is excellent for compact local analytics, while Spark is appropriate when the same workflow scales to partitioned, multi-year, cloud-backed NOAA datasets.

Benchmark interpretation

DuckDB is expected to perform very well on smaller local analytical workloads because it avoids distributed scheduling overhead. Spark can look slower on small data, but it becomes valuable when the same workload needs to scale across much larger NOAA partitions, multiple years, many stations, or cloud storage.
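The scheduling-overhead effect described above can be observed with a simple timing harness. The sketch below uses Python's standard `time.perf_counter`; the workload callable is a stand-in, where the real benchmark would execute the same SQL once through DuckDB and once through a SparkSession.

```python
import time

def fastest_runtime(run, repeats=5):
    """Time a callable several times and return the best (minimum)
    wall-clock run in seconds; taking the minimum filters out
    OS scheduling noise between repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; in the real benchmark this callable would run the
# same analytical query through DuckDB and, separately, through Spark.
t = fastest_runtime(lambda: sum(x * x for x in range(50_000)))
print(f"{t:.6f} sec")
```

Reporting the minimum of several repeats is also why small-data DuckDB numbers stay stable, while Spark's per-query scheduling cost shows up in every repeat.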

DuckDB vs Spark Benchmark Runtime

Runtime comparison across equivalent analytical workloads. Lower bars are faster. DuckDB is faster on compact local workloads, while Spark is built for distributed scale.

Benchmark Runtime by Task

Exported benchmark visualization comparing DuckDB and Spark runtime across analytical workloads.


Spark Runtime Relative to DuckDB

Relative Spark runtime compared to DuckDB for the same benchmark tasks.

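One way the relative-runtime values could be computed from the exported rows — a minimal sketch. The column names (`duckdb_sec`, `spark_sec`) and the runtimes are illustrative placeholders, not the project's measured values.

```python
# Hypothetical exported benchmark rows; runtimes are illustrative
# placeholders, not the project's actual measurements.
rows = [
    {"task": "filter_station", "duckdb_sec": 0.127, "spark_sec": 2.1},
    {"task": "yearly_rollup",  "duckdb_sec": 0.340, "spark_sec": 1.9},
]

# Relative runtime: how many times longer Spark took than DuckDB on the
# same task; a value above 1.0 means Spark was slower on that workload.
for r in rows:
    r["spark_vs_duckdb"] = r["spark_sec"] / r["duckdb_sec"]
    print(f'{r["task"]}: {r["spark_vs_duckdb"]:.1f}x DuckDB runtime')
```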

What this benchmark shows

DuckDB strength

Excellent for local, compact analytical workloads and fast iteration on exported Parquet or CSV data.

Spark strength

Better fit for large distributed processing, S3-backed data lakes, partitioned NOAA history, and full pipeline execution.

Project takeaway

The project uses both tools where they make sense: DuckDB for lean local analysis and Spark for scalable pipeline workloads.