Automatically download tpcds benchmark data to the right place #19244

alamb · 2025-12-09T18:05:48Z

Which issue does this PR close?

Closes problems running tpcds benchmark #19243

Rationale for this change

I want to be able to run the tpcds benchmarks added by @comphead as part of my benchmark
automation scripts. To do so I need to be able to run bench.sh data tpcds and have
it automatically generate the data if it is not present.

Right now the data generation step is manual.

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

For TPC-DS data generation, please clone the datafusion-benchmarks repository:
  git clone https://github.com/apache/datafusion-benchmarks

And I think it takes some more post processing steps (which is what @mbutrovich hit)

What changes are included in this PR?

Update the data setup portion to automatically download the contents from GitHub and extract them into the correct location.
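
The download-and-extract step described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function name `ensure_tpcds_data`, the `BENCH_REPO_SUBDIR` variable, and the location of the parquet files inside the archive are assumptions. Only the zip URL, the `tpcds_sf1` directory name, and the `web_site.parquet` check come from the diff discussed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: fetch TPC-DS data only when it is missing.
BENCH_ZIP_URL="https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip"
# Assumption: where the sf1 parquet files live inside the extracted archive.
BENCH_REPO_SUBDIR="${BENCH_REPO_SUBDIR:-datafusion-benchmarks-main/data/tpcds/sf1}"

ensure_tpcds_data() {
    local data_dir="$1"
    local tpcds_dir="${data_dir}/tpcds_sf1"
    local zip="${data_dir}/datafusion-benchmarks.zip"

    # web_site.parquet is used as the marker that the data is present.
    if [ -f "${tpcds_dir}/web_site.parquet" ]; then
        echo "TPC-DS data already present in ${tpcds_dir}"
        return 0
    fi

    mkdir -p "${tpcds_dir}" || return 1
    # Reuse a previously downloaded zip if there is one.
    if [ ! -f "${zip}" ]; then
        wget -O "${zip}" "${BENCH_ZIP_URL}" || return 1
    fi
    # Extract next to the zip, then copy the parquet files into place.
    unzip -q -o "${zip}" -d "${data_dir}" || return 1
    cp "${data_dir}/${BENCH_REPO_SUBDIR}"/*.parquet "${tpcds_dir}/" || return 1
}
```

Calling `ensure_tpcds_data "${DATA_DIR}"` twice should only download once, because the second call returns early when the marker file exists.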

Are these changes tested?

I tested this manually on my mac laptop by deleting the data directory and running the script again, and deleting the web_*.parquet files to ensure they are re-downloaded correctly.

./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds

I also tested on my benchmark machine (linux)

Are there any user-facing changes?

alamb · 2025-12-09T18:08:50Z

@mbutrovich any chance you can test this via:

./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds

?

comphead · 2025-12-09T18:48:29Z

I also regenerated the TPCDS data, and the number of queries returning 0 rows went down from 36 to 18
apache/datafusion-benchmarks#25

Considering that the data and the queries are generated independently, I think the queries are sometimes out of sync with the data, but it might still be okay to have queries returning 0 rows, as the computation still happens.

Anyway, I'm planning to modify the DF TPCDS queries a little to see if I can tweak the filters that were used during query generation so that they return non-zero data

comphead

Thanks @alamb I think it should work.

I didn't do automatic cloning for benchmarks because:

the user probably has their own set of data
depending on the location where the user starts the benchmarking, it can be a false positive on download, and the user will end up with multiple clones of the benchmarks repo on their local machine.

But I suppose this approach should work, thanks

alamb · 2025-12-09T20:19:26Z

Thanks @alamb I think it should work.

I didn't do automatic cloning for benchmarks because:

the user probably has their own set of data

depending on the location where the user starts the benchmarking, it can be a false positive on download, and the user will end up with multiple clones of the benchmarks repo on their local machine.

Yeah, this is definitely a real potential issue. One thing we could do is document how to link to an existing checkout

something like this maybe:

mkdir -p data
ln -s -f $EXISTING_CHECKOUT/data/tpcds/sf1 data/tpcds_sf

🤔
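
A slightly more defensive version of the linking idea above, as a sketch only. The helper name `link_existing_tpcds` is hypothetical, the source path inside the checkout is taken from the comment as written, and the link name `tpcds_sf1` is an assumption based on the `TPCDS_DIR` value in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reuse an existing datafusion-benchmarks checkout
# instead of downloading a second copy.
link_existing_tpcds() {
    local checkout="$1"   # path to an existing checkout
    local data_dir="$2"   # benchmark data directory, e.g. benchmarks/data
    # Assumption: sf1 data location inside the checkout, as in the comment above.
    local src="${checkout}/data/tpcds/sf1"

    if [ ! -d "${src}" ]; then
        echo "No TPC-DS data found at ${src}" >&2
        return 1
    fi
    # Resolve to an absolute path so the symlink works from any directory.
    src="$(cd "${src}" && pwd)" || return 1
    mkdir -p "${data_dir}" || return 1
    # -n replaces an existing symlink rather than creating a link inside it.
    ln -s -f -n "${src}" "${data_dir}/tpcds_sf1"
}
```

For example, `link_existing_tpcds "$HOME/src/datafusion-benchmarks" benchmarks/data` would make the existing data visible to the benchmark script without copying it.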

martin-g · 2025-12-10T13:48:12Z

benchmarks/bench.sh

-        echo "  git clone https://github.com/apache/datafusion-benchmarks"
-        echo ""
-        return 1
+    TPCDS_DIR="${DATA_DIR}/tpcds_sf1"


Move this to line 43 ?!
This way it will be defined once and reused by data_tpcds() and run_tpcds().

martin-g · 2025-12-10T13:49:19Z

benchmarks/bench.sh

+
+    # Check if `web_site.parquet` exists in the TPCDS data directory to verify data presence
+    echo "Checking TPC-DS data directory: ${TPCDS_DIR}"
+    if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then


Extract a variable for "${TPCDS_DIR}/web_site.parquet" at line 44 and reuse it in data_tpcds() and run_tpcds()?
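
The two suggestions above (define the directory once, and extract the marker-file path) might be combined as in this sketch. The variable name `TPCDS_MARKER_FILE` and the function bodies are illustrative assumptions, not the actual contents of bench.sh.

```shell
#!/usr/bin/env bash
# Sketch only: DATA_DIR defaults to ./data here; the real script sets it itself.
DATA_DIR="${DATA_DIR:-./data}"

# Defined once near the top of the script and shared by both functions.
TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
TPCDS_MARKER_FILE="${TPCDS_DIR}/web_site.parquet"   # hypothetical variable name

data_tpcds() {
    if [ -f "${TPCDS_MARKER_FILE}" ]; then
        echo "TPC-DS data already present in ${TPCDS_DIR}"
        return 0
    fi
    echo "TPC-DS data missing; it would be downloaded into ${TPCDS_DIR}"
    return 1   # placeholder for the actual download step
}

run_tpcds() {
    if [ ! -f "${TPCDS_MARKER_FILE}" ]; then
        echo "TPC-DS data not found in ${TPCDS_DIR}; run 'bench.sh data tpcds' first" >&2
        return 1
    fi
    echo "Running TPC-DS benchmark against ${TPCDS_DIR}"
}
```

With this layout a change to the directory name or the marker file only needs to be made in one place.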

martin-g · 2025-12-10T13:52:12Z

benchmarks/bench.sh

+        # Download the DataFusion benchmarks repository zip if it is not already downloaded
+        if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
+          echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
+          wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip


nit: Using git clone + git pull has the benefit of fetching the latest version of the datafusion-benchmarks repo. A copy fetched with wget+unzip may become stale at some point, and refreshing it will require deleting the $TPCDS_DIR folder manually.
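
The git-based alternative mentioned above could be sketched as follows. The function name `sync_benchmarks_repo` and the choice of a shallow clone are assumptions made for illustration; the PR itself uses wget and unzip.

```shell
#!/usr/bin/env bash
# Hypothetical alternative to wget+unzip: keep a git checkout up to date.
BENCH_REPO_URL="https://github.com/apache/datafusion-benchmarks"

sync_benchmarks_repo() {
    local dest="$1"   # where the checkout should live (assumed layout)
    if [ -d "${dest}/.git" ]; then
        # Existing clone: fast-forward to the latest upstream commit.
        git -C "${dest}" pull --ff-only
    else
        # First run: a shallow clone avoids downloading the full history.
        git clone --depth 1 "${BENCH_REPO_URL}" "${dest}"
    fi
}
```

The trade-off is that this requires git on the benchmark machine, whereas the zip approach only needs wget and unzip.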

martin-g · 2025-12-10T13:55:30Z

benchmarks/bench.sh

    exit 1
 }

 # Points to TPCDS data generation instructions


The comment is obsolete.
The function no longer points to instructions; it actually downloads the repo.

martin-g · 2025-12-10T13:57:20Z

benchmarks/bench.sh

+        # Download the DataFusion benchmarks repository zip if it is not already downloaded
+        if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
+          echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
+          wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip


Suggested change

-          wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
+          wget --timeout=30 --tries=3 -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
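
One related point: `wget -O` can leave an empty or truncated output file behind when the transfer fails, and a later "zip already exists" check would then skip the download. The sketch below wraps the suggested invocation so that only a completed download gets the final name. The function name `download_zip` and the `.partial` suffix are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the suggested wget invocation.
download_zip() {
    local url="$1"
    local dest="$2"
    local tmp="${dest}.partial"

    # Download to a temporary name so that a failed or interrupted transfer
    # never leaves a file that looks like a complete zip.
    if wget --timeout=30 --tries=3 -O "${tmp}" "${url}"; then
        mv "${tmp}" "${dest}"
    else
        rm -f "${tmp}"
        echo "Download failed: ${url}" >&2
        return 1
    fi
}
```

It would be called with the same URL and destination as the line in the diff, for example `download_zip "$URL" "${DATA_DIR}/datafusion-benchmarks.zip"`.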

comphead · 2025-12-10T18:02:30Z

Thanks @martin-g and @alamb. We can probably hardcode the path: in most cases people have cloned the benchmarks repo inside the DF repo, so we can keep this path hardcoded with ${SCRIPTS_DIR} as the anchor. If this path is empty we can fall back to DATA_DIR, and if that fails too, throw an error. I'll try to play with it today.
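
The lookup order proposed above (a path anchored at the scripts directory, then DATA_DIR, then an error) might look like this sketch. The function name `resolve_tpcds_dir` and the relative path in `TPCDS_REL_PATH` are assumptions, since the comment does not specify the hardcoded path.

```shell
#!/usr/bin/env bash
# Hypothetical lookup order for the TPC-DS data directory.
# Assumption: relative location of the data inside a benchmarks checkout.
TPCDS_REL_PATH="datafusion-benchmarks/data/tpcds/sf1"

resolve_tpcds_dir() {
    local scripts_dir="$1"
    local data_dir="$2"
    local candidate

    for candidate in "${scripts_dir}/${TPCDS_REL_PATH}" "${data_dir}/tpcds_sf1"; do
        # A candidate counts only if it actually contains the marker file.
        if [ -f "${candidate}/web_site.parquet" ]; then
            echo "${candidate}"
            return 0
        fi
    done
    echo "TPC-DS data not found under ${scripts_dir} or ${data_dir}" >&2
    return 1
}
```

A caller could then write `TPCDS_DIR="$(resolve_tpcds_dir "${SCRIPTS_DIR}" "${DATA_DIR}")" || exit 1` and use the result for both the data and run commands.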

alamb added 2 commits December 9, 2025 12:58

Automatically download tpcds data

448d2df

cleanup

8adbf6f

alamb changed the title ~~Automatically download tpcds data~~ Automatically download tpcds data for benchmarks to the right place Dec 9, 2025

alamb marked this pull request as ready for review December 9, 2025 18:08

alamb requested review from comphead and mbutrovich December 9, 2025 18:08

alamb changed the title ~~Automatically download tpcds data for benchmarks to the right place~~ Automatically download tpcds benchmark data to the right place Dec 9, 2025

comphead approved these changes Dec 9, 2025

martin-g reviewed Dec 10, 2025
