Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Dec 9, 2025

Which issue does this PR close?

Rationale for this change

I want to be able to run tpcdb benchmarks added by @comphead as part of my benchmark
automation scripts. To do so I need to be able to run bench.sh data tpchds and have
it automatically generate the data if it is not present.

Right now the data generation step is manual.

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

For TPC-DS data generation, please clone the datafusion-benchmarks repository:
  git clone https://github.com/apache/datafusion-benchmarks

And I think it takes some more post processing steps (which is what @mbutrovich hit)

What changes are included in this PR?

  1. Update the data setup portion to automatically download the contents from github and extract it in the correct location

Are these changes tested?

I tested this manually on my mac laptop by deleting the data directory and running the script again, and deleting the web_*.parquet files to ensure they are re-downloaded correctly.

./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds

I also tested on my benchmark machine (linux)

Are there any user-facing changes?

@alamb alamb changed the title Automatically download tpcds data Automatically download tpcds data for benchmarks to the right place Dec 9, 2025
@alamb alamb marked this pull request as ready for review December 9, 2025 18:08
@alamb alamb requested review from comphead and mbutrovich December 9, 2025 18:08
@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2025

@mbutrovich any chance you can test this via:

./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds

?

@alamb alamb changed the title Automatically download tpcds data for benchmarks to the right place Automatically download tpcds benchmark data to the right place Dec 9, 2025
@comphead
Copy link
Contributor

comphead commented Dec 9, 2025

I also regenerated TPCDS data, and the number of 0 rows queries went down from 36 to 18
apache/datafusion-benchmarks#25

and considering independent nature of generating data and queries I think sometimes queries are out of sync, but also it might be still okay of having 0 rows as the computation still happens.

Anyway Im planning to modify DF TPCDS queries a little bit to see if I can tweak filters which exploited during query generation and make it return non-zero data

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb I think it should work.

I didn't do automatic cloning for benchmarks because:

  • the user prob have its own set of data
  • depending the location where user starts the benchmarking it can be a false positive on download and user will end up with multiple clones of benchmarks on his local machine.

But I suppose this approach should work, thanks

@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2025

Thanks @alamb I think it should work.

I didn't do automatic cloning for benchmarks because:

  • the user prob have its own set of data
  • depending the location where user starts the benchmarking it can be a false positive on download and user will end up with multiple clones of benchmarks on his local machine.

Yeah, this is definitely a real potential issue. One thing we could do is document how to link to an existing checkout

something like this maybe:

mkdir -p data
ln -s -f $EXISITING_CHECKOUT/data/tpcds/sf1 data/tpcds_sf`

🤔

echo " git clone https://github.com/apache/datafusion-benchmarks"
echo ""
return 1
TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to line 43 ?!
This way it will be defined once and reused by data_tpcds() and run_tpcds().


# Check if `web_site.parquet` exists in the TPCDS data directory to verify data presence
echo "Checking TPC-DS data directory: ${TPCDS_DIR}"
if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract a variable for "${TPCDS_DIR}/web_site.parquet" at line 44 ? And reuse it in data_tpcds() and run_tpcds()

# Download the DataFusion benchmarks repository zip if it is not already downloaded
if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Using git clone + git pull has the benefit of fetching the latest version of the datafusion-benchmarks repo. Using wget+unzip may become stale at some point and will require deleting the $TPCDS_DIR folder manually.

exit 1
}

# Points to TPCDS data generation instructions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is obsolete.
The method does not point anymore, it actually downloads the repo.

# Download the DataFusion benchmarks repository zip if it is not already downloaded
if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
wget --timeout=30 --tries=3 -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip

@comphead
Copy link
Contributor

Thanks @martin-g and @alamb we can probably hardcode path, in most cases people cloned the benchmarks repo inside DF repo, so we can keep this path hardcoded having ${SCRIPTS_DIR} as anchor, if this path is empty we can fallback to DATA_DIR and if this fails again then throw an error, I'll try to play with it today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

problems running tpcds benchmark

3 participants