Automatically download tpcds benchmark data to the right place #19244
Conversation
@mbutrovich any chance you can test this via:

    ./benchmarks/bench.sh data tpcds
    ./benchmarks/bench.sh run tpcds
I also regenerated the TPC-DS data, and the number of queries returning 0 rows went down from 36 to 18. Given the independent nature of generating the data and the queries, I think they are sometimes out of sync, but it might also still be okay to return 0 rows since the computation still happens. Anyway, I'm planning to modify the DataFusion TPC-DS queries a little to see if I can tweak the filters that were exploited during query generation and make them return non-zero data.
comphead
left a comment
Thanks @alamb, I think it should work.
I didn't do automatic cloning for benchmarks because:
- the user probably has their own set of data
- depending on where the user starts the benchmarking, the download check can be a false positive and the user will end up with multiple clones of the benchmarks on their local machine.
But I suppose this approach should work, thanks.
Yeah, this is definitely a real potential issue. One thing we could do is document how to link to an existing checkout, something like this maybe:

    mkdir -p data
    ln -s -f $EXISTING_CHECKOUT/data/tpcds/sf1 data/tpcds_sf

🤔
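The workaround above could be wrapped in a small helper so it can be documented and reused. This is a hypothetical sketch, not part of the PR: the function name `link_existing_tpcds` is invented here, and the `tpcds_sf1` link name is an assumption based on the `TPCDS_DIR` path discussed elsewhere in this review.

```shell
# Hypothetical helper: symlink TPC-DS data from an existing checkout into
# this repo's data directory instead of regenerating or re-downloading it.
link_existing_tpcds() {
    existing_checkout="${1:?path to existing checkout required}"
    mkdir -p data
    # -f replaces any stale link from a previous run.
    ln -s -f "${existing_checkout}/data/tpcds/sf1" data/tpcds_sf1
}
```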
    echo "  git clone https://github.com/apache/datafusion-benchmarks"
    echo ""
    return 1

    TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
Move this to line 43?
This way it will be defined once and reused by data_tpcds() and run_tpcds().
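The suggestion above could look roughly like this. A hypothetical sketch only: `DATA_DIR`, the default path, and the function bodies are illustrative stand-ins for what bench.sh actually does, meant to show `TPCDS_DIR` being defined once and shared.

```shell
# Define the path once, near the script's other globals.
DATA_DIR="${DATA_DIR:-./data}"
TPCDS_DIR="${DATA_DIR}/tpcds_sf1"

# Both functions then reuse the same variable instead of redefining it.
data_tpcds() {
    echo "Generating TPC-DS data in ${TPCDS_DIR}"
}

run_tpcds() {
    echo "Running TPC-DS queries against ${TPCDS_DIR}"
}
```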
    # Check if `web_site.parquet` exists in the TPCDS data directory to verify data presence
    echo "Checking TPC-DS data directory: ${TPCDS_DIR}"
    if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then
Extract a variable for "${TPCDS_DIR}/web_site.parquet" at line 44? And reuse it in data_tpcds() and run_tpcds().
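Concretely, the extraction could look like this. A hypothetical sketch: the variable name `TPCDS_MARKER_FILE` and the `tpcds_data_present` helper are invented here for illustration; only the paths come from the diff above.

```shell
# Name the sentinel file once so every caller checks the same path.
DATA_DIR="${DATA_DIR:-./data}"
TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
TPCDS_MARKER_FILE="${TPCDS_DIR}/web_site.parquet"

# Shared check used by both data_tpcds() and run_tpcds().
tpcds_data_present() {
    [ -f "${TPCDS_MARKER_FILE}" ]
}
```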
    # Download the DataFusion benchmarks repository zip if it is not already downloaded
    if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
        echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
        wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
nit: Using git clone + git pull has the benefit of fetching the latest version of the datafusion-benchmarks repo. Using wget+unzip may become stale at some point and will require deleting the $TPCDS_DIR folder manually.
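The clone-or-pull pattern the reviewer describes could be sketched like this, under stated assumptions: the function name and checkout directory are illustrative, and the `--ff-only` choice is one way to keep a read-only mirror current, not something from the PR.

```shell
# Hypothetical: keep a git checkout so re-running the script picks up the
# latest datafusion-benchmarks, instead of a downloaded zip that goes stale.
clone_or_update_benchmarks() {
    repo_url="https://github.com/apache/datafusion-benchmarks"
    checkout_dir="${1:?checkout directory required}"
    if [ -d "${checkout_dir}/.git" ]; then
        # Existing clone: fast-forward to the latest upstream state.
        git -C "${checkout_dir}" pull --ff-only
    else
        # First run: clone the repository fresh.
        git clone "${repo_url}" "${checkout_dir}"
    fi
}
```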
        exit 1
    }
    # Points to TPCDS data generation instructions
The comment is obsolete.
The method no longer just points to instructions; it actually downloads the repo.
    # Download the DataFusion benchmarks repository zip if it is not already downloaded
    if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
        echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
        wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
Suggested change:

    -wget -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
    +wget --timeout=30 --tries=3 -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
Thanks @martin-g and @alamb. We can probably hardcode the path; in most cases people clone the benchmarks repo inside the DF repo, so we can keep this path hardcoded.
Which issue does this PR close?
Rationale for this change
I want to be able to run the TPC-DS benchmarks added by @comphead as part of my benchmark
automation scripts. To do so I need to be able to run
`bench.sh data tpcds` and have it automatically generate the data if it is not present.
Right now the data generation step is manual.
And I think it takes some more post-processing steps (which is what @mbutrovich hit).
What changes are included in this PR?
Are these changes tested?
I tested this manually on my Mac laptop by deleting the data directory and running the script again, and by deleting the web_*.parquet files to ensure they are re-downloaded correctly.
I also tested on my benchmark machine (Linux).
Are there any user-facing changes?