
Default to collecting statistics when creating ListingTables #16158

Open

alamb opened this issue May 22, 2025 · 3 comments

Comments

@alamb
Contributor

alamb commented May 22, 2025

Today, when creating tables of parquet files using CREATE EXTERNAL TABLE or ListingTables, statistics are not gathered.

This is good in the sense that creating the table is faster, but subsequent queries may be slower because the planner has no statistics to work with.

The behavior is clarified in

@davisp suggests that defaulting to collecting statistics would make more sense (and I agree):

I’ll also note that my personal preference would be to default to true purely because it took a surprising amount of work to figure out how to even report #15908 not knowing that statistics collection was a config option. I do see the rationale around the behavior change, though I’d say either way that flag is defaulted is a behavior change and true seems like a saner default.

Originally posted by @davisp in #16080 (review)
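
For reference, a minimal sketch of where that knob lives when a ListingTable is built programmatically, assuming the `ListingOptions` / `ListingTable` builder API; the path and table name are placeholders:

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Placeholder: a local directory of parquet files.
    let table_path = ListingTableUrl::parse("./data/lineitem/")?;

    // `with_collect_stat(true)` asks the listing table to gather file
    // statistics up front instead of skipping them.
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_collect_stat(true);

    let schema = options.infer_schema(&ctx.state(), &table_path).await?;

    let config = ListingTableConfig::new(table_path)
        .with_listing_options(options)
        .with_schema(schema);

    ctx.register_table("lineitem", Arc::new(ListingTable::try_new(config)?))?;
    Ok(())
}
```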

@alamb
Contributor Author

alamb commented May 22, 2025

I suggest we change the default value of datafusion.execution.collect_statistics so that statistics are collected on table creation.
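
For anyone who wants the proposed behavior today, the option can already be flipped per session. A rough sketch, assuming the `SessionConfig::with_collect_statistics` builder and a placeholder parquet location:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt in explicitly; the proposal is for `true` to become the default.
    let config = SessionConfig::new().with_collect_statistics(true);
    let ctx = SessionContext::new_with_config(config);

    // The same option can be toggled from SQL:
    //   SET datafusion.execution.collect_statistics = true;
    ctx.sql(
        "CREATE EXTERNAL TABLE lineitem \
         STORED AS PARQUET \
         LOCATION './data/lineitem/'", // placeholder location
    )
    .await?;

    Ok(())
}
```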

@brayanjuls
Contributor

take

@davisp
Member

davisp commented May 25, 2025

Registering my official +1 to default to collecting statistics.

For reference, I was working on the TPC-H benchmarks at scale factor 20, which generates roughly 20GiB of CSV data to process. Without statistics, query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics it never uses more than 1.9GiB and finishes in about 5s.

I generally agree that statistics collection is probably a bit slower than not collecting, but my guess is that anyone who notices and/or cares is doing something like high-frequency queries against tiny datasets, which seems like a niche use case. However, I'd wager a minute amount of money that the benefits outweigh the cost somewhere around the 500MiB to 1GiB range (based on my assumption that statistics collection is just scanning row groups, not doing full table scans).
