
Default to collecting statistics when creating ListingTables #16158

Open

alamb opened this issue May 22, 2025 · 3 comments

Comments

@alamb
Contributor

alamb commented May 22, 2025

Today, when creating tables of parquet files using CREATE EXTERNAL TABLE or ListingTables, statistics are not gathered.

This is good in the sense that creating the table is faster, but subsequent queries may be slower because the planner has no statistics to work with.

The behavior is clarified in

@davisp suggests that defaulting to collecting statistics would make more sense (and I agree):

I’ll also note that my personal preference would be to default to true purely because it took a surprising amount of work to figure out how to even report #15908 not knowing that statistics collection was a config option. I do see the rationale around the behavior change, though I’d say either way that flag is defaulted is a behavior change and true seems like a saner default.

Originally posted by @davisp in #16080 (review)
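
For reference, a minimal sketch of where that knob lives when a ListingTable is built programmatically, assuming the `ListingOptions` / `ListingTable` builder API; the path and table name are placeholders:

```rust
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{
    ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
};
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Placeholder: a local directory of parquet files.
    let table_path = ListingTableUrl::parse("./data/lineitem/")?;

    // `with_collect_stat(true)` asks the listing table to gather file
    // statistics up front instead of skipping them.
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()))
        .with_collect_stat(true);

    let schema = options.infer_schema(&ctx.state(), &table_path).await?;

    let config = ListingTableConfig::new(table_path)
        .with_listing_options(options)
        .with_schema(schema);

    ctx.register_table("lineitem", Arc::new(ListingTable::try_new(config)?))?;
    Ok(())
}
```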

@alamb
Contributor Author

alamb commented May 22, 2025

I suggest we change the default value of datafusion.execution.collect_statistics so that statistics are collected on table creation.
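
For anyone who wants the proposed behavior today, the option can already be flipped per session. A rough sketch, assuming the `SessionConfig::with_collect_statistics` builder and a placeholder parquet location:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Opt in explicitly; the proposal is for `true` to become the default.
    let config = SessionConfig::new().with_collect_statistics(true);
    let ctx = SessionContext::new_with_config(config);

    // The same option can be toggled from SQL:
    //   SET datafusion.execution.collect_statistics = true;
    ctx.sql(
        "CREATE EXTERNAL TABLE lineitem \
         STORED AS PARQUET \
         LOCATION './data/lineitem/'", // placeholder location
    )
    .await?;

    Ok(())
}
```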

@brayanjuls
Contributor

take

@davisp
Member

davisp commented May 25, 2025

Registering my official +1 to default to collecting statistics.

For reference, I was working on the TPC-H benchmarks at scale factor 20, which generates roughly 20GiB of CSV data to process. Without statistics, query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics it never uses more than 1.9GiB and finishes in about 5s.

I generally agree that statistics collection is probably a bit slower than not collecting, but my guess is that anyone who notices and/or cares is doing something like high-frequency queries against tiny datasets, which seems like a niche use case. However, I'd wager a minute amount of money that the benefits outweigh the cost somewhere around the 500MiB to 1GiB range (based on my assumption that statistics collection is just scanning row groups, not doing full table scans).
