@davisp suggests that defaulting to collecting statistics would make more sense (and I agree):
I’ll also note that my personal preference would be to default to true purely because it took a surprising amount of work to figure out how to even report #15908 not knowing that statistics collection was a config option. I do see the rationale around the behavior change, though I’d say either way that flag is defaulted is a behavior change and true seems like a saner default.
Registering my official +1 to default to collecting statistics.
For reference, I was working on the TPC-H benchmarks with a scale factor of 20 which generates roughly 20GiB of CSV data to process. Without statistics, query 17 or 18 would OOM on a 32GiB machine after about 20s. With statistics it never uses more than 1.9G and finishes in about 5s.
I generally agree that statistics collection is probably a bit slower than not collecting, but my guess is that any folks that notice and/or care would be doing something like high frequency queries against tiny datasets which seems like a niche use case. However, I’d wager a minute amount of money that the benefits would be worth the cost around the 500MiB to 1GiB range (based on my assumption that statistics collection is just scanning row groups, not full table scans).
Today, when creating tables of parquet files using CREATE EXTERNAL TABLE or ListingTables, statistics are not gathered. This is good in the sense that creating the table is fast(er), but subsequent queries might be slower.
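For anyone hitting this today, a minimal sketch of opting in to statistics collection from SQL. It assumes the config option is named `datafusion.execution.collect_statistics` (my reading of the DataFusion configuration settings; the name is not spelled out in this issue), and the table name and path are hypothetical. As far as I understand, the setting only affects tables created after it is set, so it has to come before the `CREATE EXTERNAL TABLE`:

```sql
-- Assumption: option name as listed in DataFusion's configuration settings.
-- Set it BEFORE creating the table; it is believed to have no effect on
-- tables that already exist.
SET datafusion.execution.collect_statistics = true;

-- Hypothetical table name and location, for illustration only.
CREATE EXTERNAL TABLE lineitem
STORED AS PARQUET
LOCATION '/path/to/lineitem/';
```

With the option left at its current default, the same `CREATE EXTERNAL TABLE` returns sooner but later queries plan without file statistics.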
The behavior is clarified in
Originally posted by @davisp in #16080 (review)