Description
Is your feature request related to a problem or challenge?
While testing #16300, I (re-noticed) that datafusion-cli is really slow when reading remote files
For example,
The initial table creation takes 7.5 seconds on my pretty crappy connection:
DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 7.492 seconds.
However then simple queries just to get the count take 8-10 seconds 😱
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 8.945 seconds.
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 10.456 seconds.
I am almost certain this delay is due to having to read the footers of the parquet files for each query.
Note the speed is much faster when collect_statistics
is on
> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.012 seconds.
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 7.770 seconds.
The first query is still slow for some reason: (5 seconds)
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 5.114 seconds.
But subsequent queries are quite fast:
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 0.297 seconds.
Describe the solution you'd like
I would like:
datafusion-cli
to be faster for such queriesdatafusion-cli
to be an easy to follow model for how to cache metadata when working with ListingTabe that others who build with DataFusion could follow
Describe alternatives you've considered
I think the obvious thing that is needed is a cache for the ParquetMetadata
I think the actual cache should be in datafusion-cli
but NOT in the datafusion core crate as I think what and how to cache will be different across systems.
What I envision is:
- APIs in the ListingTable / RuntimeEnv / etc for adding caching of ParquetMetadata
- An implementation of those APIs in
datafusion-cli
- Bonus Points: documentation / examples that show how to use those APIs in other system
There is some vestigal code in the cache_manager crate that I think could provide a home for such caching APIs: * https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/index.html
Additional context
Related issues