Improve performance of `datafusion-cli` when reading from remote storage

### Is your feature request related to a problem or challenge?

- Part of https://github.com/apache/datafusion/pull/16300/files

While testing https://github.com/apache/datafusion/pull/16300, I (re-noticed) that datafusion-cli is really slow when reading remote files

For example, 

The initial table creation takes 7.5 seconds on my pretty crappy connection:
```shell
DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';

0 row(s) fetched.
Elapsed 7.492 seconds.
```

However then simple queries just to get the count take 8-10 seconds 😱 

```sql
> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 8.945 seconds.
```

```sql
> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 10.456 seconds.
```

I am almost certain this delay is due to having to read the footers of the parquet files for each query. 

Note the speed is much faster when `collect_statistics` is on

```sl
> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.012 seconds.

> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 7.770 seconds.
```

The first query is still slow for some reason: (5 seconds)
```sql
> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 5.114 seconds.
```

But subsequent queries are quite fast:

```sql
> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 0.297 seconds.
```

### Describe the solution you'd like

I would like:
1. `datafusion-cli` to be faster for such queries
2. `datafusion-cli` to be an easy to follow model for how to cache metadata when working with ListingTabe that others who build with DataFusion could follow

### Describe alternatives you've considered

I think the obvious thing that is needed is a cache for the ParquetMetadata

I think the actual cache should be in `datafusion-cli` but NOT in the datafusion core crate as I think what and how to cache will be different across systems. 

What I envision is:
1. APIs in the ListingTable / RuntimeEnv / etc for adding caching of ParquetMetadata
2. An implementation of those APIs in `datafusion-cli`
3. Bonus Points: documentation / examples that show how to use those APIs in other system


There is some vestigal code in the cache_manager crate that I think could provide a home for such caching APIs: * https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/index.html



### Additional context

Related issues
- https://github.com/apache/datafusion/issues/15585
- https://github.com/apache/datafusion/issues/15582

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance of `datafusion-cli` when reading from remote storage #16365

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve performance of datafusion-cli when reading from remote storage #16365

Description

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Improve performance of `datafusion-cli` when reading from remote storage #16365