Skip to content

Improve performance of datafusion-cli when reading from remote storage #16365

Open
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

While testing #16300, I (re-noticed) that datafusion-cli is really slow when reading remote files

For example,

The initial table creation takes 7.5 seconds on my pretty crappy connection:

DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';

0 row(s) fetched.
Elapsed 7.492 seconds.

However then simple queries just to get the count take 8-10 seconds 😱

> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 8.945 seconds.
> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 10.456 seconds.

I am almost certain this delay is due to having to read the footers of the parquet files for each query.

Note the speed is much faster when collect_statistics is on

> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.012 seconds.

> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 7.770 seconds.

The first query is still slow for some reason: (5 seconds)

> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 5.114 seconds.

But subsequent queries are quite fast:

> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 0.297 seconds.

Describe the solution you'd like

I would like:

  1. datafusion-cli to be faster for such queries
  2. datafusion-cli to be an easy to follow model for how to cache metadata when working with ListingTabe that others who build with DataFusion could follow

Describe alternatives you've considered

I think the obvious thing that is needed is a cache for the ParquetMetadata

I think the actual cache should be in datafusion-cli but NOT in the datafusion core crate as I think what and how to cache will be different across systems.

What I envision is:

  1. APIs in the ListingTable / RuntimeEnv / etc for adding caching of ParquetMetadata
  2. An implementation of those APIs in datafusion-cli
  3. Bonus Points: documentation / examples that show how to use those APIs in other system

There is some vestigal code in the cache_manager crate that I think could provide a home for such caching APIs: * https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/index.html

Additional context

Related issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions