Improved experience when remote object store URL does not end in / #16302

Open
@alamb

Description

Is your feature request related to a problem or challenge?

I would like querying files from remote object stores to be easy and a great experience in DataFusion, and in datafusion-cli in particular.

While testing #16300, I tried this command:

datafusion-cli
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet';
Object Store error: Object at location nyc_taxi_rides/data/tripdata_parquet not found: Error performing HEAD https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet in 142.679833ms - Server returned non-2xx status code: 404 Not Found:

This confused me for quite a while, as that is a valid URL (prefix).

The issue is that the URL 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet' does not end in a /. If you add a trailing /, it works great:

> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 1.624 seconds.

BTW this is inconsistent with the local file system, where selecting from a directory whose path doesn't end in a / works just fine:

-- Write data to `foo` directory:
> copy (values(1)) to 'foo/1.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.044 seconds.

-- Note the location doesn't end in `/` but it works fine
> create external table foo stored as parquet location 'foo';
0 row(s) fetched.
Elapsed 0.022 seconds.

> select * from foo;
+---------+
| column1 |
+---------+
| 1       |
+---------+
1 row(s) fetched.
Elapsed 0.132 seconds.

Describe the solution you'd like

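I would like this to be less confusing: a location that names a directory (prefix) on a remote object store should either work without the trailing /, or fail with an error that explains what to change.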

Describe alternatives you've considered

Alternate 1: Better Error Message

At the very least, we can make the error message more explicit (for example, "Not found. Hint: if it is a directory the path should end with /").
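
To make the idea concrete, here is a minimal sketch of how such a hint could be attached to a "not found" message. The function names `not_found_hint` and `format_not_found` are hypothetical and are not part of DataFusion or the object_store crate; the real change would live wherever the error is surfaced.

```rust
/// Hypothetical helper (not a DataFusion API): returns a hint only when the
/// location does not already end in '/', since that is the confusing case.
fn not_found_hint(location: &str) -> Option<String> {
    if location.ends_with('/') {
        None
    } else {
        Some(format!(
            "Hint: if '{}' is a directory (prefix), the path should end with '/'",
            location
        ))
    }
}

/// Hypothetical formatter: builds the user-facing message, appending the
/// hint when one applies.
fn format_not_found(location: &str) -> String {
    match not_found_hint(location) {
        Some(hint) => format!("Object at location {} not found. {}", location, hint),
        None => format!("Object at location {} not found.", location),
    }
}
```

With this sketch, a location without a trailing / gets the hint appended, while a location that already ends in / produces the plain message.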

Alternate 2 (Preferred): Retry With a Trailing /

It would be even better to automatically add a / to the path if the original location was not found, and then try again.

I think the trick will be to figure out at what level we should try to add / (probably when first creating the ListingTable?)
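
The retry logic itself is small; a rough sketch, independent of where it ends up living, might look like the following. The name `resolve_location` and the `exists` callback are hypothetical: `exists` stands in for whatever check the real code would perform against the object store, and nothing here is an existing DataFusion API.

```rust
/// Hypothetical sketch (not a DataFusion API): try the location as given,
/// and if it is not found and does not end in '/', retry with '/' appended.
/// `exists` is a stand-in for the real object store lookup.
fn resolve_location<F>(location: &str, exists: F) -> Option<String>
where
    F: Fn(&str) -> bool,
{
    // First attempt: use the location exactly as the user wrote it.
    if exists(location) {
        return Some(location.to_string());
    }
    // Second attempt: treat the location as a directory (prefix).
    if !location.ends_with('/') {
        let with_slash = format!("{}/", location);
        if exists(&with_slash) {
            return Some(with_slash);
        }
    }
    // Neither form was found; the caller reports the original error.
    None
}
```

One trade-off of this approach is an extra request to the store whenever the first lookup fails, which only affects the error path.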

Additional context

No response

Labels

enhancement (New feature or request), help wanted (Extra attention is needed)
