Description
This proposes further work that I'd like to perform on creating reusable logical relations. It also helps identify the relations we would need with Substrait.
Delta Find Files
Purpose: Identify files that contain records that satisfy a predicate.
This relation will generate a record batch stream with a single column called path. Each path value will then map to an Add action in the Delta table.
This relation will also maintain a list of files that satisfy the predicate which can be passed sideways to relations downstream.
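A minimal sketch of this behaviour, using only the Rust standard library. Everything here is an assumption for illustration: `DataFile`, `MatchedFiles`, and `find_files` are hypothetical names, a channel stands in for the record batch stream, and rows are plain `i64` values rather than Arrow batches. It is not the delta-rs or DataFusion API.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a data file referenced by an Add action.
#[derive(Debug, Clone)]
pub struct DataFile {
    pub path: String,
    pub values: Vec<i64>,
}

/// Shared list of matched paths that downstream relations can read "sideways".
pub type MatchedFiles = Arc<Mutex<Vec<String>>>;

/// Emits the path of every file holding at least one row that satisfies
/// `predicate`, and records the same paths in a shared list.
pub fn find_files<P>(files: Vec<DataFile>, predicate: P) -> (Receiver<String>, MatchedFiles)
where
    P: Fn(i64) -> bool + Send + 'static,
{
    let (tx, rx) = channel();
    let matched: MatchedFiles = Arc::new(Mutex::new(Vec::new()));
    let side = Arc::clone(&matched);
    thread::spawn(move || {
        for file in files {
            if file.values.iter().any(|v| predicate(*v)) {
                // Record the match before emitting it, so the side list is
                // never behind the stream.
                side.lock().unwrap().push(file.path.clone());
                if tx.send(file.path).is_err() {
                    break; // the consumer went away, stop searching
                }
            }
        }
    });
    (rx, matched)
}
```

The receiver plays the role of the single-column path stream, while the shared list is what a downstream relation such as Delta Delete would read.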
Delta Scan
Purpose: Scan the Delta Table
Update DeltaScan to take an optional input stream that contains paths of files to be scanned. This will enable DeltaScan to consume the output of DeltaFindFile.
Currently when using find files, we must wait for the entire operation to complete and then we build the scan. The change enables Delta Scan to start when the first candidate file is identified.
I think this will require some significant work since it will involve refactoring the current DeltaScan implementation.
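To make the intended behaviour concrete, here is a rough sketch under simplifying assumptions: the table is a hypothetical in-memory map from file path to rows, the path stream is a standard library channel, and `delta_scan` and `Table` are illustrative names, not the existing DeltaScan implementation.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::Receiver;

/// Hypothetical in-memory table: file path -> rows stored in that file.
pub type Table = BTreeMap<String, Vec<i64>>;

/// Scans the table lazily. With `Some(paths)`, only the files named on the
/// stream are read, each one as soon as its path arrives. With `None`, every
/// file in the table is read.
pub fn delta_scan<'a>(
    table: &'a Table,
    paths: Option<Receiver<String>>,
) -> Box<dyn Iterator<Item = (String, Vec<i64>)> + 'a> {
    match paths {
        Some(rx) => Box::new(rx.into_iter().filter_map(move |path| {
            // Paths unknown to the table are skipped in this sketch.
            let rows = table.get(&path)?.clone();
            Some((path, rows))
        })),
        None => Box::new(
            table
                .iter()
                .map(|(path, rows)| (path.clone(), rows.clone())),
        ),
    }
}
```

Because the iterator pulls from the channel one path at a time, the first file can be scanned while the upstream search is still running, which is the point of the change.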
Delta Write
Purpose: Write records to storage, conflict resolution, and commit creation
Takes a single input stream of data that matches the table's schema and creates Add actions for each new file.
Information can be passed sideways to include additional Delta actions to add to the commit. E.g., DeltaDelete can provide a stream of Remove actions.
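A simplified sketch of how the commit could be assembled. The `Action` enum, the `delta_write` function, and the `part-NNNNN.parquet` naming are assumptions made for illustration. No storage is touched and conflict resolution is left out entirely.

```rust
/// Hypothetical, heavily reduced view of Delta actions.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Add { path: String, num_records: usize },
    Remove { path: String },
}

/// Turns each non-empty input batch into one Add action, then appends the
/// actions that other relations passed sideways (for example Remove actions
/// from a delete). The result is the list of actions for a single commit.
pub fn delta_write(
    batches: impl IntoIterator<Item = Vec<i64>>,
    sideways: impl IntoIterator<Item = Action>,
) -> Vec<Action> {
    let mut commit: Vec<Action> = batches
        .into_iter()
        .filter(|batch| !batch.is_empty())
        .enumerate()
        .map(|(i, batch)| Action::Add {
            // Assumed file naming, one file per batch.
            path: format!("part-{:05}.parquet", i),
            num_records: batch.len(),
        })
        .collect();
    commit.extend(sideways);
    commit
}
```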
Delta Delete
Purpose: Delete Records from the table.
Given a predicate, delete the matching records from the Delta table.
Delta Delete can take an optional stream of records and will output records that do NOT satisfy the predicate.
It will maintain a stream of Remove actions that can be passed sideways to other operations downstream.
The input stream is optional since there are cases where delete can determine which files to remove without needing a scan. An optimization phase can help determine when this is the case.
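The two modes can be sketched roughly as follows. `DeleteOutput` and `delta_delete` are hypothetical names, rows are plain `i64` values, and the paths to remove are assumed to come from the files-matched list maintained by Delta Find Files.

```rust
/// What a delete relation hands on in this sketch: the surviving rows for the
/// writer, and the paths that should become Remove actions.
pub struct DeleteOutput {
    pub kept: Vec<i64>,
    pub removed_paths: Vec<String>,
}

/// `matched_files` is the list passed sideways by the file search.
/// `input` is the optional stream of scanned batches. When it is `None`,
/// whole files are dropped and nothing needs to be rewritten.
pub fn delta_delete<P>(
    matched_files: &[String],
    input: Option<Vec<Vec<i64>>>,
    predicate: P,
) -> DeleteOutput
where
    P: Fn(i64) -> bool,
{
    let kept: Vec<i64> = input
        .into_iter()
        .flatten() // batches, if an input stream was provided
        .flatten() // rows within each batch
        .filter(|row| !predicate(*row)) // keep rows that do NOT match
        .collect();
    DeleteOutput {
        kept,
        removed_paths: matched_files.to_vec(),
    }
}
```

With an input stream, the kept rows are rewritten by the writer and the original files are removed. Without one, only the Remove actions are produced.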
Diagram
High-level diagram of how these relations will connect:
            ┌───────────────────────┐
            │   Delta Find Files    │
            │                       │
            │ Predicate:            │
        ┌───┤ Version:              │
        │   │                       │
        │   └──────────┬────────────┘
        │              │
        │              ▼
 Files  │   ┌───────────────────────┐
Matched │   │      Delta Scan       │
 List   │   │                       │
        │   │ Version:              │
        │   │                       │
        │   │                       │
        │   └──────────┬────────────┘
        │              │
        │              ▼
        │   ┌───────────────────────┐
        └──►│     Delta Delete      │
            │                       │
            │ Predicate:            │
        ┌───┤                       │
        │   └──────────┬────────────┘
 Remove │              │
Actions │              ▼
        │   ┌───────────────────────┐
        │   │      Delta Write      │
        └──►│                       │
            │                       │
            │                       │
            └───────────────────────┘
Converting the ReplaceWhere operation to a logical view can look something like this:
            ┌───────────────────────┐
            │   Delta Find Files    │
            │                       │
            │ Predicate:            │
        ┌───┤ Version:              │
        │   │                       │
        │   └──────────┬────────────┘
        │              │
        │              ▼                   ┌────────────────────────────┐
 Files  │   ┌───────────────────────┐      │        Data Source         │
Matched │   │      Delta Scan       │      │                            │
 List   │   │                       │      │                            │
        │   │ Version:              │      └────────────┬───────────────┘
        │   │                       │                   │
        │   │                       │                   ▼
        │   └──────────┬────────────┘      ┌────────────────────────────┐
        │              │                   │   Delta Constraint Check   │
        │              ▼                   │                            │
        │   ┌───────────────────────┐      └────────────┬───────────────┘
        └──►│     Delta Delete      │                   │
            │                       │                   │
            │ Predicate:            │                   │
        ┌───┤                       │                   │
        │   └──────────┬────────────┘                   │
 Remove │              │                                │
Actions │              └────────────────┐ ┌─────────────┘
        │                               ▼ ▼
        │        ┌──────────────────────────────────────────────────────────┐
        │        │                          Union                           │
        │        │                                                          │
        │        └─────────────────────────┬────────────────────────────────┘
        │                                  │
        │                                  ▼
        │                      ┌───────────────────────┐
        │                      │      Delta Write      │
        └──────────────────────┤                       │
                               │                       │
                               │                       │
                               └───────────────────────┘
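As a rough illustration of the ReplaceWhere plan above, the relations can be modelled as a tree. This is a sketch under stated assumptions: the `Relation` enum, `replace_where`, and `node_names` are hypothetical and are not DataFusion's `LogicalPlan` types. The sideways edges (files matched list, Remove actions) are not represented, only the main data flow.

```rust
/// Hypothetical logical relations, mirroring the boxes in the diagram.
#[derive(Debug)]
pub enum Relation {
    DeltaFindFiles { predicate: String },
    DeltaScan { input: Box<Relation> },
    DeltaDelete { predicate: String, input: Box<Relation> },
    DataSource { name: String },
    DeltaConstraintCheck { input: Box<Relation> },
    Union { inputs: Vec<Relation> },
    DeltaWrite { input: Box<Relation> },
}

impl Relation {
    /// Node names in pre-order (each node before its inputs).
    pub fn node_names(&self) -> Vec<&'static str> {
        let mut names = Vec::new();
        self.collect_names(&mut names);
        names
    }

    fn collect_names(&self, out: &mut Vec<&'static str>) {
        match self {
            Relation::DeltaFindFiles { .. } => out.push("DeltaFindFiles"),
            Relation::DataSource { .. } => out.push("DataSource"),
            Relation::DeltaScan { input } => {
                out.push("DeltaScan");
                input.collect_names(out);
            }
            Relation::DeltaDelete { input, .. } => {
                out.push("DeltaDelete");
                input.collect_names(out);
            }
            Relation::DeltaConstraintCheck { input } => {
                out.push("DeltaConstraintCheck");
                input.collect_names(out);
            }
            Relation::Union { inputs } => {
                out.push("Union");
                for input in inputs {
                    input.collect_names(out);
                }
            }
            Relation::DeltaWrite { input } => {
                out.push("DeltaWrite");
                input.collect_names(out);
            }
        }
    }
}

/// Builds the ReplaceWhere shape: rows surviving the delete are unioned with
/// the constraint-checked new data, and the union feeds a single write.
pub fn replace_where(predicate: &str, source: &str) -> Relation {
    let find = Relation::DeltaFindFiles { predicate: predicate.to_string() };
    let scan = Relation::DeltaScan { input: Box::new(find) };
    let delete = Relation::DeltaDelete {
        predicate: predicate.to_string(),
        input: Box::new(scan),
    };
    let checked = Relation::DeltaConstraintCheck {
        input: Box::new(Relation::DataSource { name: source.to_string() }),
    };
    Relation::DeltaWrite {
        input: Box::new(Relation::Union { inputs: vec![delete, checked] }),
    }
}
```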
Use Case
Once we have logical plans for Update and Delete, we can expose new DataFusion SQL statements for them.
This may also help with reusing Delete and Update in other logical plans.
Related Issue(s)