Support correlated scalar subquery without aggregation #16137

findepi · 2025-05-21T12:55:18Z

Is your feature request related to a problem or challenge?

This is perfectly valid use of scalar subquery without aggregation.
This should work in DataFusion

WITH src AS (
    SELECT *
    FROM (VALUES
        (1, NULL, 'Europe'),
        (2, 1, 'Warsaw'),
        (3, 1, 'Paris')
    ) t(id, parent_id, name)
)
SELECT
    id,
    name,
    (SELECT p.name FROM src p WHERE p.id = s.parent_id) AS parent_name
FROM src s;

Currently this fails with

Invalid (non-executable) plan after Analyzer
caused by
Error during planning: Correlated scalar subquery must be aggregated to return at most one row

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

Scalar subquery (Expr::ScalarSubquery) does not fail when subquery is not scalar #16134

The text was updated successfully, but these errors were encountered:

logan-keede · 2025-05-25T07:51:22Z

datafusion/datafusion/expr/src/logical_plan/invariants.rs

Lines 174 to 201 in 34f250a

    
           if !subquery.outer_ref_columns.is_empty() { 
        
               match strip_inner_query(inner_plan) { 
        
                   LogicalPlan::Aggregate(agg) => { 
        
                       check_aggregation_in_scalar_subquery(inner_plan, agg) 
        
                   } 
        
                   LogicalPlan::Filter(Filter { input, .. }) 
        
                       if matches!(input.as_ref(), LogicalPlan::Aggregate(_)) => 
        
                   { 
        
                       if let LogicalPlan::Aggregate(agg) = input.as_ref() { 
        
                           check_aggregation_in_scalar_subquery(inner_plan, agg) 
        
                       } else { 
        
                           Ok(()) 
        
                       } 
        
                   } 
        
                   _ => { 
        
                       if inner_plan 
        
                           .max_rows() 
        
                           .filter(|max_row| *max_row <= 1) 
        
                           .is_some() 
        
                       { 
        
                           Ok(()) 
        
                       } else { 
        
                           plan_err!( 
        
                               "Correlated scalar subquery must be aggregated to return at most one row" 
        
                           ) 
        
                       } 
        
                   } 
        
               }?;

DataFusion tries to predict maximum rows possible instead of actually checking the number of rows.

DuckDB used to fix this by limiting the number of rows to one, now they have changed the behaviour to pass error by using something like case when rows> 0 then error(err_msg) into the projection itself .

┌──────────────┬────────────┐
│         PROJECTION        │
│    ────────────────────   │
│ CASE  WHEN ((#1 > 1)) THEN│
│  (error('More than one row│
│   returned by a subquery  │
│   used as an expression - │
│    scalar subqueries can  │
│  only return a single row.│
│          Use "SET         │
│ scalar_subquery_error_on_m│
│   ultiple_rows=false" to  │
│     revert to previous    │
│   behavior of returning a │
│ random row.')) ELSE #0 END│
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘

Reference PR

I tried using limit and got

>  WITH src AS (
    SELECT *
    FROM (VALUES
        (1, NULL, 'Europe'),
        (2, 1, 'Warsaw'),
        (3, 1, 'Paris')
    ) t(id, parent_id, name)
)
SELECT
    id,
    name,
    (SELECT p.name FROM src p WHERE p.id = s.parent_id limit 1) AS parent_name
FROM src s;
This feature is not implemented: Physical plan does not support logical expression ScalarSubquery(<subquery>)

which is weird to say the least. I do not understand why just adding a limit will make it different.

findepi added the enhancement New feature or request label May 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support correlated scalar subquery without aggregation #16137

Support correlated scalar subquery without aggregation #16137

findepi commented May 21, 2025

logan-keede commented May 25, 2025

Uh oh!

Support correlated scalar subquery without aggregation #16137

Support correlated scalar subquery without aggregation #16137

Comments

findepi commented May 21, 2025

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

logan-keede commented May 25, 2025

Uh oh!