Skip to content

Support correlated scalar subquery without aggregation #16137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
findepi opened this issue May 21, 2025 · 1 comment
Open

Support correlated scalar subquery without aggregation #16137

findepi opened this issue May 21, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@findepi
Copy link
Member

findepi commented May 21, 2025

Is your feature request related to a problem or challenge?

This is perfectly valid use of scalar subquery without aggregation.
This should work in DataFusion

WITH src AS (
    SELECT *
    FROM (VALUES
        (1, NULL, 'Europe'),
        (2, 1, 'Warsaw'),
        (3, 1, 'Paris')
    ) t(id, parent_id, name)
)
SELECT
    id,
    name,
    (SELECT p.name FROM src p WHERE p.id = s.parent_id) AS parent_name
FROM src s;

Currently this fails with

Invalid (non-executable) plan after Analyzer
caused by
Error during planning: Correlated scalar subquery must be aggregated to return at most one row

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

@findepi findepi added the enhancement New feature or request label May 21, 2025
@logan-keede
Copy link
Contributor

if !subquery.outer_ref_columns.is_empty() {
match strip_inner_query(inner_plan) {
LogicalPlan::Aggregate(agg) => {
check_aggregation_in_scalar_subquery(inner_plan, agg)
}
LogicalPlan::Filter(Filter { input, .. })
if matches!(input.as_ref(), LogicalPlan::Aggregate(_)) =>
{
if let LogicalPlan::Aggregate(agg) = input.as_ref() {
check_aggregation_in_scalar_subquery(inner_plan, agg)
} else {
Ok(())
}
}
_ => {
if inner_plan
.max_rows()
.filter(|max_row| *max_row <= 1)
.is_some()
{
Ok(())
} else {
plan_err!(
"Correlated scalar subquery must be aggregated to return at most one row"
)
}
}
}?;

DataFusion tries to predict maximum rows possible instead of actually checking the number of rows.

DuckDB used to fix this by limiting the number of rows to one, now they have changed the behaviour to pass error by using something like case when rows> 0 then error(err_msg) into the projection itself .

┌──────────────┬────────────┐
│         PROJECTION        │
│    ────────────────────   │
│ CASE  WHEN ((#1 > 1)) THEN│
│  (error('More than one row│
│   returned by a subquery  │
│   used as an expression - │
│    scalar subqueries can  │
│  only return a single row.│
│          Use "SET         │
│ scalar_subquery_error_on_m│
│   ultiple_rows=false" to  │
│     revert to previous    │
│   behavior of returning a │
│ random row.')) ELSE #0 END│
│                           │
│          ~1 Rows          │
└─────────────┬─────────────┘

Reference PR

I tried using limit and got

>  WITH src AS (
    SELECT *
    FROM (VALUES
        (1, NULL, 'Europe'),
        (2, 1, 'Warsaw'),
        (3, 1, 'Paris')
    ) t(id, parent_id, name)
)
SELECT
    id,
    name,
    (SELECT p.name FROM src p WHERE p.id = s.parent_id limit 1) AS parent_name
FROM src s;
This feature is not implemented: Physical plan does not support logical expression ScalarSubquery(<subquery>)

which is weird to say the least. I do not understand why just adding a limit will make it different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants