Skip to content

GH-46636: [R] Fix evaluation of external objects not in global environment in case_when() #46667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 5, 2025

Conversation

etiennebacher
Copy link
Contributor

@etiennebacher etiennebacher commented Jun 2, 2025

Rationale for this change

When a script is called in an environment that isn't the global environment (for instance with source("my-script.R", local = new.env()), case_when() would fail to detect external objects used in conditions.

This PR fixes this behavior.

Fixes #46636

What changes are included in this PR?

When evaluating expressions in dplyr functions, eval_tidy() now takes into account mask as an environment where it should look for external objects.

@thisisnic suggested in #46636 that the bug might be due to https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs-conditional.R#L116 but I couldn't find a way to fix it there.

Are these changes tested?

I added a test for this scenario. I ensured it failed before the change and succeeds after.

Are there any user-facing changes?

There is one user-facing, non-breaking change, illustrated both in the related issue and in the new test.

Copy link

github-actions bot commented Jun 2, 2025

⚠️ GitHub issue #46636 has been automatically assigned in GitHub to PR creator.

@etiennebacher etiennebacher marked this pull request as ready for review June 2, 2025 09:20
@etiennebacher
Copy link
Contributor Author

etiennebacher commented Jun 2, 2025

I don't know the cause of the "Dev" workflow failure, it just says the job was cancelled at some point.

Edit: looks related to #46668

@thisisnic
Copy link
Member

thisisnic commented Jun 2, 2025

Thanks for the PR @etiennebacher, much appreciated!

OK, this is a bit long-winded, but I'm a bit unsure about something so wanted to spell out my reasoning.

  • In arrow_mask(), we take the functions created in the .cache environment and add them to the mask in order to be able to use our bindings from R functions to Arrow compute functions here. We add the column references to the mask too.
  • In arrow_eval() we call add_user_functions_to_mask() (which is used to enable user-created R functions in Arrow without needing to use the UDF functionality), which looks for R functions in expressions being evaluated which are currently absent from the mask, and see if we can find them in the mask's "grandparent" environment and adds them to the data mask.
    • In the comment in add_user_functions_to_mask() we reference case_when() which makes me think that this gives a clue to why it specifically doesn't work there.
  • When we now pass mask to the env param in this PR (eval_tidy(expr, data = mask, env = mask)), we get to add those items manually
    • Also: "Objects in data have priority over those in env" (from the eval_tidy() docs).

Source of the issue:

  • In case_when() the mask passed to arrow_eval() is based on the caller_env().
    • We also do this in other places, e.g. bindings for verbs like mutate() but I can't reproduce the original issue when using those functions.
      • But in mutate() we use expand_across() which calls quo_get_env() which does the job of getting the environment - (from the docs:) "A quosure is guaranteed to evaluate in its original environment and can refer to local objects safely".
      • I don't know what "safely" means here, but I'm wondering if there's some reason we might want case_when() to be able to refer to the relevant environment instead of blanket updating for eval_tidy(), if there's a reason we didn't pass the mask in as env beyond just leaving things as the default?
        • So for example, in case_when(), where we have the code mask <- caller_env(), maybe instead we actually want mask <- caller_env(n=2) or something like that, because if that works we've got less risk of unintended consequences?

Or, are there no unintended consequences likely (as data takes precedent over env anyway), and we can just do this as-is and everyone's happy.

cc @nealrichardson - is this good to go as-is?

@etiennebacher
Copy link
Contributor Author

etiennebacher commented Jun 2, 2025

in case_when(), where we have the code mask <- caller_env(), maybe instead we actually want mask <- caller_env(n=2) or something like that, because if that works we've got less risk of unintended consequences?

Thanks for the detailed reasoning. I should have probably explained a bit more what I had tried.

My first attempt was indeed to use caller_env(2) and I thought it would easily solve the issue, but it turns out this generates a warning, in addition to the original error (I locally reverted the change in arrow_eval() so I'm basically using the main branch):

Passing an environment as data mask is deprecated. Please use new_data_mask() to transform your environment to a mask.

env <- env(foo = "bar")

# Bad: as_data_mask(env) eval_tidy(expr, env)

# Good: mask <- new_data_mask(env) eval_tidy(expr, mask)

This doesn't make any sense to me, caller_env() and caller_env(2) both return an environment, but only the latter triggers this warning.

I tried to play with the n in caller_env(), with the parent environments, but nothing worked. In the end, I tried this:

  • I put browser() after mask <- caller_env(), ran source("foo.R", local = new.env()), and exported its value with mask_local <<- mask
  • did the same with source("foo.R") and exported the value to mask_global

(it's important to do local first, and then global).

Here's the output of waldo::compare():

> waldo::compare(mask_global, mask_local)
`old$.data$x$.:xp:.` is <pointer: 0x000001b2a2bee6d0>
`new$.data$x$.:xp:.` is <pointer: 0x000001b2a2bee330>

`old$.data$x$schema$.:xp:.` is <pointer: 0x000001b2a2bee7d0>
`new$.data$x$schema$.:xp:.` is <pointer: 0x000001b2a2bee3d0>

`old$.data$x$schema$fields[[1]]$.:xp:.` is <pointer: 0x000001b2a2bee350>
`new$.data$x$schema$fields[[1]]$.:xp:.` is <pointer: 0x000001b2a2bee650>

`old$.data$x$schema$fields[[1]]$type$.:xp:.` is <pointer: 0x000001b2a2bee750>
`new$.data$x$schema$fields[[1]]$type$.:xp:.` is <pointer: 0x000001b2a2bee870>

`parent.env(old$.env)` is <env:global>
`parent.env(new$.env)` is <env:0x000001b2a6293438>

`parent.env(old$.top_env)` is <env:global>
`parent.env(new$.top_env)` is <env:0x000001b2a6293438>

I see that the parent environment is different, but I can't figure out where this could be fixed. I'm open to suggestions on this, I agree that modifying eval_tidy() just for this is a bit clumsy, but I thought if this had unanticipated consequences then the test suite would catch those.

@thisisnic
Copy link
Member

Thanks for the extra context there, that's super helpful! I agree it should probably be in the test suite if anything else could affect it though there's always weird edge cases. I'll give it a few days so other folks get the chance to give it a look in case I'm missing anything here, but if not, happy to merge. Thanks again!

@nealrichardson
Copy link
Member

cc @nealrichardson - is this good to go as-is?

As long as the new test fails without the fix and it passes with it, I guess so? I don't understand how passing the mask in eval_tidy() twice would make a difference, but if everything is working and tests cover it, I don't need to understand 🤠

@thisisnic
Copy link
Member

cc @nealrichardson - is this good to go as-is?

As long as the new test fails without the fix and it passes with it, I guess so? I don't understand how passing the mask in eval_tidy() twice would make a difference, but if everything is working and tests cover it, I don't need to understand 🤠

Cool, just checking there wasn't anything "obvious" I was missing. I'll approve then. Cheers @etiennebacher for fixing this!

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, thanks for fixing!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Jun 2, 2025
@thisisnic thisisnic merged commit 835992c into apache:main Jun 5, 2025
9 of 10 checks passed
@thisisnic thisisnic removed the awaiting merge Awaiting merge label Jun 5, 2025
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 835992c.

There were 67 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

alinaliBQ pushed a commit to Bit-Quill/arrow that referenced this pull request Jun 17, 2025
…environment in `case_when()` (apache#46667)

### Rationale for this change

When a script is called in an environment that isn't the global environment (for instance with `source("my-script.R", local = new.env())`, `case_when()` would fail to detect external objects used in conditions.

This PR fixes this behavior.

Fixes apache#46636 

### What changes are included in this PR?

When evaluating expressions in `dplyr` functions, `eval_tidy()` now takes into account `mask` as an environment where it should look for external objects.

@ thisisnic suggested in apache#46636 that the bug might be due to https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs-conditional.R#L116 but I couldn't find a way to fix it there.

### Are these changes tested?

I added a test for this scenario. I ensured it failed before the change and succeeds after.

### Are there any user-facing changes?

There is one user-facing, non-breaking change, illustrated both in the related issue and in the new test.

* GitHub Issue: apache#46636

Authored-by: etiennebacher <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R] case_when() sometimes doesn't find existing objects in the environment
4 participants