Skip to content

Document jj scoping behavior in groupingsets() with practical examples #6879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

Mukulyadav2004
Copy link
Contributor

closes #5560
In this PR I improved documentation for the jj argument in groupingsets(), explaining how jj = substitute(expr) preserves both expression and environment context when j would fail with outer-environment variables.
The documentation now explains why users encounter "object not found" errors when using variables from outer environments with j, and how jj solves this problem.

Key changes in groupingsets.Rd:
1.)Added detailed explanation of scoping behavior in the \details section
2.)Added three practical examples demonstrating different jj usage patterns:
Basic usage with list and lapply
Simple aggregation
Named results with direct column references

These changes directly address the request for "more examples on the use of the jj argument" and will help users understand when and how to use this parameter in their own functions.

@MichaelChirico MichaelChirico requested review from jangorecki and removed request for MichaelChirico March 21, 2025 18:14
Copy link

codecov bot commented Mar 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.59%. Comparing base (6aa99f5) to head (893b76a).
Report is 47 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6879   +/-   ##
=======================================
  Coverage   98.59%   98.59%           
=======================================
  Files          79       79           
  Lines       14661    14685   +24     
=======================================
+ Hits        14455    14479   +24     
  Misses        206      206           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, please see if quote() won't be sufficient where substitute was used

NEWS.md Outdated
@@ -27,6 +27,8 @@
+ Argument `in.place` to `droplevels` has been removed.
+ It's now an error to set `datatable.nomatch`, which has been warning since 1.15.0.

3. The groupingsets() documentation now explicitly addresses scoping issues when using variables from outer environments (e.g., function parameters) in aggregation expressions. Previously, using such variables in j could lead to "object not found" errors due to evaluation environment mismatches. This is resolved by using jj = substitute(expr) instead, which captures both the expression and its environment per R’s scoping rules (§6). Three new examples demonstrate proper usage, including function parameter references (e.g., sum(value > threshold)) and multi-aggregation patterns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To what exactly (§6) refers to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It represents chapter , I'll update it to ch for better readability

Copy link
Member

@jangorecki jangorecki Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This chapter 6 is from R language manual, one of the few official R manuals (R lang, R admin, WRE, ...), therefore specifying that would be desired as well.

@@ -81,5 +83,15 @@ cube(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"
# groupingsets
groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"),
sets = list("color", c("year","status"), character()), id=TRUE)
# Using jj argument(enables access to variables from outer environments)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space is missing

@@ -81,5 +83,15 @@ cube(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"
# groupingsets
groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"),
sets = list("color", c("year","status"), character()), id=TRUE)
# Using jj argument(enables access to variables from outer environments)
groupingsets(DT, jj = substitute(c(list(count=.N), lapply(.SD, sum))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't quote() be sufficient here? Substitute will be good when using from inside a function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right that substitute will be good when using from inside a function and as substitution is necessary when capturing variables from the function's environment but here all variables are within data.table(like column)
But after taking a closer look at groupingsets() implementation to check whether quote() would work instead of substitute() in which after reviewing the code, I think substitute() is indeed the right approach as the whole codebase consistently use substitute()

here are snippets in which they behave the same way whether direcly calling groupingsets() or using functions like rollup() or cube()

jj = if (!missing(jj)) jj else substitute(j) and jj = substitute(j)

this code evaluates eval(jj) without specifying environment , which works correctly as it brings both the expression and its environment.
So I think using substitute() keeps things consistent with how the functions are built to work, even if quote() might be sufficient here .
But I'm open to use quote() in place of substitute() in examples if you feel it would be clearer for users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If quote() will be sufficient then it should be used instead of substitute because it will be as simple as necessary, substitute is more powerful and may create impression that it is necessary there. While it is necessary only from inside a function, that is why is used in rollup and cube. One example of function wrapping groupingsets and use of substitute would therefore be useful as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your clarification. I updated the description of jj usage - include when to use quote() and when substitute() and examples to use quote() for the direct calls to groupingsets().
I also added an example showing when substitute() is actually needed - inside a wrapper function where variables from the outer environment need to be captured.

Could you please review these changes and let me know if any further modifications are needed? Thank you!

NEWS.md Outdated
@@ -27,6 +27,8 @@
+ Argument `in.place` to `droplevels` has been removed.
+ It's now an error to set `datatable.nomatch`, which has been warning since 1.15.0.

3. The groupingsets() documentation now explicitly addresses scoping issues when using variables from outer environments (e.g., function parameters) in aggregation expressions. Previously, using such variables in j could lead to "object not found" errors due to evaluation environment mismatches. This is resolved by using jj = substitute(expr) instead, which captures both the expression and its environment per R’s scoping rules (Chapter 6 of the R Language Definition manual). Three new examples demonstrate proper usage, including function parameter references (e.g., sum(value > threshold)) and multi-aggregation patterns.
Copy link
Contributor

@aitap aitap Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substitute() does not capture the environment (that's the problem that quosures are intended to solve). What it does is access the list or environment given to substitute() as an argument (which defaults to the caller's environment) at the time of the call. Any symbol that can be directly found in that environment (ignoring the parent chain) is substituted: promises are replaced with the promise expression (losing the promise environment in the process!), other values are substituted directly.

# Example showing when substitute() is necessary - in a wrapper function
custom_grouping <- function(DT, multiplier = 1) {
# Here substitute() is needed to capture 'multiplier' from function environment
groupingsets(DT, jj = substitute(list(count = .N, total = sum(value) * multiplier)),
Copy link
Contributor

@aitap aitap Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, a plain substitute() is not a reliable solution here. Consider:

local({ foo <- 3; custom_grouping(DT, multiplier=foo) })

substitute(list(count = .N, total = sum(value) * multiplier)) expands the multiplier promise into foo, which is not a variable that custom_grouping is supposed to know about, so groupingsets fails to find foo (since it lives in a local environment that environment(custom_grouping) does not inherit from).

What does work is substitute(list(count = .N, total = sum(value) * multiplier), list(multiplier = multiplier)), because with a list argument to substitute, multiplier is a plain value, not an argument promise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for explaining the nuance about substitute() with nested environments. I'll implement your suggested approach using substitute(expr, list(param=value)) in the example to properly handle the multiplier parameter regardless of call context.

@Mukulyadav2004
Copy link
Contributor Author

Hi @aitap @jangorecki,
I have implemented the changes as suggested. Could you please review them and let me know if there are any further suggestions? I am open to feedback.
Thank you

sets = list("color", c("year", "status"), character()), id = TRUE)
}
# Using the wrapper function
custom_grouping(DT, multiplier = 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not much gain providing value to be substituted. Shouldn't this be a symbol to be substituted? For current example j argument will work well enough, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right , here symbol from outer environment should be used . I updated this example to use this .

by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()), id = TRUE)
}
# Using the wrapper function
custom_grouping(DT, multiplier = 2)
local({ foo <- 3; custom_grouping(DT, multiplier = foo) })
Copy link
Member

@jangorecki jangorecki Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is overly complex now. We don't need local(), just something similar to cube/rollup, where user can pass column names as symbols (rather than a value as it is now) to his/her wrapper function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so does by simply replacing multiplier = 2, to a parameter holding a symbol works here ? , by converting amount to symbol in groupingsets .
Here amount is defined early before the examples as one of the parameter like color , value and status in data.table.
For eg .
custom_grouping <- function(DT, value_col) {
groupingsets(DT, jj = substitute(list(count = .N, total = sum(col)),
list(col = as.name(value_col))),
by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()),
id = TRUE)
}
custom_grouping(DT, value_col = "amount")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you would like an example of function that substitutes a j argument and gives it to groupingsets as jj?

Copy link
Contributor Author

@Mukulyadav2004 Mukulyadav2004 Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd like to provide an example as I believe that since j evaluates expressions directly in groupingsets()'s environment, it can't resolve dynamic symbols like as.name(value_col). Using substitute() for jj replaces the placeholder with the actual column name (amount), ensuring that sum(amount) correctly operates on the data.
I hope that makes sense—I'd appreciate hearing your thoughts on whether this seems reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mukulyadav2004 It would be great!
I was going to ask for such an example, just using substitute2 instead of substitute and making explicit that list(col = as.name(value_col)) is the env argument: env = list(col = as.name(value_col)).

Thank you for this work!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. You're right that this function doesn't use NSE since the input is a string. But as per requirement of example that uses substitute2() with an explicit env = list(col = as.name(value_col)), here below is complete example-

custom_grouping <- function(DT, value_col) {
  groupingsets(DT, jj = substitute2(list(count = .N, total = sum(col)), env = list(col = as.name(value_col))),
    by = c("color", "year", "status"),
    sets = list("color", c("year", "status"), character()),
    id = TRUE
  )
}
custom_grouping(DT, "amount")

I hope this aligns with your intent, but if there's a better way you’d suggest approaching this, I’d really appreciate your guidance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this one a better way?

custom_grouping <- function(DT, value_col) {
  groupingsets(DT, jj = substitute2(list(count = .N, total = sum(col)), env = list(col = value_col)),
    by = c("color", "year", "status"),
    sets = list("color", c("year", "status"), character()),
    id = TRUE
  )
}
custom_grouping(DT, "amount")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the use of substitute2() here, one function call less.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intent is to use an example function that accepts a j as an argument and then forwards it as the jj argument to groupingsets. Let's not document the workaround of using substitute to insert j's dependencies into the expression (it won't help a wrapper that accepts an arbitrary j); there is a more reliable fix for the underlying problem in #6899.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I’ve added example that accepts an arbitrary j and forwards it with substitute(j) to groupingsets(), now that #6899 handles the environment correctly, this removes need for relying on substitute2() or env workarounds. Let me know if you'd like any further adjustments.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finally got some time to elaborate a bit more

@@ -34,6 +34,8 @@ groupingsets(x, \dots)
The \code{label} argument can be a named list of scalars, or a scalar, or \code{NULL}. When \code{label} is a list, each element name must be (1) a variable name in \code{by}, or (2) the first element of the class in the data.table \code{x} of a variable in \code{by}, or (3) one of 'character', 'integer', 'numeric', 'factor', 'Date', 'IDate'. The order of the list elements is not important. A label specified by variable name will apply only to that variable, while a label specified by first element of a class will apply to all variables in \code{by} for which the first element of the class of the variable in \code{x} matches the \code{label} element name, except for variables that have a label specified by variable name (that is, specification by variable name takes precedence over specification by class). For \code{label} elements with name in \code{by}, the class of the label value must be the same as the class of the variable in \code{x}. For \code{label} elements with name not in \code{by}, the first element of the class of the label value must be the same as the \code{label} element name. For example, \code{label = list(integer = 999, IDate = as.Date("3000-01-01"))} would produce an error because \code{class(999)[1]} is not \code{"integer"} and \code{class(as.Date("3000-01-01"))[1]} is not \code{"IDate"}. A corrected specification would be \code{label = list(integer = 999L, IDate = as.IDate("3000-01-01"))}.

The \code{label = <scalar>} option provides a shorter alternative in the case where only one class of grouping variable requires a label. For example, \code{label = list(character = "Total")} can be shortened to \code{label = "Total"}. When this option is used, the label will be applied to all variables in \code{by} for which the first element of the class of the variable in \code{x} matches the first element of the class of the scalar.

The \code{jj} argument provides a mechanism to handle scoping issues when working with variables from outer environments. When using \code{j} directly with variables not defined in the columns of the data.table, R's evaluation rules may not find these variables, resulting in "object not found" errors. The \code{jj} argument accepts a quoted expression which can be created with either \code{quote()} (for direct calls referencing only columns and special symbols) or \code{substitute()} (when inside functions that need to capture parameters from the outer environment). For reliable behavior with function parameters, especially in nested environments, use the two-argument form \code{substitute(expr, list(param=value))} which binds values rather than symbols. This approach is particularly useful in functions where parameters need to be referenced within aggregation expressions. When \code{jj} is provided, the function will ignore any value specified in the \code{j} argument.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description of jj in the manual is as follows

quoted version of j argument, for convenience. When provided function will ignore j argument.

It sounds quite different than what was described in your paragraph here.

Your description focuses on a single example of misuse of jj. It is not really related to data.table's implementation here, you will face the same problem whenever you are using function that uses NSE.
Therefore I don't think it is reasonable to write such a long paragraph explaining that misuse.
Rather what seems more appropriate is simply an example of grouping sets usage inside the function which also uses NSE.

by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()), id = TRUE)
}
# Using the wrapper function
custom_grouping(DT, multiplier = 2)
local({ foo <- 3; custom_grouping(DT, multiplier = foo) })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple example of usage of grouping sets inside the function that uses NSE

data

library(data.table)
n = 24L
set.seed(25)
DT = data.table(
    color = sample(c("green","yellow","red"), n, TRUE),
    year = as.Date(sample(paste0(2011:2015,"-01-01"), n, TRUE)),
    status = as.factor(sample(c("removed","active","inactive","archived"), n, TRUE)),
    amount = sample(1:5, n, TRUE),
    value = sample(c(3, 3.5, 2.5, 2), n, TRUE)
)

usage

mygrp = function(x, j) {
  groupingsets(x, jj = substitute(j),
               by = c("color", "year", "status"),
               sets = list("color", c("year", "status"), character()), id = TRUE)
}
multiplier = 2
mygrp(DT, .(count = .N, total = sum(value) * multiplier))

Something as simple as this won't do?

@@ -81,5 +81,25 @@ cube(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"
# groupingsets
groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"),
sets = list("color", c("year","status"), character()), id=TRUE)

# Using jj argument with quote() for direct calls
groupingsets(DT, jj = quote(c(list(count=.N), lapply(.SD, sum))),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the advantage of using jj in this example over j?

if we use j we have one function call less: quote()
examples like this may be confusing as I don't see good reasons to use it like that. If jj is constructed programatically, or passed from outer function, then yes jj is useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response; our practical exams are ongoing. We initially introduced jj to highlight how expressions referencing variables in an outer environment can be effectively captured. But if you feel no need for this as of now so keeping one wrapper-function example demonstrating how jj and substitute() handle outer-environment variables, and remove the extra jj examples for direct calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"highlight how expressions referencing variables in an outer environment can be effectively captured" complete example of it should make it more clear for me and other readers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the confusion in my previous message. I misspoke – the jj=quote(...) examples don't actually capture outer variables. The wrapper function example using jj=substitute(j) is the one that correctly demonstrates capturing variables(multiplier) from the outer environment.
I'll remove the confusing jj=quote(...) examples and keep only the wrapper function example, as that clearly shows the intended use case for handling outer variables.

NEWS.md Outdated
@@ -27,6 +27,8 @@
+ Argument `in.place` to `droplevels` has been removed.
+ It's now an error to set `datatable.nomatch`, which has been warning since 1.15.0.

3. The groupingsets() documentation now explicitly addresses scoping issues when using variables from outer environments (e.g., function parameters) in aggregation expressions. Previously, using such variables in j could lead to "object not found" errors due to evaluation environment mismatches. This is resolved by using jj = substitute(expr, list(param=value)) instead, which creates an expression with values directly substituted at call time, following R's evaluation rules (Chapter 6 of the R Language Definition manual). Three new examples demonstrate proper usage, including function parameter references (e.g., sum(value > threshold)) and multi-aggregation patterns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering changes in this PR, is this news description still valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update news after you approve the final example

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupingsets() wrong scoping logic
4 participants