-
Notifications
You must be signed in to change notification settings - Fork 1k
Document jj scoping behavior in groupingsets() with practical examples #6879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Document jj scoping behavior in groupingsets() with practical examples #6879
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6879 +/- ##
=======================================
Coverage 98.59% 98.59%
=======================================
Files 79 79
Lines 14661 14685 +24
=======================================
+ Hits 14455 14479 +24
Misses 206 206 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, please see if quote() won't be sufficient where substitute was used
NEWS.md
Outdated
@@ -27,6 +27,8 @@ | |||
+ Argument `in.place` to `droplevels` has been removed. | |||
+ It's now an error to set `datatable.nomatch`, which has been warning since 1.15.0. | |||
|
|||
3. The groupingsets() documentation now explicitly addresses scoping issues when using variables from outer environments (e.g., function parameters) in aggregation expressions. Previously, using such variables in j could lead to "object not found" errors due to evaluation environment mismatches. This is resolved by using jj = substitute(expr) instead, which captures both the expression and its environment per R’s scoping rules (§6). Three new examples demonstrate proper usage, including function parameter references (e.g., sum(value > threshold)) and multi-aggregation patterns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To what exactly (§6) refers to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It represents chapter , I'll update it to ch for better readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This chapter 6 is from R language manual, one of the few official R manuals (R lang, R admin, WRE, ...), therefore specifying that would be desired as well.
man/groupingsets.Rd
Outdated
@@ -81,5 +83,15 @@ cube(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status" | |||
# groupingsets | |||
groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"), | |||
sets = list("color", c("year","status"), character()), id=TRUE) | |||
# Using jj argument(enables access to variables from outer environments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space is missing
man/groupingsets.Rd
Outdated
@@ -81,5 +83,15 @@ cube(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status" | |||
# groupingsets | |||
groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"), | |||
sets = list("color", c("year","status"), character()), id=TRUE) | |||
# Using jj argument(enables access to variables from outer environments) | |||
groupingsets(DT, jj = substitute(c(list(count=.N), lapply(.SD, sum))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't quote() be sufficient here? Substitute will be good when using from inside a function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right that substitute will be good when using from inside a function and as substitution is necessary when capturing variables from the function's environment but here all variables are within data.table(like column)
But after taking a closer look at groupingsets() implementation to check whether quote() would work instead of substitute() in which after reviewing the code, I think substitute() is indeed the right approach as the whole codebase consistently use substitute()
here are snippets in which they behave the same way whether direcly calling groupingsets() or using functions like rollup() or cube()
jj = if (!missing(jj)) jj else substitute(j) and jj = substitute(j)
this code evaluates eval(jj) without specifying environment , which works correctly as it brings both the expression and its environment.
So I think using substitute() keeps things consistent with how the functions are built to work, even if quote() might be sufficient here .
But I'm open to use quote() in place of substitute() in examples if you feel it would be clearer for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If quote() will be sufficient then it should be used instead of substitute because it will be as simple as necessary, substitute is more powerful and may create impression that it is necessary there. While it is necessary only from inside a function, that is why is used in rollup and cube. One example of function wrapping groupingsets and use of substitute would therefore be useful as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your clarification. I updated the description of jj usage - include when to use quote() and when substitute() and examples to use quote() for the direct calls to groupingsets().
I also added an example showing when substitute() is actually needed - inside a wrapper function where variables from the outer environment need to be captured.
Could you please review these changes and let me know if any further modifications are needed? Thank you!
NEWS.md
Outdated
@@ -27,6 +27,8 @@ | |||
+ Argument `in.place` to `droplevels` has been removed. | |||
+ It's now an error to set `datatable.nomatch`, which has been warning since 1.15.0. | |||
|
|||
3. The groupingsets() documentation now explicitly addresses scoping issues when using variables from outer environments (e.g., function parameters) in aggregation expressions. Previously, using such variables in j could lead to "object not found" errors due to evaluation environment mismatches. This is resolved by using jj = substitute(expr) instead, which captures both the expression and its environment per R’s scoping rules (Chapter 6 of the R Language Definition manual). Three new examples demonstrate proper usage, including function parameter references (e.g., sum(value > threshold)) and multi-aggregation patterns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
substitute()
does not capture the environment (that's the problem that quosures are intended to solve). What it does is access the list or environment given to substitute()
as an argument (which defaults to the caller's environment) at the time of the call. Any symbol that can be directly found in that environment (ignoring the parent chain) is substituted: promises are replaced with the promise expression (losing the promise environment in the process!), other values are substituted directly.
man/groupingsets.Rd
Outdated
# Example showing when substitute() is necessary - in a wrapper function | ||
custom_grouping <- function(DT, multiplier = 1) { | ||
# Here substitute() is needed to capture 'multiplier' from function environment | ||
groupingsets(DT, jj = substitute(list(count = .N, total = sum(value) * multiplier)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, a plain substitute()
is not a reliable solution here. Consider:
local({ foo <- 3; custom_grouping(DT, multiplier=foo) })
substitute(list(count = .N, total = sum(value) * multiplier))
expands the multiplier
promise into foo
, which is not a variable that custom_grouping
is supposed to know about, so groupingsets
fails to find foo
(since it lives in a local environment that environment(custom_grouping)
does not inherit from).
What does work is substitute(list(count = .N, total = sum(value) * multiplier), list(multiplier = multiplier))
, because with a list
argument to substitute
, multiplier
is a plain value, not an argument promise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for explaining the nuance about substitute() with nested environments. I'll implement your suggested approach using substitute(expr, list(param=value)) in the example to properly handle the multiplier parameter regardless of call context.
Hi @aitap @jangorecki, |
man/groupingsets.Rd
Outdated
sets = list("color", c("year", "status"), character()), id = TRUE) | ||
} | ||
# Using the wrapper function | ||
custom_grouping(DT, multiplier = 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not much gain providing value to be substituted. Shouldn't this be a symbol to be substituted? For current example j argument will work well enough, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right , here symbol from outer environment should be used . I updated this example to use this .
man/groupingsets.Rd
Outdated
by = c("color", "year", "status"), | ||
sets = list("color", c("year", "status"), character()), id = TRUE) | ||
} | ||
# Using the wrapper function | ||
custom_grouping(DT, multiplier = 2) | ||
local({ foo <- 3; custom_grouping(DT, multiplier = foo) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is overly complex now. We don't need local(), just something similar to cube/rollup, where user can pass column names as symbols (rather than a value as it is now) to his/her wrapper function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so does by simply replacing multiplier = 2, to a parameter holding a symbol works here ? , by converting amount to symbol in groupingsets .
Here amount is defined early before the examples as one of the parameter like color , value and status in data.table.
For eg .
custom_grouping <- function(DT, value_col) {
groupingsets(DT, jj = substitute(list(count = .N, total = sum(col)),
list(col = as.name(value_col))),
by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()),
id = TRUE)
}
custom_grouping(DT, value_col = "amount")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you would like an example of function that substitutes a j
argument and gives it to groupingsets
as jj
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd like to provide an example as I believe that since j evaluates expressions directly in groupingsets()'s environment, it can't resolve dynamic symbols like as.name(value_col). Using substitute() for jj replaces the placeholder with the actual column name (amount), ensuring that sum(amount) correctly operates on the data.
I hope that makes sense—I'd appreciate hearing your thoughts on whether this seems reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mukulyadav2004 It would be great!
I was going to ask for such an example, just using substitute2
instead of substitute
and making explicit that list(col = as.name(value_col))
is the env
argument: env = list(col = as.name(value_col))
.
Thank you for this work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. You're right that this function doesn't use NSE since the input is a string. But as per requirement of example that uses substitute2()
with an explicit env = list(col = as.name(value_col))
, here below is complete example-
custom_grouping <- function(DT, value_col) {
groupingsets(DT, jj = substitute2(list(count = .N, total = sum(col)), env = list(col = as.name(value_col))),
by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()),
id = TRUE
)
}
custom_grouping(DT, "amount")
I hope this aligns with your intent, but if there's a better way you’d suggest approaching this, I’d really appreciate your guidance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this one a better way?
custom_grouping <- function(DT, value_col) {
groupingsets(DT, jj = substitute2(list(count = .N, total = sum(col)), env = list(col = value_col)),
by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()),
id = TRUE
)
}
custom_grouping(DT, "amount")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the use of substitute2()
here, one function call less.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intent is to use an example function that accepts a j
as an argument and then forwards it as the jj
argument to groupingsets
. Let's not document the workaround of using substitute
to insert j
's dependencies into the expression (it won't help a wrapper that accepts an arbitrary j
); there is a more reliable fix for the underlying problem in #6899.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. I’ve added example that accepts an arbitrary j and forwards it with substitute(j)
to groupingsets()
, now that #6899 handles the environment correctly, this removes need for relying on substitute2()
or env
workarounds. Let me know if you'd like any further adjustments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finally got some time to elaborate a bit more
man/groupingsets.Rd
Outdated
@@ -34,6 +34,8 @@ groupingsets(x, \dots) | |||
The \code{label} argument can be a named list of scalars, or a scalar, or \code{NULL}. When \code{label} is a list, each element name must be (1) a variable name in \code{by}, or (2) the first element of the class in the data.table \code{x} of a variable in \code{by}, or (3) one of 'character', 'integer', 'numeric', 'factor', 'Date', 'IDate'. The order of the list elements is not important. A label specified by variable name will apply only to that variable, while a label specified by first element of a class will apply to all variables in \code{by} for which the first element of the class of the variable in \code{x} matches the \code{label} element name, except for variables that have a label specified by variable name (that is, specification by variable name takes precedence over specification by class). For \code{label} elements with name in \code{by}, the class of the label value must be the same as the class of the variable in \code{x}. For \code{label} elements with name not in \code{by}, the first element of the class of the label value must be the same as the \code{label} element name. For example, \code{label = list(integer = 999, IDate = as.Date("3000-01-01"))} would produce an error because \code{class(999)[1]} is not \code{"integer"} and \code{class(as.Date("3000-01-01"))[1]} is not \code{"IDate"}. A corrected specification would be \code{label = list(integer = 999L, IDate = as.IDate("3000-01-01"))}. | |||
|
|||
The \code{label = <scalar>} option provides a shorter alternative in the case where only one class of grouping variable requires a label. For example, \code{label = list(character = "Total")} can be shortened to \code{label = "Total"}. When this option is used, the label will be applied to all variables in \code{by} for which the first element of the class of the variable in \code{x} matches the first element of the class of the scalar. | |||
|
|||
The \code{jj} argument provides a mechanism to handle scoping issues when working with variables from outer environments. When using \code{j} directly with variables not defined in the columns of the data.table, R's evaluation rules may not find these variables, resulting in "object not found" errors. The \code{jj} argument accepts a quoted expression which can be created with either \code{quote()} (for direct calls referencing only columns and special symbols) or \code{substitute()} (when inside functions that need to capture parameters from the outer environment). For reliable behavior with function parameters, especially in nested environments, use the two-argument form \code{substitute(expr, list(param=value))} which binds values rather than symbols. This approach is particularly useful in functions where parameters need to be referenced within aggregation expressions. When \code{jj} is provided, the function will ignore any value specified in the \code{j} argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description of jj
in the manual is as follows
quoted version of j argument, for convenience. When provided function will ignore j argument.
It sounds quite different than what was described in your paragraph here.
Your description focuses on a single example of misuse of jj
. It is not really related to data.table's implementation here, you will face the same problem whenever you are using function that uses NSE.
Therefore I don't think it is reasonable to write such a long paragraph explaining that misuse.
Rather what seems more appropriate is simply an example of grouping sets usage inside the function which also uses NSE.
man/groupingsets.Rd
Outdated
by = c("color", "year", "status"), | ||
sets = list("color", c("year", "status"), character()), id = TRUE) | ||
} | ||
# Using the wrapper function | ||
custom_grouping(DT, multiplier = 2) | ||
local({ foo <- 3; custom_grouping(DT, multiplier = foo) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simple example of usage of grouping sets inside the function that uses NSE
data
library(data.table)
n = 24L
set.seed(25)
DT = data.table(
color = sample(c("green","yellow","red"), n, TRUE),
year = as.Date(sample(paste0(2011:2015,"-01-01"), n, TRUE)),
status = as.factor(sample(c("removed","active","inactive","archived"), n, TRUE)),
amount = sample(1:5, n, TRUE),
value = sample(c(3, 3.5, 2.5, 2), n, TRUE)
)
usage
mygrp = function(x, j) {
groupingsets(x, jj = substitute(j),
by = c("color", "year", "status"),
sets = list("color", c("year", "status"), character()), id = TRUE)
}
multiplier = 2
mygrp(DT, .(count = .N, total = sum(value) * multiplier))
Something as simple as this won't do?
man/groupingsets.Rd
Outdated
@@ -81,5 +81,25 @@ cube(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status" | |||
# groupingsets | |||
groupingsets(DT, j = c(list(count=.N), lapply(.SD, sum)), by = c("color","year","status"), | |||
sets = list("color", c("year","status"), character()), id=TRUE) | |||
|
|||
# Using jj argument with quote() for direct calls | |||
groupingsets(DT, jj = quote(c(list(count=.N), lapply(.SD, sum))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the advantage of using jj
in this example over j
?
if we use j
we have one function call less: quote()
examples like this may be confusing as I don't see good reasons to use it like that. If jj
is constructed programatically, or passed from outer function, then yes jj
is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response; our practical exams are ongoing. We initially introduced jj
to highlight how expressions referencing variables in an outer environment can be effectively captured. But if you feel no need for this as of now so keeping one wrapper-function example demonstrating how jj
and substitute()
handle outer-environment variables, and remove the extra jj
examples for direct calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"highlight how expressions referencing variables in an outer environment can be effectively captured" complete example of it should make it more clear for me and other readers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the confusion in my previous message. I misspoke – the jj=quote(...)
examples don't actually capture outer variables. The wrapper function example using jj=substitute(j)
is the one that correctly demonstrates capturing variables(multiplier
) from the outer environment.
I'll remove the confusing jj=quote(...)
examples and keep only the wrapper function example, as that clearly shows the intended use case for handling outer variables.
NEWS.md
Outdated
@@ -27,6 +27,8 @@ | |||
+ Argument `in.place` to `droplevels` has been removed. | |||
+ It's now an error to set `datatable.nomatch`, which has been warning since 1.15.0. | |||
|
|||
3. The groupingsets() documentation now explicitly addresses scoping issues when using variables from outer environments (e.g., function parameters) in aggregation expressions. Previously, using such variables in j could lead to "object not found" errors due to evaluation environment mismatches. This is resolved by using jj = substitute(expr, list(param=value)) instead, which creates an expression with values directly substituted at call time, following R's evaluation rules (Chapter 6 of the R Language Definition manual). Three new examples demonstrate proper usage, including function parameter references (e.g., sum(value > threshold)) and multi-aggregation patterns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
considering changes in this PR, is this news description still valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update news after you approve the final example
closes #5560
In this PR I improved documentation for the jj argument in groupingsets(), explaining how jj = substitute(expr) preserves both expression and environment context when j would fail with outer-environment variables.
The documentation now explains why users encounter "object not found" errors when using variables from outer environments with j, and how jj solves this problem.
Key changes in groupingsets.Rd:
1.)Added detailed explanation of scoping behavior in the \details section
2.)Added three practical examples demonstrating different jj usage patterns:
Basic usage with list and lapply
Simple aggregation
Named results with direct column references
These changes directly address the request for "more examples on the use of the jj argument" and will help users understand when and how to use this parameter in their own functions.