Can i recover the aesthetic mapping from within a `compute_*()` step? #68

corybrunson · 2024-12-14T21:59:48Z

Hi everyone! This is something i've been banging my head over for weeks now. I feel confident that it's possible with limited inelegance but i've not found a solution via trial-and-error or SO (see, for example, this Q&A that almost gets me there). I hope it's an OK thing to post here, in particular before posting it to SO.

The idea is to build a statistical transformation layer that uses two data sets, one that it directly transforms and another that the transformation is done with respect to. What i'd like to be able to do is transform the inherited data with respect to an additional referent data set having the same original variables as data passed (presumably but not necessarily as a parameter) to the layer. The layer should then pass the referent through the inherited aesthetic mappings before computation (presumably but not necessarily in $setup_params()).

In the example below, the referent data has only one row and the inherited data is transformed to the endpoints of the segments of their projections onto the axis of the referent data. I can make the plot i want by manually applying the aesthetic mappings, but i'd like to have this done internally.
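The projection itself can be sketched in base R, independent of {ggplot2}. The helper name project_onto() and the toy coordinates are made up for illustration. The arithmetic mirrors the dot-product and inertia steps in the reprex, assuming the referent axis passes through the origin.

```r
# Hypothetical helper: project the rows of `pts` onto the axis through
# the origin and the single-row referent `ref`.
project_onto <- function(pts, ref) {
  dots <- pts$x * ref$x + pts$y * ref$y   # dot products with the referent
  inertia <- ref$x^2 + ref$y^2            # squared length of the referent
  data.frame(
    xend = dots / inertia * ref$x,
    yend = dots / inertia * ref$y
  )
}

pts <- data.frame(x = c(1, 0, 2), y = c(1, 3, 0))
ref <- data.frame(x = 2, y = 0)

# With a referent along the x-axis, each point keeps its x and drops its y.
proj <- project_onto(pts, ref)
proj
```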

Feel free to expound on how this is a bad idea that i should not pursue, if that's how you feel. : ) For the curious, this is intended for {ordr}, as drafted for stat_rule() in the offset branch.

library(ggplot2)

StatProj <- ggproto(
  "StatProj", Stat,
  
  required_aes = c("x", "y"),
  
  setup_data = function(data, params) {
    
    # apply aesthetic mapping to data
    if (is.null(params$referent[["x"]]) || is.null(params$referent[["y"]]))
      stop("I want to apply an inherited aesthetic mapping, presumably here.")
    
    data
  },
  
  compute_group = function(self, data, scales, referent = NULL, na.rm = FALSE) {
    
    # arbitrary values of computed aesthetics
    res <- transform(
      data,
      xend = NA_real_,
      yend = NA_real_
    )
    # empty initialized output
    res <- res[c(), , drop = FALSE]
    
    # no referent means no projection
    if (is.null(referent) || ! is.data.frame(referent)) return(res)
    
    # compute and collect projections of `data` onto `referent` rows
    inertias <- referent$x^2 + referent$y^2
    for (i in seq(nrow(referent))) {
      data$dots <- data$x * referent$x[i] + data$y * referent$y[i]
      res_i <- transform(
        data,
        xend = dots / inertias[i] * referent$x[i],
        yend = dots / inertias[i] * referent$y[i]
      )
      res <- rbind(res, res_i)
    }
    
    # output segment data
    res
  }
)

stat_proj <- function(
    mapping = NULL, data = NULL, geom = "segment", position = "identity",
    show.legend = NA,
    inherit.aes = TRUE,
    referent = NULL,
    ...
) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatProj,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      referent = referent,
      na.rm = FALSE,
      ...
    )
  )
}

# Simplify the Motor Trends data to two predictors legible at aspect ratio 1.
mtcars |>
  transform(hp00 = hp/100) |>
  subset(select = c(mpg, hp00, wt)) ->
  subcars
# Compute the gradient of `mpg` against these two predictors.
lm(mpg ~ hp00 + wt, subcars) |>
  coefficients() |>
  as.list() |> as.data.frame() ->
  grad

# Here's the setup; i want to project the data points onto the gradient axis.
ggplot(subcars, aes(x = hp00, y = wt)) +
  coord_equal() +
  geom_point() +
  geom_segment(data = grad, aes(xend = 0, yend = 0))

# This doesn't work, but i want it to.
ggplot(subcars, aes(x = hp00, y = wt)) +
  coord_equal() +
  geom_point() +
  geom_segment(data = grad, aes(xend = 0, yend = 0)) +
  stat_proj(referent = grad)
#> Error in `stat_proj()`:
#> ! Problem while computing stat.
#> ℹ Error occurred in the 3rd layer.
#> Caused by error in `setup_data()`:
#> ! I want to apply an inherited aesthetic mapping, presumably here.

# This works, but i don't want it to.
ggplot(subcars, aes(x = hp00, y = wt)) +
  coord_equal() +
  geom_point() +
  geom_segment(data = grad, aes(xend = 0, yend = 0)) +
  stat_proj(referent = transform(grad, x = hp00, y = wt))

Created on 2024-12-14 with reprex v2.1.1
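For reference, the manual step in the last plot amounts to evaluating each mapped expression against the referent. Here is a minimal sketch, assuming the mapping is available as the list of quosures that aes() returns. apply_mapping() is a hypothetical helper, not a {ggplot2} function, and it leans on rlang::eval_tidy() for the evaluation.

```r
library(ggplot2)

# Hypothetical helper: evaluate every mapped expression in `mapping`
# against `referent`, returning a data frame named by aesthetic.
apply_mapping <- function(referent, mapping) {
  as.data.frame(lapply(mapping, rlang::eval_tidy, data = referent))
}

# Same data preparation as in the reprex, written without pipes
subcars <- transform(mtcars, hp00 = hp / 100)[c("mpg", "hp00", "wt")]
grad <- as.data.frame(as.list(coefficients(lm(mpg ~ hp00 + wt, subcars))))

# Equivalent in effect to transform(grad, x = hp00, y = wt), but driven
# by the mapping and keeping only the mapped columns
ref <- apply_mapping(grad, aes(x = hp00, y = wt))
```

Inside a Stat, a call like this could plausibly run in $setup_params() once the mapping has been made available as a parameter, which is the missing piece the question asks about.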

teunbrand · 2024-12-14T23:13:43Z

teunbrand
Dec 14, 2024
Maintainer

Is it important to you that this stat can be freely combined with arbitrary geoms? In other words, can we pull tricks in the stat_proj() constructor function? If this is not important and we can pull tricks, I bet you we can get layer-controlled variables like the mapping into the stat parameters somehow.

8 replies

teunbrand Dec 15, 2024
Maintainer

With problems like these, I usually like to whittle down the problem to the most minimal mechanic. So here is a stat that just reports on a specific parameter. If we can get this to print the mapping, we should have figured it out.

library(ggplot2)

# A dummy reporter stat
MyStat <- ggproto(
  "MyStat", StatIdentity,
  setup_data = function(self, data, params) {
    print(params$mapping)
    data
  }
)

Now the goal is to get the mapping into the parameters.
In the part below, we're making a layer the usual way, but then we're extending it so we include the computed mapping as a stat parameter.

my_stat <- function(..., geom = "point", position = "identity") {
  # I'd do the proper layer-building ritual here, this is just for brevity
  my_layer <- layer(stat = MyStat, geom = geom, position = position, ...)
  
  ggproto(
    # Here we're making a copy of the LayerInstance resulting from the geom call
    NULL, my_layer,
    
    # We replace the stat computation step by one that appends the mapping
    # to the computed stat parameters
    compute_statistic = function(self, data, layout) {
      params <- self$stat$setup_params(data, self$stat_params)
      # The following lines are added: append the mapping to the parameters,
      # then store the full parameter list on the layer
      params[["mapping"]] <- self$computed_mapping
      self$computed_stat_params <- params
      data <- self$stat$setup_data(data, params)
      self$stat$compute_layer(data, params, layout)
    }
  )
}

When we take it for a spin we see that the mapping prints:

p <- ggplotGrob(ggplot(mpg, aes(displ, hwy, colour = drv)) + my_stat())
#> Aesthetic mapping: 
#> * `x`      -> `displ`
#> * `y`      -> `hwy`
#> * `colour` -> `drv`

Created on 2024-12-15 with reprex v2.1.1

corybrunson Dec 15, 2024
Author

OK, i can replicate this and i think i have the sense of direction to take it where it needs to go! Thank you.

This is a bit unnerving, in that the method requiring overhaul ($compute_statistic()) is not documented in {ggplot2}—correct? So this really is something that the core package is not meant to support. That's too bad—i was hoping that this could follow some kind of standard practice. Are there any thoughts on that—maybe whether all Layer methods should be documented, but more to the point whether it would be useful to categorize extensions—among other criteria—by whether they use standard versus nonstandard techniques?

(I also think i should be doing this in $setup_params() rather than in $setup_data(); it's just that my most recent trial-and-error used the latter. But that seems to pose no additional problems.)

teunbrand Dec 15, 2024
Maintainer

Yeah this definitely is not an 'advertised extension point'. The Layer ggproto object is not exported, which indicates that it is fully intended as an internal structure. My official standpoint as somebody who maintains ggplot2 is also that 'I don't endorse this', but my unofficial standpoint is that not everything fits into ggplot2's mold and you are allowed to deviate when you accept your code might become broken and you think the usefulness of the feature outweighs the unorthodoxy of your approach. At least that is how ggh4x got started.

teunbrand Dec 15, 2024
Maintainer

A variant of the approach above is that you call ggproto_parent(old_layer, self)$compute_statistic() after you've copied the computed mapping. Then you'd be unaffected by changes that ggplot2 would make in Layer$compute_statistic().

Another approach could be to append an extra class to your layer, and use a custom ggplot_add() method that sort of mirrors Layer$setup_data() to combine global and local mappings, and pass that on as a layer parameter. This is 'sanctioned' to some degree, but a more roundabout way to getting to the same state.
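The first variant might look like the sketch below, which reuses the reporter idea from earlier in this thread. It assumes that the layer exposes computed_mapping and stat_params by the time compute_statistic() runs. Both are internal fields and could change. ReporterStat, reporter_layer() and the `seen` environment are hypothetical names used only for illustration.

```r
library(ggplot2)

# Hypothetical scratch space so the stat can report what it received
seen <- new.env()

ReporterStat <- ggproto(
  "ReporterStat", StatIdentity,
  setup_data = function(self, data, params) {
    seen$mapping <- params$mapping
    data
  }
)

reporter_layer <- function(..., geom = "point", position = "identity") {
  old_layer <- layer(stat = ReporterStat, geom = geom, position = position, ...)
  ggproto(
    NULL, old_layer,
    compute_statistic = function(self, data, layout) {
      # Copy the computed mapping into the stat parameters first ...
      self$stat_params$mapping <- self$computed_mapping
      # ... then defer to the unmodified method of the original layer
      ggproto_parent(old_layer, self)$compute_statistic(data, layout)
    }
  )
}

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) + reporter_layer()
built <- ggplot_build(p)

# Names of the aesthetics that reached the stat
names(seen$mapping)
```

Because the override only touches a parameter list and then hands control back, the body of Layer$compute_statistic() is never copied into the extension.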

Answer selected by corybrunson

yjunechoe Dec 15, 2024
Collaborator

not everything fits into ggplot2's mold and you are allowed to deviate when you accept your code might become broken and you think the usefulness of the feature outweighs the unorthodoxy of your approach.

I think there's also a side to this discussion which is: "Assuming you take the route of deviating from the advertised extension mechanism and extending an unofficial extension point, what is the most principled way to maintain this?"

Adding to Teun's more specific suggestions about implementation, broadly I like to think that most basic unit-testing principles apply. So a change should be as targeted and as up-stream as possible, among other things. I think many "early" approaches to such hacks (predating ggh4x 😛) tended to go for backwards retrieval/reconstruction of information from the ggplot_built or the gtable, which made them extremely brittle to changes in ggplot internals (my hot take is they might have even been less brittle and painful to debug if they had just extended an unofficial extension point at the appropriate place upstream instead). I think it also helps to have a local setup for tests to run both on the CRAN release and the dev version of ggplot -- I do this for ggtrace so that I can anticipate and catch any breakages (for which the onus is on me to fix).

I also enjoy taking advantage of the fact that ggplot moves slow and dev discussions happen publicly. If an update were to touch a Layer method in a substantial way, I'd like to think that that discussion would start as an issue on gh at least two versions prior to its eventual implementation. IMO Layer methods, specifically, are kind of funny in that they're simultaneously the most untouchable as they are deeply hidden away in the internals by design, but due to their coreness they're also the most inert and therefore feel pretty stable to build things off of (vs., say, even ggplot_add()).

Not that I'm endorsing any of this either, but it's encouraging to see that ggplot has leaned into the layer methods as not just falling out as a byproduct of implementing the grammar in the code but also having a conceptual standing of their own, which in part seems to have emerged in a sort of a bottom-up fashion. So ex: $map_statistic() is where you "map stat to aesthetics", and being attributed high-level semantics in that way is a good sign to me that the function/method is unlikely to change their behavior in the system drastically 😄
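One way to make that kind of early warning concrete is a small canary check that fails as soon as the internal method being overridden changes shape. The sketch below is an assumption-laden example rather than an established practice. It relies on ggproto method wrappers keeping the underlying function as `f` in their environment, which is itself an internal detail of {ggplot2}.

```r
library(ggplot2)

# Any layer instance will do; geom_point() is used only as a convenient one.
layer_method <- geom_point()$compute_statistic

# Look up the arguments of the function behind the ggproto method wrapper
# (assumption: the wrapper's environment holds that function as `f`).
arg_names <- names(formals(environment(layer_method)$f))

# The override in this thread calls compute_statistic(data, layout),
# so both arguments need to exist for it to keep working.
stopifnot(all(c("data", "layout") %in% arg_names))
```

Run against both the CRAN release and the development version, a check like this would flag a signature change before users hit it.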

corybrunson Dec 15, 2024
Author

Then you'd be unaffected by changes that ggplot2 would make in Layer$compute_statistic().

That's exactly what i was going to ask next! I reviewed the extensions chapter of the book and was reminded of ggproto_parent(). Whatever the trick, i'd like to make it as robust to {ggplot2} upgrades as possible—for my own convenience, sure, but also to model good practice.

This is 'sanctioned' to some degree, but a more roundabout way to getting to the same state.

Speaking of good practice, this seems like it might provide a better general template from which to write several layers of this form (which is the plan). I'll search for examples, but are there any extensions you'd point to that do a good job writing new methods for ggplot_add()?

teunbrand Dec 15, 2024
Maintainer

but are there any extensions you'd point to that do a good job writing new methods for ggplot_add()?

I know that {gghighlight} has a complicated one out of necessity, as it needs to touch multiple aspects of a plot. Generally I think the simpler they are the better. Aside from that, I think ggplot_add() is a sort of last resort if ggplot2 doesn't provide any infrastructure for the problem you're solving. Which seems to be your case.

For the purpose of getting the mapping, you could write something like the following:

ggplot_add.my_custom_layer <- function(object, plot, object_name) {
  mapping <- plot$mapping
  mapping[names(object$mapping)] <- object$mapping
  object$stat_params$mapping <- mapping
  NextMethod()
}

EvaMaeRey · 2024-12-15T05:50:40Z

EvaMaeRey
Dec 15, 2024
Maintainer

Interesting problem! I can't contribute to an answer, but I did want to offer my preference for this kind of reprex: use an existing user-facing function with the new Stat when possible, i.e. geom_segment(stat = StatProj), instead of writing your own user-facer. It just saves some typing (or search, copy, paste). Here is an example with a less interesting StatProjoutcome, compared with StatProj .... 🎈

library(tidyverse)

# Simplify the Motor Trends data to two predictors legible at aspect ratio 1.
mtcars |>
  transform(hp00 = hp/100) |>
  subset(select = c(mpg, hp00, wt)) ->
subcars

head(subcars)
#>                    mpg hp00    wt
#> Mazda RX4         21.0 1.10 2.620
#> Mazda RX4 Wag     21.0 1.10 2.875
#> Datsun 710        22.8 0.93 2.320
#> Hornet 4 Drive    21.4 1.10 3.215
#> Hornet Sportabout 18.7 1.75 3.440
#> Valiant           18.1 1.05 3.460

# Here's the setup; i want to project the data points onto the gradient axis.
ggplot(subcars) +
  aes(x = hp00, y = wt) +
  coord_equal() +
  geom_point()

compute_group_projoutcome <- function(data, scales, na.rm = FALSE) {
    
    # arbitrary values of computed aesthetics
    res <- transform(
      data,
      xend = NA_real_,
      yend = NA_real_
    )
    # empty initialized output
    res <- res[c(), , drop = FALSE]
    
    lm(outcome ~ x + y, data) |>
         coefficients() |>
          as.list() |> as.data.frame() ->
    gradient
    
    # no referent means no projection
    if (is.null(gradient) || ! is.data.frame(gradient)) return(res)
    
    # compute and collect projections of `data` onto `referent` rows
    inertias <- gradient$x^2 + gradient$y^2
    for (i in seq(nrow(gradient))) {
      data$dots <- data$x * gradient$x[i] + data$y * gradient$y[i]
      res_i <- transform(
        data,
        xend = dots / inertias[i] * gradient$x[i],
        yend = dots / inertias[i] * gradient$y[i]
      )
      res <- rbind(res, res_i)
    }
    
    res
}


subcars %>% 
  rename(x = hp00, y = wt, outcome = mpg) %>% 
  compute_group_projoutcome() %>% 
  head()
#>                   outcome    x     y      dots     xend     yend
#> Mazda RX4            21.0 1.10 2.620 -13.65494 1.726263 2.106873
#> Mazda RX4 Wag        21.0 1.10 2.875 -14.64379 1.851273 2.259445
#> Datsun 710           22.8 0.93 2.320 -11.95145 1.510907 1.844035
#> Hornet 4 Drive       21.4 1.10 3.215 -15.96225 2.017954 2.462876
#> Hornet Sportabout    18.7 1.75 3.440 -18.90000 2.389346 2.916153
#> Valiant              18.1 1.05 3.460 -16.75345 2.117978 2.584954

StatProjoutcome <- ggproto("StatProjoutcome", Stat,
  required_aes = c("x", "y", "outcome"),
  compute_group = compute_group_projoutcome
)

last_plot() +
  geom_segment(stat = StatProjoutcome, 
               aes(outcome = mpg))

Created on 2024-12-14 with reprex v2.1.0

2 replies

corybrunson Dec 15, 2024
Author

This took me a moment to digest—the user-facer you're referring to is stat_proj(), right? That's good advice, thank you—i am very much in the habit of pairing new Stat* and Geom* ggprotos with stat_*() and geom_*() ... layer-constructors? But they add complication that may not be necessary.

In this case, i experimented a bit with steps inside the stat_*(), like binding the inherited data to the referent data, though i think that failed because the inherited data was already mapped. One question i thought about asking previously is whether it's advisable to perform operations inside stat_*() before layer() is called; it seemed risky and i couldn't find examples of it. So, especially if that's generally discouraged, then it makes sense to limit experimental code (and reproducible examples) to the ggprotos if possible.

EvaMaeRey Dec 15, 2024
Maintainer

the user-facer you're referring to is stat_proj()

yes.

In cases like these, you should of course pay most attention to @teunbrand's advice!
This subthread is really tangential to your problem.

I do like how he's paring back the layer definition:

my_layer <- layer(stat = MyStat, geom = geom, position = position, ...)

... you are right that geom_segment could introduce some unknowns.

A more pure version of the ad-hoc, visual test of the Stat (where us non-profies are likely to start) that skips defining your own layer function might look like this:

ggplot(subcars, aes(x = hp00, y = wt)) +
  coord_equal() +
  geom_point() +
  geom_segment(data = grad, aes(xend = 0, yend = 0)) +
  layer(geom = "segment", stat = StatProj, position = "identity", 
            params = list(referent = transform(grad, x = hp00, y = wt)))

And uber tangential: I used to define functions first, before getting a visual sense of the Stats that I'd write. I was really glad to change my workflow to do visual inspection before defining my user-facing function. I recommend the geom_*(stat = ) as well as the layer(geom = , stat = , position = ) moves here in an introductory context:
https://evamaerey.github.io/easy-geom-recipes/recipe1means.html#test-stat-group-wise-behavior

In the introductory material I start with a geom_*, a less pure approach, you don't need to type as much, and are dealing with familiar territory - trying to limit what new is thrown at people.

geom_segment(stat = StatProj, referent = transform(grad, x = hp00, y = wt))

I think geom_segment (like geom_point) is actually pretty vanilla definition, so maybe not a terrible place to start, and then move to layer() if there are concerns.

whether it's advisable to perform operations inside stat_*() before layer() is called

likely non-conventional, though I don't know if considered 'risky'.
