dplyr 1.0.0

2020/04/07

Introduction

dplyr 1.0.0 is coming soon! I’m really excited about this release. I think rowwise and across are excellent new functions that make common operations feel more intutive. You can try them out for yourself by installing the development version from Github until there is a CRAN release.

This post is just me playing around with some of the new features.

Stop Repeating Names

I’ve probably typed something like this a thousand times:

library(ggplot2)
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

mtcars <- as_tibble(mtcars)

mtcars %>%
  mutate(vs = as.factor(vs)) %>%
  mutate(am = as.factor(am))
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec vs    am     gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5 0     1         4     4
##  2  21       6  160    110  3.9   2.88  17.0 0     1         4     4
##  3  22.8     4  108     93  3.85  2.32  18.6 1     1         4     1
##  4  21.4     6  258    110  3.08  3.22  19.4 1     0         3     1
##  5  18.7     8  360    175  3.15  3.44  17.0 0     0         3     2
##  6  18.1     6  225    105  2.76  3.46  20.2 1     0         3     1
##  7  14.3     8  360    245  3.21  3.57  15.8 0     0         3     4
##  8  24.4     4  147.    62  3.69  3.19  20   1     0         4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9 1     0         4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3 1     0         4     4
## # … with 22 more rows

Here’s the general pattern: I’m applying a function to a column, but I want the column name to stay the same.

It also happens in summaries, where you want to find column means, but you just want the columns to be named the same:

mtcars %>%
  summarise(disp = mean(disp), hp = mean(hp))
## # A tibble: 1 x 2
##    disp    hp
##   <dbl> <dbl>
## 1  231.  147.

You are always repeating

column_name = func(column_name)

dplyr::across is a way to get around that. In mutates and summarises, by default across keeps the same column name after you apply a function:

mtcars %>%
  mutate(across(vs, as.factor))
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec vs       am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5 0         1     4     4
##  2  21       6  160    110  3.9   2.88  17.0 0         1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6 1         1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4 1         0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0 0         0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2 1         0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8 0         0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20   1         0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9 1         0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3 1         0     4     4
## # … with 22 more rows

This is a nice, but small benefit. The cool thing is that you can easily apply this to any number of columns you want using tidyselect and functions.

mtcars %>%
  summarise(across(c(disp, hp), mean))
## # A tibble: 1 x 2
##    disp    hp
##   <dbl> <dbl>
## 1  231.  147.
mtcars %>%
  summarise(across(is.numeric & -disp & -hp, mean))
## # A tibble: 1 x 9
##     mpg   cyl  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  20.1  6.19  3.60  3.22  17.8 0.438 0.406  3.69  2.81

Summary Rows

One nice thing about Excel is summary rows.

For example, say I have the following summarised table:

diamonds_sum <- diamonds %>%
  mutate(across(cut, as.character)) %>%
  group_by(cut) %>%
  summarise(across(price, sum), across(carat, n_distinct), n = n())

diamonds_sum
## # A tibble: 5 x 4
##   cut          price carat     n
## * <chr>        <int> <int> <int>
## 1 Fair       7017600   185  1610
## 2 Good      19275009   199  4906
## 3 Ideal     74513487   232 21551
## 4 Premium   63221498   251 13791
## 5 Very Good 48107623   231 12082

In Excel, I could easily add a total row for each column. You can do it with dplyr, but it doesn’t feel great.

sum_of_sum <- diamonds_sum %>%
  summarise(across(-cut, sum))

sum_of_sum
## # A tibble: 1 x 3
##       price carat     n
##       <int> <int> <int>
## 1 212135217  1098 53940
diamonds_sum %>%
  bind_rows(
    sum_of_sum %>%
      mutate(cut = "total")
  )
## # A tibble: 6 x 4
##   cut           price carat     n
## * <chr>         <int> <int> <int>
## 1 Fair        7017600   185  1610
## 2 Good       19275009   199  4906
## 3 Ideal      74513487   232 21551
## 4 Premium    63221498   251 13791
## 5 Very Good  48107623   231 12082
## 6 total     212135217  1098 53940

It’s a somewhat common problem. Here’s a StackOverflow post that documents various solutions.

Funny enough, you can actually do this in the new dplyr using rowwise. The intution here is to tranpose the dataframe and think of price, carat and n as rows. Then we can add a rowwise summary, tranpose back and get our total row!

So we start by transposing the dataframe:

transpose_df <- function(x, col, name) {
  x %>%
    pivot_longer(-{{ col }}, names_to = name) %>%
    pivot_wider(names_from = {{ col }})
}

diamonds_sum %>%
  transpose_df(cut, "feature")
## # A tibble: 3 x 6
##   feature    Fair     Good    Ideal  Premium `Very Good`
##   <chr>     <int>    <int>    <int>    <int>       <int>
## 1 price   7017600 19275009 74513487 63221498    48107623
## 2 carat       185      199      232      251         231
## 3 n          1610     4906    21551    13791       12082

Then use rowwise to add a summary column for each feature:

diamonds_sum %>%
  transpose_df(cut, "feature") %>%
  rowwise(feature) %>%
  mutate(total = sum(c_across()))
## # A tibble: 3 x 7
## # Rowwise:  feature
##   feature    Fair     Good    Ideal  Premium `Very Good`     total
##   <chr>     <int>    <int>    <int>    <int>       <int>     <int>
## 1 price   7017600 19275009 74513487 63221498    48107623 212135217
## 2 carat       185      199      232      251         231      1098
## 3 n          1610     4906    21551    13791       12082     53940

Then we can transpose again to get back to our original shape:

diamonds_sum %>%
  transpose_df(cut, "feature") %>%
  rowwise(feature) %>%
  mutate(total = sum(c_across())) %>%
  transpose_df(feature, "cut")
## # A tibble: 6 x 4
##   cut           price carat     n
##   <chr>         <int> <int> <int>
## 1 Fair        7017600   185  1610
## 2 Good       19275009   199  4906
## 3 Ideal      74513487   232 21551
## 4 Premium    63221498   251 13791
## 5 Very Good  48107623   231 12082
## 6 total     212135217  1098 53940

This is also instructive to see how we can tranpose data using the pivot_* family of functions, and why it’s not so straight-forward to arbitrarily tranpose data. It’s easy to lose information!