The Tidyverse Curse

I’ve just finished a major overhaul of my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants in my workshops, and how those complaints can often be mitigated. Here’s the only new section:

The Tidyverse Curse

There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners, as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (e.g. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
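To make two of those “but not always” cases concrete, here is a minimal sketch, assuming the dplyr and tibble packages are installed. The object names (df, tbl, by_cyl) are mine and purely illustrative.

```r
library(dplyr)
library(tibble)

# 1. Single-bracket column extraction behaves differently.
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

class(df[, "x"])   # a base data frame drops down to a plain vector
class(tbl[, "x"])  # a tibble stays a one-column tibble

# 2. Grouping is "sticky": it persists until you call ungroup().
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg))  # mean() is computed within each cyl group

is_grouped_df(by_cyl)           # still grouped after mutate()
is_grouped_df(ungroup(by_cyl))  # grouping removed
```

The first difference is the kind of thing that can trip up a non-tidyverse function expecting a vector; the second is why a later summary may be silently computed by group when you wanted it overall.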

A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example with the built-in mtcars data set and R’s built-in print function:

> print(mtcars)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
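The 10-row default is adjustable. As a small sketch, assuming the tibble package is installed (cars_tbl is just an illustrative name), the print method for tibbles accepts n and width arguments:

```r
library(tibble)

cars_tbl <- as_tibble(mtcars)

# Show only the first 5 rows instead of the default 10.
print(cars_tbl, n = 5)

# Show every row, and every column even if the output has to wrap.
print(cars_tbl, n = Inf, width = Inf)
```

That is handy, but it is also one more pair of arguments to remember that base R’s print method for data frames does not share.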

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) : 
 could not find function "rownames_to_column"

Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone.)

> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
   rowname             mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4          21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2 Mazda RX4 Wag      21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3 Datsun 710         22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4 Hornet 4 Drive     21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5 Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6 Valiant            18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7 Duster 360         14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8 Merc 240D          24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9 Merc 230           22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10 Merc 280           19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.
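A related detail is the order of the two steps. As far as I understand, more recent tibble releases drop row names during conversion unless told otherwise, so moving the names into a column before converting is the safer habit. The sketch below assumes tibble 2.0 or later for the rownames argument; cars1 and cars2 are illustrative names.

```r
library(tibble)

# Option 1: move the row names into a column first, then convert.
cars1 <- as_tibble(rownames_to_column(mtcars, var = "rowname"))

# Option 2: let as_tibble() do both at once
# (assumes a tibble version that supports the rownames argument).
cars2 <- as_tibble(mtcars, rownames = "rowname")

# Both keep the car names as an ordinary character column.
head(cars1$rowname, 3)
identical(cars1$rowname, cars2$rowname)
```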

In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them because they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
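Here is one way that round trip might look, as a sketch rather than the only approach. It assumes the tibble package (version 2.0 or later for the rownames argument); cars_tbl, cars_df, and cars_df2 are illustrative names.

```r
library(tibble)

# Start from a tibble that carries the car names in a column called rowname.
cars_tbl <- as_tibble(mtcars, rownames = "rowname")

# Tidyverse route: convert to a plain data frame, then move the column
# back into the row names position.
cars_df <- column_to_rownames(as.data.frame(cars_tbl), var = "rowname")

# Base R route: assign the row names yourself, then drop the column.
cars_df2 <- as.data.frame(cars_tbl)
row.names(cars_df2) <- cars_df2$rowname
cars_df2$rowname <- NULL

head(rownames(cars_df), 3)
```

Either way, the result is an ordinary data frame again, so functions that look for row names will find them.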

Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’ “ID rowname;” statement. That’s less to learn.

This isn’t a defect of the tidyverse; it’s the result of an architectural decision on the part of the original language designers, and it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
 lyrics
 <chr>
1 a long long time ago i can still remember how that music used

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
 ... <truncated>
1 a long long time ago i can still remember how that music used 
to make me smile and i knew if i had my chance that i could make 
those people dance and maybe theyd be happy for a while but 
february made me shiver with every paper id deliver bad news on 
the doorstep i couldnt take one more step i cant remember if i cried 
...
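Another option is to pull the column out as a plain character vector and wrap it yourself. The sketch below uses a made-up stand-in for songs_df, since the real data set is not shown here; dplyr’s pull and base R’s strwrap and writeLines do the work.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for songs_df; the real lyrics data is not shown here.
songs_df <- tibble(
  song   = c("american pie", "short song"),
  lyrics = c(
    paste(rep("a long long time ago i can still remember", 5), collapse = " "),
    "la la la"
  )
)

full_lyrics <- songs_df %>%
  filter(song == "american pie") %>%
  pull(lyrics)  # returns a plain character vector, not a tibble

# Wrap the text at 60 characters per line for easier reading.
writeLines(strwrap(full_lyrics, width = 60))
```

Because pull returns a vector rather than a tibble, the tibble print method and its one-line truncation never come into play.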

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.


17 Responses to The Tidyverse Curse

  1. Mark Adamson says:

    Nice article. It feels like the tidyverse is a very useful dialect of R. I am happily becoming a native speaker of this dialect! I’m looking forward to the improvements coming to non-standard evaluation, as I think at the moment that is still quite confusing even for experienced developers.

  2. El-ad David Amir says:

    Agreed that `dplyr` and its friends are a brand new world as far as R is concerned.

    One possible solution to forgetting a package is to call `library(tidyverse)` rather than loading each package separately.

    • Bob Muenchen says:

      Hi El-ad,

      That’s definitely the way to go now that the tidyverse package exists. The dplyr package preceded it by a couple of years though, so many blog posts still exist showing just dplyr. I wanted to point out what happens if that’s all you load.

      Cheers,
      Bob

  3. Pingback: The Tidyverse Curse – Mubashir Qasim

  4. ntguardian says:

    Interesting post. You raise some interesting points. There’s a lot of R syntax that does not make sense but still exists because of backwards compatibility. I think the people creating R don’t want to pull a Python where a major break is made (from Python 2.7 to Python 3) and old code becomes invalidated, but the price for that is more complicated systems.

    So you think it’s still worth teaching the tidyverse to beginners? I did that the last time I taught the R lab at the University of Utah Math dept. Clearly the base system should be taught because that functionality is guaranteed, whereas tidyverse compatibility is not, and it is still a package that is not automatically installed and included with R. And as you said, the tidyverse is generally easier, but it makes R overall more difficult to learn since it is learning yet another way to do essentially the same thing.

    • Bob Muenchen says:

      Hi ntguardian,

      That’s a question I wrestled with for a long time. What I teach now is a mix. For example, I start with a complete but very simple program that reads data with readr’s read_csv, then modifies it with dplyr’s mutate, then does a couple of simple analyses. I then show what the program would look like using base R. I don’t really review the base R version except to point out how much more complex read.csv is, and to note that the program is filled with additional “mydata$” prefixes all over the place. I do cover lapply & sapply in my first workshop. Later when I get to a workshop on data management, I cover mutate_each and summarise_each instead. Whenever base R does something much easier, such as copy all the names from one data frame to another, I teach that along with dplyr’s rename function which is much easier when changing just a few names. I have the benefit that I can tell my workshop participants that all my tidyverse examples are repeated using base R in my books, R for SAS and SPSS Users, and R for Stata Users.

      Cheers,
      Bob

  5. Koma says:

    Why not glimpse()?

    • Bob Muenchen says:

      Hi Koma,

      I had a really hard time deciding what examples to include. glimpse vs. str is a contender. I thought about including an example of processing the output of group_by models. Learning the built-in approach is really complex. plyr made it somewhat easier (though slower), then dplyr made it way faster but blew everyone’s mind by storing entire models in the elements of a vector, then broom made it WAY easier putting the output back into a data frame that anyone could deal with, and now we have purrr making it, well, I’m still not sure about that yet! I finally decided that including examples of all that needed a separate post.

      Cheers,
      Bob

  6. Petr Simecek says:

    At rstudio::conf Hadley talked about two dialects of R.

    I would be careful about beginners. I meet more and more people that started directly with ggplot2 and tidyverse and they find Hadley’s dialect easier than base R dialect. Honestly, I have been using R for 10+ years and base graphics would always be more intuitive to me (and I know ggplot2 pretty well) and I prefer read.csv to read_csv. But beginners often have a different opinion. One day I might find that I am a dinosauR.

    On the other hand, I have met a guy using R in production. He told me that he needs code that stays the same for 5+ years. That is why he does not use dplyr or any of the tidyverse; it is still changing too much.

    • Bob Muenchen says:

      Hi Petr,

      That’s a good point. If you’re in a production shop and many people might have to support the code you write, then base R is a safe bet. It’s something that everyone would be expected to know. We’re in a university setting, where an analysis is seldom looked at again, so if my code breaks due to package changes, I just learn the new variation and fix it. That approach would drive the programmers in many shops crazy.

      Cheers,
      Bob

  7. Pingback: The Tidyverse Curse – Curated SQL

  8. Marcos F says:

    I’ve been using R in production for 3 years, and we use the latest syntax (functional pipes using tidyverse, purrr…).

    Here is how I would have written your same queries.
    For example:
    ```{r}
    library(tidyverse)
    mtcars %>%
      rownames_to_column() %>%
      as.tbl()

    songs_df %>%
      filter(song == "american pie") %>%
      .[["lyrics"]]
    ```

    • Bob Muenchen says:

      Hi Marcos,

      The order in your first example definitely makes more sense than what I showed. I was only doing it that way as I’ve had many workshop participants try something like that and wonder where the row names went. I’ve never used dplyr::as.tbl since dplyr::as_data_frame is easy to relate to the built-in as.data.frame. Checking with Hadley & Garrett’s book, R for Data Science, I see they recommend tibble::as_tibble.

      Your second example is interesting as it shows a nice blend of tidyverse and base R. Once I start down the tidyverse road, I seldom get back to thinking of brackets again. It’s good to be reminded that they fit in well!

      Cheers,
      Bob

  9. Hilary Browning says:

    Really interesting post – thank you for sharing! I am definitely a dinosaur and really dislike almost the entirety of the tidyverse. I need to get over it though since the inherent dislike of new things has helped exactly zero people in tech, ever. In addition, I can’t stand seeing outdated Python idioms/style, so not sure what the difference is there – perhaps it has to do with base R being my first ever introduction to programming. Affection for old acquaintances dies hard…

    [Although in support of base R, I really must mention that head(mtcars, n=10) would have gotten you the print you wanted – no tibble needed. And achieving sauropod status in 3..2..1 🙂 ]

    • Bob Muenchen says:

      Hi Hilary,

      Thanks for your comments. The head function does indeed show a nice subset of rows, but if you have lots of columns, it gets quite messy! The tidy approach looks at the width of your screen & shows you as many columns as will fit.

      Cheers,
      Bob

  10. Pingback: #067: R You Considering Python? - The Digital Analytics Power Hour
