I’ve just finished a major overhaul of my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants in my workshops, and how those complaints can often be mitigated. Here’s the only new section:
The Tidyverse Curse
There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners, as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
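To make the grouped-tibble caveat concrete, here is a minimal sketch using the built-in mtcars data. The column names share_in_group and share_overall are made up for illustration. The same mutate expression gives different answers depending on whether the tibble is still grouped:

```r
library(dplyr)

# While the data are grouped by cyl, sum(mpg) is computed within each group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  mutate(share_in_group = mpg / sum(mpg))

# After ungroup(), the identical expression uses all 32 rows at once.
overall <- by_cyl %>%
  ungroup() %>%
  mutate(share_overall = mpg / sum(mpg))

# The in-group shares add up to 1 within every cyl level.
group_totals <- by_cyl %>%
  summarise(total = sum(share_in_group))
print(group_totals)
```

Forgetting the ungroup() step is an easy mistake to make, since nothing in the code that follows it looks any different.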
A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example that prints the built-in mtcars data set with R’s built-in print function:
> print(mtcars)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...
We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).
> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:
> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) :
  could not find function "rownames_to_column"
Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone).
> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
             rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
               <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.
In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them because they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
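As a rough sketch of that round trip, the code below moves the car names into a rowname column and then restores them both ways. The object names (cars_tbl, restored_tibble_way, restored_base_way) are just illustrative, and I convert back to a plain data frame first, since tibbles are not meant to carry row names:

```r
library(tibble)

# Move the row names into an ordinary column, then make it a tibble.
cars_tbl <- as_tibble(rownames_to_column(mtcars, var = "rowname"))

# Option 1: tibble's helper puts the column back into the row names position.
restored_tibble_way <- column_to_rownames(as.data.frame(cars_tbl),
                                          var = "rowname")

# Option 2: base R does the same thing in three steps.
restored_base_way <- as.data.frame(cars_tbl)
row.names(restored_base_way) <- restored_base_way$rowname
restored_base_way$rowname <- NULL

print(head(restored_tibble_way, n = 3))
```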
Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’ “ID rowname;” statement. That’s less to learn.
This isn’t a defect of the tidyverse; it’s the result of an architectural decision on the part of the original language designers, and it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.
Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
                                                          lyrics
                                                           <chr>
1 a long long time ago i can still remember how that music used
The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
... <truncated>
1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...
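Another possible route is to pull the column out of the tibble as a plain character vector and print that. The sketch below uses a tiny stand-in for songs_df that I made up for illustration (the real data set is not shown here), and wraps the text with base R’s strwrap so it is easier to read:

```r
library(dplyr)

# Hypothetical stand-in for songs_df: one long string and one short one.
songs_df <- tibble(
  song   = c("american pie", "short song"),
  lyrics = c(
    paste(rep("a long long time ago i can still remember", 5), collapse = " "),
    "la la la"
  )
)

# pull() returns the column itself, so tibble printing rules no longer apply.
full_lyrics <- songs_df %>%
  filter(song == "american pie") %>%
  pull(lyrics)

# Wrap at 60 characters and print one line at a time.
writeLines(strwrap(full_lyrics, width = 60))
```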
These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.
Nice article. It feels like the tidyverse is a very useful dialect of R. I am happily becoming a native speaker of this dialect! I’m looking forward to the improvements coming to non standard evaluation as I think at the moment that is still quite confusing even for experienced developers.
Hi Mark,
Yes, despite the nit-picks, I love it!
Cheers,
Bob
Agreed that `dplyr` and its friends are a brand new world as far as R is concerned.
One possible solution to forgetting a package is to call `library(tidyverse)` rather than loading each package separately.
Hi El-ad,
That’s definitely the way to go now that the tidyverse package exists. The dplyr package preceded it by a couple of years though, so many blog posts still exist showing just dplyr. I wanted to point out what happens if that’s all you load.
Cheers,
Bob
Interesting post. You raise some interesting points. There’s a lot of R syntax that does not make sense but still exists because of backwards compatibility. I think the people creating R don’t want to pull a Python where a major break is made (from Python 2.7 to Python 3) and old code becomes invalidated, but the price for that is more complicated systems.
So you think it’s still worth teaching the tidyverse to beginners? I did that the last time I taught the R lab at the University of Utah Math dept. Clearly the base system should be taught because that functionality is guaranteed, whereas tidyverse compatibility is not, and it is still a package that is not automatically installed and included with R. And as you said, the tidyverse is generally easier, but it makes R overall more difficult to learn since it is learning yet another way to do essentially the same thing.
Hi ntguardian,
That’s a question I wrestled with for a long time. What I teach now is a mix. For example, I start with a complete but very simple program that reads data with readr’s read_csv, then modifies it with dplyr’s mutate, then does a couple of simple analyses. I then show what the program would look like using base R. I don’t really review the base R version except to point out how much more complex read.csv is, and to note that the program is filled with additional “mydata$” prefixes all over the place. I do cover lapply & sapply in my first workshop. Later when I get to a workshop on data management, I cover mutate_each and summarise_each instead. Whenever base R does something much more easily, such as copying all the names from one data frame to another, I teach that along with dplyr’s rename function, which is much easier when changing just a few names. I have the benefit that I can tell my workshop participants that all my tidyverse examples are repeated using base R in my books, R for SAS and SPSS Users, and R for Stata Users.
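Here is a small sketch of the kind of side-by-side comparison I mean. It is not the actual workshop program; the data frame and the variable names (q1, q2, total, mean_q, pretest) are invented for illustration:

```r
library(dplyr)

# Hypothetical survey data standing in for the workshop file.
mydata <- data.frame(q1 = c(1, 2, 5), q2 = c(3, 4, 4))

# dplyr: variables are referred to by name inside mutate().
tidy_version <- mydata %>%
  mutate(total = q1 + q2,
         mean_q = total / 2)

# Base R: every reference needs its "mydata$"-style prefix.
base_version <- mydata
base_version$total  <- base_version$q1 + base_version$q2
base_version$mean_q <- base_version$total / 2

# Base R is simpler for copying ALL names from one data frame to another...
other <- data.frame(a = 1:3, b = 4:6)
names(other) <- names(mydata)

# ...while dplyr's rename() is simpler for changing just one or two.
renamed <- mydata %>% rename(pretest = q1)
```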
Cheers,
Bob
Why not glimpse()?
Hi Koma,
I had a really hard time deciding what examples to include. glimpse vs. str is a contender. I thought about including an example of processing the output of group_by models. Learning the built-in approach is really complex. plyr made it somewhat easier (though slower), then dplyr made it way faster but blew everyone’s mind by storing entire models in the elements of a vector, then broom made it WAY easier putting the output back into a data frame that anyone could deal with, and now we have purrr making it, well, I’m still not sure about that yet! I finally decided to include examples of all that needed a separate post.
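For a taste of the broom step, here is one way it might look today. This is only a sketch: it assumes the broom package is installed, it uses dplyr’s group_modify (which is newer than the approaches described above), and the model mpg ~ wt is an arbitrary choice for illustration:

```r
library(dplyr)
library(broom)

# Fit one regression per cyl group; tidy() turns each fitted model
# into a small data frame with one row per coefficient.
models_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ tidy(lm(mpg ~ wt, data = .x)))

print(models_by_cyl)
```

The result is an ordinary grouped data frame with one row per group and term, so the usual filter and arrange verbs apply to the model output.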
Cheers,
Bob
At rstudio::conf Hadley talked about two dialects of R.
I would be careful about beginners. I meet more and more people who started directly with ggplot2 and the tidyverse, and they find Hadley’s dialect easier than the base R dialect. Honestly, I have been using R for 10+ years and base graphics will always be more intuitive to me (and I know ggplot2 pretty well), and I prefer read.csv to read_csv. But beginners often have a different opinion. One day I might find that I am a dinosauR.
On the other hand, I have met a guy using R in production, and he told me that he needs code that stays the same for 5+ years. That is why he does not use dplyr or any of the tidyverse; it is still changing too much.
Hi Petr,
That’s a good point. If you’re in a production shop and many people might have to support the code you write, then base R is a safe bet. It’s something that everyone would be expected to know. We’re in a university setting, where an analysis is seldom looked at again, so if my code breaks due to package changes, I just learn the new variation and fix it. That approach would drive the programmers in many shops crazy.
Cheers,
Bob
For R in production and long-term syntax stability, you might use either packrat in conjunction with a project, or Docker. Packrat has less overhead: It just provides a package folder within a project, so that you can keep it as-is (i.e. not install updates), while still updating your regular R installation’s packages.
Docker is sort of a virtual machine, which can also be kept stable over time, once it’s running.
I’ve been using R in production for 3 years, and we use the latest syntax (functional pipes using tidyverse, purrr…).
Here is how I would have written your queries.
For example:
```{r}
library(tidyverse)

mtcars %>%
  rownames_to_column() %>%
  as.tbl()

songs_df %>%
  filter(song == "american pie") %>%
  .[["lyrics"]]
```
Hi Marcos,
The order in your first example definitely makes more sense than what I showed. I was only doing it that way as I’ve had many workshop participants try something like that and wonder where the row names went. I’ve never used dplyr::as.tbl since dplyr::as_data_frame is easy to relate to the built-in as.data.frame. Checking with Hadley & Garrett’s book, R for Data Science, I see they recommend tibble::as_tibble.
Your second example is interesting as it shows a nice blend of tidyverse and base R. Once I start down the tidyverse road, I seldom get back to thinking of brackets again. It’s good to be reminded that they fit in well!
Cheers,
Bob
Really interesting post – thank you for sharing! I am definitely a dinosaur and really dislike almost the entirety of the tidyverse. I need to get over it though since the inherent dislike of new things has helped exactly zero people in tech, ever. In addition, I can’t stand seeing outdated Python idioms/style, so not sure what the difference is there – perhaps it has to do with base R being my first ever introduction to programming. Affection for old acquaintances dies hard…
[Although in support of base R, I really must mention that head(mtcars, n=10) would have gotten you the print you wanted – no tibble needed. And achieving sauropod status in 3..2..1 🙂 ]
Hi Hilary,
Thanks for your comments. The head function does indeed show a nice subset of rows, but if you have lots of columns, it gets quite messy! The tidy approach looks at the width of your screen & shows you as many columns as will fit.
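The two approaches side by side, as a quick sketch. The column name "car" is my own choice, and the rownames argument of as_tibble assumes a reasonably recent version of the tibble package:

```r
library(tibble)

# Base R: the first 10 rows, with row names intact, but every column is printed.
first_ten <- head(mtcars, n = 10)
print(first_ten)

# tibble: row names become a regular column, and printing adapts to the
# width of the console, listing any columns that did not fit.
cars_tbl <- as_tibble(mtcars, rownames = "car")
print(cars_tbl)
```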
Cheers,
Bob
I thought it was just me thinking the tidyverse seemed a completely foreign language (I can do reasonably well in “classic” R). It is not uncommon to introduce a very different language alongside an existing one, such as PROC SQL in SAS, but in that case I feel more comfortable since SQL is a more standard and widely used language. Those who are familiar with SQL can get on with SAS quickly, and those who were not familiar with SQL initially (like myself) would, I suppose, be happy to learn it, since SQL is useful in the data analytics field anyway. I hope R could have something like that.
Hi Chao,
I agree that the introduction of the tidyverse to R is very analogous to the addition of PROC SQL to SAS. Many of the tidyverse (or more specifically dplyr) functions have direct analogs to SQL. That’s what makes it so easy for dplyr to translate its commands into SQL for high-speed execution inside a relational database. A more direct implementation of SQL is in the sqldf package. Like PROC SQL, it allows you to use actual SQL commands on any data frame. Below are my notes on it from one of my workshops.
Cheers,
Bob
#---SQL1: THE sqldf PACKAGE---
#
# dplyr commands are very SQL-like but
# follow R syntax. They can generate
# SQL so it executes in-database, but
# it doesn’t let you use SQL directly
#
# The sqldf package sends data frames
# to SQLite, allowing you to use SQL
# against a data frame
#
# Speed is good
#
# Support is very good for standard SQL,
# but it lacks vendor-specific commands
#---SQL2: PRINTING A DATA FRAME---
#
setwd("~/R4DATA")
load("mydata100.RData")
library("sqldf")
sqldf("select * from mydata100")
#---SQL3: SELECTING AND SORTING---
sqldf("select workshop, gender, q4
       from mydata100
       where workshop = 'R'
       order by gender")
#---SQL4: AGGREGATING BY GENDER---
#
sqldf("select gender as Sex,
       avg(q4) 'Mean Q4'
       from mydata100
       group by gender")
#---SQL5: NOTE KEY SYNTAX DIFFERENCES---
#
# SQL, unlike R, uses...
#
# No commas between arguments (clauses)
# Uses "and", "or", "not"
# rather than "&", "|", "!"
# Single "=" not "==" for equivalence
# Single quotes around 'strings'
# Ignores case of text
# Uses NULL instead of NA for missing
# For more, see http://www.burns-stat.com/translating-r-sql-basics/
# help("sqldf") has good examples
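For comparison, the SQL3 and SQL4 queries above could be written with dplyr roughly as follows. The tiny mydata100 built here is a made-up stand-in so the sketch runs on its own; the real workshop file has more rows and variables:

```r
library(dplyr)

# Hypothetical stand-in for the workshop data set.
mydata100 <- data.frame(
  workshop = c("R", "SAS", "R", "R"),
  gender   = c("m", "f", "f", "m"),
  q4       = c(4, 5, 3, 2),
  stringsAsFactors = FALSE
)

# SQL3 analog: "where" is filter, "select" is select, "order by" is arrange.
selected <- mydata100 %>%
  filter(workshop == "R") %>%
  select(workshop, gender, q4) %>%
  arrange(gender)

# SQL4 analog: "group by" plus avg() is group_by plus summarise.
by_gender <- mydata100 %>%
  group_by(gender) %>%
  summarise(mean_q4 = mean(q4))

print(selected)
print(by_gender)
```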
Great article, thanks!
Teaching R for beginners, some parts of the tidyverse seem very helpful to me, such as tidy data (each row is an observation, each column is a variable, each cell is a value), the pipe, and the basic verbs of dplyr. You can do a lot with a few R commands and it feels very “natural”. But plotting with ggplot gets complicated for beginners (that was my experience).
But showing data as a plot (instead of tables) is so much more motivating for beginners. Therefore I started to use a very simple package for plotting in a beginner course: the explore package (which fits into the tidyverse).
If you are coming from SPSS/SAS with a nice GUI, it may be very frustrating writing a lot of code to do “simple” things like creating a meaningful plot. It is motivating to give beginners an instant success if they start using R. The explore package offers even a simple GUI for a quick start, but after a while beginners shift to (more detailed) code naturally.
library(tidyverse)
library(explore)
## explore a dataset (interactive with a GUI)
explore(iris)
## explore a variable of a dataset
## you don’t need to know in advance the type of the variable
## (in base R you had to choose between bar and hist)
iris %>% explore(Sepal.Length)
iris %>% explore(Species)
## explore the relationship between a variable
## of a dataset and a target-variable
## you don’t need to know in advance the type of the variable
## and the type of the target variable
iris %>% explore(Sepal.Length, target = Species)
## explore all variables of a dataset
## you don’t need to know in advance the type of the variables
## and the type of the target variable
iris %>% explore_all(target = Species)
Hi Mintzgaertnerin,
The simplicity of your explore package reminds me of the ggformula package. That uses R’s formula interface to simplify getting plots from ggplot2. There’s also a companion package “mosaic” that adds the formula interface (along with the “data=” argument) to R’s built-in stat functions.
Cheers,
Bob
My advice as a data scientist with over 5 years experience. DO NOT learn the tidyverse. Learn a language, not a reinterpretation of one. This means you will not be tied down to one package and one design pattern i.e. one way of doing things.
Hi Sean,
Thanks for leaving your view.
Cheers,
Bob
With all due respect, dplyr “grammar” is shit. It is even worse for beginners, as they end up not learning R or dplyr…
Hi Santo,
A significant proportion of R users agree with you. I like the way Norm Matloff describes it: https://github.com/matloff/TidyverseSkeptic.
Cheers,
Bob
Thanks for the link; interesting perspective. I just took an IPSA (poli sci) workshop on regression and, boy, did the other students (with no coding background) suffer because the TA decided to use tidyhell. Running scripts is all fun and games until you have to actually understand what you’re doing…