by Robert A. Muenchen
R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.
If you have experience with other data science tools, you may at first find R very alien. Training and documentation that leverages your existing knowledge and which points out where your previous knowledge is likely to mislead you can save much frustration. This is the approach I use in my books, R for SAS and SPSS Users and R for Stata Users, as well as in the workshops that are based on those books.
Below is a list of complaints about R that I commonly hear from people taking my R workshops. By listing these, I hope R beginners will be forewarned, will become aware that many of these problems come with benefits, and may consider the solutions offered by the add-on packages that I suggest. As many have said, R makes easy things hard, and hard things easy. However, add-on packages help make the easy things easy as well.
Lack of Graphical User Interface (GUI)
Like most other packages, R’s full power is only accessible through programming. However unlike many others, it does not offer a standard GUI to help non-programmers do analyses. The two which are most like SAS Studio, SPSS or Stata are R Commander and Deducer. While they offer enough analytic methods to make it through an undergraduate degree in statistics, they lack control when compared to a powerful GUI. Worse, beginners must initially see a programming environment and then figure out how to find, install, and activate either GUI. Given that GUIs are aimed at people with fewer computer skills, this is a problem.
R’s help files are often thorough and usually contain many working examples. However, they’re definitely not written for beginners! My favorite example of this is the help file for one of the first commands that beginners learn: print. The SAS help file for its print procedure says that it “Prints observations in a SAS data set using some or all of the variables." Clear enough. The R help file for its print function says, “print prints its argument and returns it invisibly (via invisible(x)). It is a generic function which means that new printing methods can be easily added for new classes." The reader is left to wonder what “invisible" output looks like and what methods and classes are. The help files will tell you more about “methods" but not “classes". You have to know to look for help on “class" to find that.
Another confusing aspect to R’s help files stems from R’s ability to add new features to existing functions as you load add-on packages. This means you can’t simply read a help file, understand it, and you’re done learning that function forever. However, it does mean that you have fewer commands to learn. For example, once you learn to use the predict function, when you load a new package, that function may gain new abilities to deal with model objects that are computed specifically within the new package.
So an R beginner has to learn much more than a SAS or SPSS beginner before he or she will find the help files very useful. However, there is a vast array of tutorials, workshops and books available, many of them free, to get beginners over this hump.
Too Many Commands
Other data science packages have relatively few analysis commands but each of them have many options to control their output. R’s approach is quite the opposite, which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, analysis of variance (ANOVA) table, and so on. However, when you create a model in R, one command (summary) will provide the parameter estimates while another (anova) provides the ANOVA table. There are other commands such as “coefficients” that display only that part of the model.
It is relatively easy to recognize the correct answer on a multiple-choice test, but much harder to recall it from scratch for an essay exam. While SAS/SPSS output may include much more than you wanted to see, it allows you to recognize the pieces of output you wanted, without having to recall the command to get it. R’s piecemeal approach to commands means you are more dependent upon recall, making it inherently harder to learn.
However, R’s focus is always on enabling each output to be used as input to further analysis. Its “get just what you ask for” approach makes this easy, and that attracts developers. People looking only to use methods programmed by others (the great majority) will then benefit from the vast array of packages that developers make available.
Misleading Function or Parameter Names
The most difficult time people have learning R is when functions don’t do the “obvious" thing. For example when sorting data, SAS, SPSS and Stata all use commands appropriately named “sort." Turning to R, they look for such a command and, sure enough, there is one named exactly that. However, it does not sort data sets! Instead it sorts individual variables, which is often a very dangerous thing to do. In R, the order function sorts data sets and it does so in a somewhat convoluted way. However, the dplyr package has an arrange function that sorts data sets and it is quite easy to use.
Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed by R. Other packages require sorting before they can do three common tasks: (1) summarizing / aggregating data, (2) repeating an analysis for each group (“by" or “split-file" processing) and (3) merging files by key variables. R does not need the user to explicitly sort datasets before performing any of these tasks!
Another command that commonly confuses beginners is the simple “if” function. While it is used to recode variables (among other tasks) in other software, in R if controls the flow of commands, while ifelse performs tasks such as recoding.
Inconsistent Function Names
All languages have their inconsistencies. For example, it took SPSS developers decades before they finally offered a syntax-checking text editor. I was told by an SPSS insider that they would have done it sooner if the language hadn’t contained so many inconsistencies. SAS has its share of inconsistencies as well, with OUTPUT statements for some procedures and OUT options on others. However, I suspect that R probably has far more inconsistencies than most since it lacks a consistent naming convention. You see names in: alllowercase, period.separated, underscore_separated, lowerCamelCase and UpperCamelCase. Some of the built-in examples include:
names, colnames row.names, rownames rowSums, rowsum rowMeans, (no parallel rowmean exists) browseURL, contrib.url, fixup.package.URLs package.contents, packageStatus getMethod, getS3method read.csv and write.csv, load and save, readRDS and saveRDS Sys.time, system.time
When you include add-on packages, you can come across some real “whoppers!" For example, R has a built-in reshape function, the Hmisc package has a reShape function (case matters), and there are both reshape and reshape2 packages that reshape data, but neither of them contain a function named “reshape"! The most popular R package for reshaping data is tidyr, with its gather and spread commands.
Since everyone is free to add new capabilities to R, the resulting code for different R packages is often a bit of a mess. For example, the two blocks of code below do roughly the same thing but using radically different syntaxes. This type of inconsistency is common in R, but there’s no way to get around it given everyone can add to it as they like.
library("Deducer") descriptive.table( vars = d(mpg,hp), data = mtcars, func.names =c("Mean", "Median", "St. Deviation", "Valid N", "25th Percentile", "75th Percentile"))
library("RcmdrMisc") numSummary( data.frame(mtcars$mpg, mtcars$hp), statistics = c("mean", "sd", "quantiles"), quantiles = c(.25, .50, .75))
Dangerous Control of Variables
In R, a single analysis can include variables from multiple data sets. That usually requires that the observations be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they were convinced had the exact same observations in precisely the same order. However, a quick check on the files showed that they usually did not match up! It’s always safer to merge by key variables (like ID) if possible. So by enabling such analyses, R seems to be asking for disaster and I recommend merging files when possible by key variables before doing an analysis whenever possible.
Why does R allow such a dangerous operation? Because it provides useful flexibility. For example, from one dataset, you might plot regression lines of variable x against variable y for each of three groups on the same plot. From another dataset you might get group labels to add directly on the plot. This lets you avoid a legend that makes your readers look back and forth between the legend and the lines. The label dataset would contain only three variables: the group labels and their x-y locations. That’s a dataset of only 3 observations so merging that with the main data set makes little sense.
Inconsistent Ways to Analyze Multiple Variables
One of the first functions beginners typically learn is summary(x). As you might guess, it gets summary statistics for the variable x. That’s simple enough. You might guess that to analyze two variables, you would just enter summary(x, y). However, many functions in R, including this one, accept only single objects. The solution is to put those two variables into a single object such as a data frame: summary(data.frame(x,y)). So the generalization you need to make is not from one variable to multiple variables, but rather from one object (a variable) to another object (a dataset).
If that were the whole story, it would not be that hard to learn. Unfortunately, R functions are quite inconsistent in both what objects they accept and how many. In contrast to the summary example above, R’s max function can accept any number of variables separated by commas. However, its cor function cannot; they must be in a matrix or a data frame. R’s mean function accepts only a single variable and cannot directly handle multiple variables even if they are in a single data frame. The popular graphing package ggplot2 does not accept variables unless they’re combined into a data frame. These frustrating inconsistencies simply need to be memorized.
Overly Complicated Variable Naming and Renaming Process
People are often surprised when they learn how R names and renames its variables. Since R stores the names in a character vector, renaming even one variable means that you must first locate the name in that vector before you can put the new name in that position. That’s much more complicated than the simple newName=oldName form used by many other languages.
While this approach is more complicated, it offers great benefits. For example, you can easily copy all the names from one dataset into another. You can also use the full range of string manipulations (such as regular expressions) allowing you to use many different approaches to changing names. Those are capabilities that are either impossible or much more difficult to perform in other software.
If the data are coming from a text file, I recommend simply adding the names to a header line at the top of the file. If you need to rename them later, I recommend the dplyr package’s rename function. You can switch to R’s built-in approach when you need more flexibility.
Poor Ability to Select Variables
Most data science packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z), that share a common prefix (e.g. varA, varB,…) or that contain numeric suffixes (e.g. x1-x25, not necessarily next to one another). R generally lacks a built-in ability to make these selections easily, but the dplyr package’s select function is both easy and powerful.
R’s built-in approach does have offer a significant advantage. It is far more flexibility in variable selection than other software when you use a bit of additional programming. However, those shortcuts and more are provided by the dplyr package’s select function.
Too Many Ways to Select Variables
If variable x is stored in mydata and you want to get the mean of x, most software only offers you one way to do that, such as “VAR x;" in SAS. In R, you can do it in many different ways:
summary(mydata$x) summary(mydata$"x") summary(mydata["x"]) summary(mydata[,"x"]) summary(mydata[["x"]]) summary(mydata) summary(mydata[,1]) summary(mydata[]) with(mydata, summary(x)) attach(mydata) summary(x) summary(subset(mydata, select=x)) library("dplyr") summary(select(mydata, x)) mydata %>% with(summary(x)) mydata %>% summary(.$x) mydata %$% summary(x)
To add to the complexity, if we simply asked for the mean instead of several summary statistics, several of those approaches would generate error messages because they select x in a way that the mean function will not accept.
To make matters even worse, the above examples work with data stored in a data frame (what most software calls a dataset) but not all them work when the data are stored in a matrix.
Why are there so many ways to select variables? R has many data structures that give it great flexibility, and each can use slightly different approaches to variable selection. When you fully integrate matrix algebra language capabilities into the main language, rather than have it exist as a separate language such as SAS/IML, this too requires that you see more of what is often hidden, until you need it, in other software. The last few of examples above come from the dplyr package, which makes variable selection much easier, but of course that also means having to learn more.
Too Many Ways to Transform Variables
Data science software typically offers one way to transform variables. SAS has its data step. SPSS has the COMPUTE statement and so on. R has several approaches, some of which are shown below. Here are some ways R can create a new variable named “mysum" by adding two variables, x and y:
mydata$mysum <- mydata$x + mydata$y mydata$mysum <- with(mydata, x + y) mydata["mysum"] <- mydata["x"] + mydata["y"] attach(mydata) mydata$mysum <- x + y detach(mydata) mydata <- within(mydata, mysum <- x + y ) mydata <- transform(mydata, mysum = x + y) library("dplyr") mydata <- mutate(mydata, mysum = x + y) mydata <- mydata %>% mutate(mysum = x + y)
Some are variations variable selection methods, such as the use of the attach function which, for transformation purposes, also requires the use of the “mydata$" prefix (or an equivalent form) to ensure that the new variable gets stored in the original data frame. Leaving that out leads to a major source of confusion as beginners assume the new variable will go into the attached data frame (it won’t!) The use of the “within” function parallels the use of the similar “with” function for variable selection, but it allows for variable modification while “with” does not.
The cost of this situation is clear. The benefit comes from the integration of multiple types of commands (macro, matrix, etc.) and data structures.
Not All Functions Accept All Variables
In most data science software, a variable is a variable, and all procedures accept them. In R however, a variable could be a vector, a factor, a member of a data frame or even a component of a complex structure in R called a list. For each function you have to learn what it will accept for processing. For example, most simple statistical functions for the mean, median, etc. will accept variables stored as vectors. They’ll also accept variables in datasets or lists, but only if you select them in such a way that they become vectors on the fly.
This complexity is again the unavoidable byproduct of R’s powerful set of data structures which includes vectors, factors, matrices, data frames, arrays and lists (more on this later). Despite adding complexity, it offers a wide array of advantages. For example, categorical variables that are stored as factors can be included in a regression model and R will automatically generate the dummy or indicator variables needed to make such a variable work well in the model.
Confusing Name References
Names must sometimes be enclosed in quotes, but at other times must not be. For example, to install a package and load it into memory you can use these inconsistent steps:
You can make those two commands consistent by adding quotes around “dplyr" in the second command, but not by removing them from the first. To remove the package from memory you use the following command, which refers to the package in yet another way (quotes optional):
Poor Ability to Select Data Sets
The first task in any data analysis is selecting a data set to work with. Most other software have ways to specify the data set to use that is (1) easy, (2) safe and (3) consistent. R offers several ways to select a data set, but none that meets all three criteria. Referring to variables as mydata$myvar works in many situations, but it’s not easy as you end up typing “mydata" over and over. R has an attach function, but its use is quite tricky, giving beginners the impression that new variables will be stored there (by default, they won’t) and that R will pick variables from that dataset before it looks elsewhere (it won’t). Some functions offer a data argument, but not all do. Even when a function offers that argument, it only works if you specify a model formula too (e.g. paired tests don’t need formulas).
If this is so easy in other software but so confusing in R, what’s the point? Part of it is the price to pay for great flexibility. R is the only software I know of that allows you to include variables from multiple datasets in a single analysis. So you need ways to change datasets in the middle of an analysis. However, part of that may simply be design choices that could have been better in hindsight. For example, it could have been designed with a data argument in all functions, with a system-wide option to look for a default data set, as SAS has.
R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are so critical to applying the same function to multiple variables, this seems strange. R instead uses its “apply" family of functions. You tell R to apply the function to either rows or columns. It’s a mental adjustment to make, but the result is the same. The benefit that this offers is that it’s sometimes easier to write and understand apply functions.
Functions That Act Like Procedures
Many other packages, including SAS, SPSS, and Stata have procedures or commands that do typical data analyses which go “down" through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions and many functions can do both. How does it get away with that? Functions may have a preference to go down rows or across columns but for many functions you can use the “apply" family of functions to force them to go in either direction. So it’s true that in R, functions act like both procedures and functions from other packages. If you’re coming to R from other software, that’s a radically new approach.
Odd Treatment of Missing Values
In all data analysis packages that I’m aware of, missing values are treated the same: they’re excluded automatically when (1) selecting observations and (2) performing analyses. When selecting observations, R actually inserts missing values! For example, say you have this data set:
Gender English Math male 85 82 male 72 87 75 81 female 77 78 female 98 91
If you select the males using mydata[mydata$gender == “male", ] R will return the top three lines, and substitute its missing value symbol, NA, in place of the values 75 and 81 for the third observation. Why create missing where there were none before? It’s as if the R designers considered missing values to be unusual and thought you needed to be warned of their existence. In my experience, missing values are so common that when I get a data set that appears to have none, I’m quite suspicious that someone has failed to set some special code to be missing. There’s a solution to this in R, which is to ask for “which” of the observations is the logic true with mydata[which(mydata$gender == “male”), ]. The dplyr package also avoids this with the function filter(mydata, gender == “male”).
When performing more complex analyses, using what R calls “modeling functions," missing values are excluded automatically. However, when it comes to simple functions such as mean or median, it does the reverse by returning a missing value result if the variable contains even a single missing value. You get around that my specifying na.rm=TRUE on every function call, but why should you have to? While there are system-wide options you can set for many things such as width of output lines, there’s no option to avoid this annoyance. It’s not hard to create your own function that has missing values removed by default, but that seems like overkill for such a simple annoyance.
Neither of these conditions seems to offer any particular benefit to R. They’re minor inconveniences that R users learn to live with.
Odd Way of Counting Valid or Missing Values
The one function that could really benefit from excluding missing values, the length function, cannot exclude them! While most package include a function named something like n or nvalid, R’s approach to counting valid responses is to (1) check if each value is missing with the is.na function, then (2) use the not symbol “!" to find the non-missing. You have to know that (3) this generates a vector of true/false values that have numeric value of 1/0 respectively. Then you add them up (4) with the sum function. That’s an awful lot of complexity compared to n(x). However, it’s easy to define your own n function or you can use add-on packages that already contain one, such as the prettyR package’s n.valid function.
Too Many Data Structures
As previously mentioned, R has vectors, factors, matrices, arrays, data frames (datasets) and lists. And that’s just for starters! Modeling functions create many variations on these structures and they also create whole new ones. Users are free to create their own data structures, and some of these have become quite popular. Along with all these structures comes a set of conversion functions that switch an object’s structure from one type to another, when possible. Given that so many other analytics packages get by with just one structure, the dataset, why go to all this trouble? If you added the various data structures that exist in other packages’ matrix languages, you would see a similar amount of complexity. Additional power requires additional complexity.
Two-dimensional objects easily become one-dimensional objects. For example, this way of selecting males uses the variable “Gender" as part of a two-dimensional data frame, so you need to have a comma follow the logical selection:
with(Talent[Talent$Gender == "Male", ], summary( data.frame(English, Reading) ) )
But in this very similar way of getting the same thing, Gender is selected as a one-dimensional vector and so adding the comma (which implies a second dimension) would generate an error message:
with(Talent, summary( data.frame( English[Gender == "Male"], Reading[Gender == "Male"] ) )
R’s output is often quite sparse. For example, when doing cross-tabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason for this is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a cross-tabulation is as simple as barplot(table(a,b)). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However there are add-on packages that provide more comprehensive output that is essentially identical to that provided by other packages. Some of them are shown here.
The default output from SAS and SPSS is nicely formatted making it easy to read and paste your word processor as fully editable tables. All future output can be set to journal style (or many other styles) by simply setting an option. R not only lacks true tabular output by default, it does not even provide tabs between columns. So beginners commonly paste the output into their word processor and select a mono-spaced font to keep the columns aligned.
R does have the ability to get its output looking good, but it does so using additional packages such as compareGroups, knitr, pander, sjPlot, sweave, texreg, xtable, and others. In fact, its ability to display output in complex tables (e.g. showing multiple models) is better than any other package I’ve seen.
Complex By-Group Analysis
Most packages let you repeat any analysis simply by adding a command like “by group" to it. R’s built-in approach is far more complex. It requires you to create a macro-like function that does the analysis steps you need. Then you apply that function by group. Other languages let you avoid learning that type of programming until you’re doing more complex tasks. However, the deeper integration of such macro-like facilities in R means that the functions you write are much more integrated into the complete system.
Complex & Inconsistent Output Management
Strongly related to by-group processing (above) is output management. Printed output from an analysis isn’t always the desired end result. Output often requires further processing. For example, at my university we routinely do the same salary regression model for each of over 250 departments. We don’t care about most of that output, we only want to see results for the few departments whose salaries seem to depend on gender or ethnicity. We’re hoping to find none of course!
Using base R, here are the steps required:
1. Create a function that does the regression.
2. Apply the function using one of: by, tapply, plyr’s dlply.
3. Save the output to a list.
4. Study the structure of that list to figure out where the needed values are stored.
5. Write an extractor function to pull out the needed values.
6. Apply that function using lapply or plyr’s ldply to create a useful data frame.
While these are the general steps, each type of model function creates a list that’s unique, with pieces of output stored inside the list in many types of structures, and labeled in inconsistent ways. For example, p-values from three different types of analysis might be labeled, “p", “pvalue", or “p.value". Luckily an add-on package named broom has commands that will convert the output you need into data frames, standardizing the names as it does so.
R’s complexity here is not without its advantages. For example, it would be relatively easy to do two regression models per group, save the output, and then do an additional set of per-group analysis that compared the two models statistically. That would be quite a challenge for most statistics packages (see the help file of the dplyr package’s “do” function for an example.)
Unclear Way to Clear Memory
Another unusually complex approach R takes is the way it clears its workspace in your computer’s memory. While a simple “clear" command would suffice, the R approach is to ask for a listing of all objects in memory and then delete that list: rm(list = ls() ). What knowledge is required to understand this command? You have to (1) know that “ls()" lists the objects in memory, (2) what a character vector is because (3) “ls()" will return the object names in a character vector, (4) that rm removes objects, (5) that the “list" argument is not really asking for information stored in an R list, but rather in a character vector. That’s a lot of things to know for such a basic command!
This approach does have its advantages though. The command that lists objects in memory has a powerful way to search for various patterns in the names of variables or data sets that you might like to delete.
Luckily, the popular RStudio front-end to R offers a broom icon that clears memory with a single click.
All analytics software has names for their variables, but R is unique in the fact that it also names its rows. This means you must learn how to manage row names as well as variable names. For example, when reading a comma-separated-values file, variable names often appear as the first row, and all analytics software can read those names. An identifier variable such as ID, Personnel_Num, etc. often appears in the first column of such files. Using R’s read.csv() function, if that variable is not named, R will assume that it’s an ID variable and it will convert its values into row names. However if it is named – as all other software would require – then you add some options to tell R to put its values, or the values of any other variable you choose, into the row names position. Once you do that, the name of the original ID variable vanishes. The benefit to this odd behavior is that when analyses or plots need to identify observations, R automatically knows where to get those names. This saves R from needing the SAS equivalent to the ID statement (e.g. in PROC CLUSTER).
While this looks like a worthwhile trade-off, it is complicated by the fact that row names must be unique. That means you cannot maintain the original row names when you stack two files that have the same variables if you measured the same observations at two times, perhaps before and after some treatment. It also means that combining by-group output can be tricky, though the broom package takes that into account for you. The popular dplyr package replaces row names with character values of the consecutive integers, 1, 2, 3…. My advice is to handle your own ID variables as standard variables and put them into the row names position only when an R function offers some you some benefit in return. (For a dplyr example, see The Tidyverse Curse below.)
The Tidyverse Curse
There’s a common theme in many of the sections above: a task that is hard to perform using base a R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017.) As much of a blessing as these commands are, they’re also a curse to beginners as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
A demonstration of the mental overhead required to use tidyverse function involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example using the built-in mtcars data set using R’s built-in print function:
> print(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ...
We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).
> library("dplyr") > mtcars %>% + as_data_frame() %>% + print() # A tibble: 32 × 11 mpg cyl disp hp drat wt qsec vs am gear carb * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 # ... with 22 more rows
The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_columns() function:
> library("dplyr") > mtcars %>% + as_data_frame() %>% + rownames_to_column() %>% + print() Error in function_list[[i]](value) : could not find function "rownames_to_column"
Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone.)
> library("dplyr") > library("tibble") > mtcars %>% + as_data_frame() %>% + rownames_to_column() %>% + print() # A tibble: 32 × 12 rowname mpg cyl disp hp drat wt qsec vs am gear carb <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 # ... with 22 more rows
Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.
In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them because now they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
Most other data science software requires row names to be stored in a standard variable e.g. rowname. You then supply its name to procedures with something like SAS’
“ID rowname;” statement. That’s less to learn.
This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.
Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:
> songs_df %>% + filter(song == "american pie") %>% + select(lyrics) %>% + print() # A tibble: 1 × 1 lyrics <chr> 1 a long long time ago i can still remember how that music used
The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:
> songs_df %>% + filter(song == "american pie") %>% + select(lyrics) %>% + as.data.frame() %>% + print() ... <truncated> 1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...
These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.
Lack of Beneficial Constraints
To this point, I have listed the main aspects of R that confuse beginners. Many of them are the result of R’s power and flexibility. R tightly integrates commands for (1) data management, (2) data analysis and graphics, (3) macro facilities (4) output management and (5) matrix algebra capabilities. That combination, along with its rich set of data structures and the ease with which anyone can extend its capabilities, make R a potent tool for data science.
However, the converse of that is provided by its competitors, such as SAS, SPSS and Stata. A notable advantage that they offer is “beneficial constraints." These packages use one kind of input, provide one kind of output, provide very limited ways to select variables, provide one way to rename variables, provide one way to identify cases, provide one way to do by-group analyses, and are limited in a variety of other ways. Their macro and matrix languages, and output management systems, are separated enough as to be almost invisible to beginners and even intermediate users.
Those constraints limit what you can do and how you can do it, making the software much easier to learn and use, especially for those who only occasionally need the software. They provide power that’s perfectly adequate to solve any problem for which they have pre-written solutions. And the set of solutions they provide is rich. (Stata also provides an extensive set of user-written solutions.)
There are many aspects of R that make it hard for beginners to learn. Some of these are due to R’s unique nature, some are due to its power, flexibility and extensibility. Others are due to aspects of its design that don’t really offer benefits. However, R’s power, free price, and open source nature attract developers, and the wide array of add-on tools have resulted in software that is growing rapidly in popularity.
R has been my main data science tool of choice for over a decade now, and I have found its many benefits to be well worth its peccadilloes. If you currently program in SAS, SPSS, or Stata and find its use comfortable, then I recommend giving R a try to see how you like it. I hope this article will help prepare you for some of the pitfalls that you’re likely to run into by explaining how some of them offer long-term advantages, and by offering a selection of add-on packages to help ease your transition.
However, if you use one of those languages and view it as challenging, then learning R may not be for you. Don’t fret, you can enjoy the simplicity of other software, while calling R functions from that software as I describe here.
If you’re already an R user and I’ve missed any of your pet peeves, please list them in the comments section below.