by Robert A. Muenchen
R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.
If you have experience with other analytics tools, you may at first find R very alien. Training and documentation that leverages your existing knowledge and which points out where your previous knowledge is likely to mislead you can save much frustration. This is the approach I use in my books, R for SAS and SPSS Users and R for Stata Users as well as the workshops that are based on them.
Below is a list of complaints about R that I commonly hear from people taking my R workshops. By listing these, I hope R beginners will be forewarned, will become aware that many of these problems come with benefits, and may consider the solutions offered by the add-on packages that I suggest. As many have said, R makes easy things hard, and hard things easy. However, add-on packages help make the easy things easy as well.
R’s help files are often thorough and usually contain many working examples. However, they’re definitely not written for beginners! My favorite example of this is the help file for one of the first commands that beginners learn: print. The SAS help file for its print procedure says that it “Prints observations in a SAS data set using some or all of the variables." Clear enough. The R help file for its print function says, “print prints its argument and returns it invisibly (via invisible(x)). It is a generic function which means that new printing methods can be easily added for new classes." The reader is left to wonder what “invisible" output looks like and what methods and classes are. The help files will tell you more about “methods" but not “classes". You have to know to look for help on “class" to find that.
Another confusing aspect to R’s help files stems from R’s ability to add new capabilities (called methods) to some functions as you load add-on packages. This means you can’t simply read a help file, understand it, and you’re done learning that function forever. However, it does mean that you have fewer commands to learn. For example, once you learn to use the predict function, when you load a new package, that function may gain new abilities to deal with model objects that are computed specifically within the new package.
So an R beginner has to learn much more than a SAS or SPSS beginner before he or she will find the help files very useful. However, there is a vast array of tutorials, workshops and books available, many of them free, to get beginners over this hump.
Misleading Function or Parameter Names
The most difficult time people have learning R is when functions don’t do the “obvious" thing. For example when sorting data, SAS, SPSS and Stata users all utilize commands appropriately named “sort." Turning to R, they look for such a command and, sure enough, there’s one named exactly that. However, it does not sort datasets! Instead it sorts individual variables, which is often a very dangerous thing to do. In R, the order function sorts data sets and it does so in a somewhat convoluted way. However the dplyr package has an arrange function that sorts data sets and is quite easy to use.
Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed by R. When other packages require sorting before they can do three common tasks: (1) summarizing / aggregating data, (2) repeating an analysis for each group (“by" or “split-file" processing) and (3) merging files by key variables. R does not need the user to explicitly sort datasets before performing any of these tasks!
Another command that commonly confuses beginners is the simple “if” function. While it is used to recode variables (among other tasks) in other software, in R if controls the flow of commands, while ifelse performs tasks such as recoding.
Too Many Commands
Other statistics packages have relatively few analysis commands but each of them have many options to control their output. R’s approach is quite the opposite which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, analysis of variance (ANOVA) table, and so on. However, when you create a model in R, one command (summary) will provide the parameter estimates while another (anova) provides the ANOVA table. There are other commands such as “coefficients” that display only that part of the model. So there are more commands to learn, but fewer options are needed for each.
Since everyone is free to add new capabilities to R, the resulting code for different R packages is often a bit of a mess. For example, the two blocks of code below do roughly the same thing but using radically different syntaxes. This type of inconsistency is common in R, but there’s no way to get around it given everyone can add to it as they like.
library("Deducer") descriptive.table( vars = d(mpg,hp), data = mtcars, func.names =c("Mean", "Median", "St. Deviation", "Valid N", "25th Percentile", "75th Percentile"))
library("RcmdrMisc") numSummary( data.frame(mtcars$mpg, mtcars$hp), statistics = c("mean", "sd", "quantiles"), quantiles = c(.25, .50, .75))
All analytics software has names for their variables, but R is unique in the fact that it also names its rows. This means you must learn how to manage row names as well as variable names. For example, when reading a comma-separated-values file, variable names often appear as the first row, and all analytics software can read those names. An identifier variable such as ID, Personnel_Num, etc. often appears in the first column of such files. In R, if that variable is not named, R will assume that it’s an ID variable and it will convert its values into row names. However if it is named – as all other software would require – then you add some options to tell R to put its values, or the values of any other variable you choose, into the row names position. Once you do that, the name of the original ID variable vanishes. The benefit to this odd behavior is that when analyses or plots need to identify observations, R automatically knows where to get those names. This saves R from needing the SAS equivalent to the ID statement (e.g. PROC CLUSTER).
While this looks like a worthwhile tradeoff, it is complicated by the fact that row names must be unique. That means you cannot maintain the original row names when you stack two files that have the same variables if you measured the same observations at two times, perhaps before and after some treatment. It also means that combining by-group output can be tricky, though the broom package takes that into account for you. The popular dplyr package replaces row names with character values of the consecutive integers, 1, 2, 3…. My advice is to handle your own ID variables as standard variables and put them into the row names position only when an R function offers some you some benefit in return.
Dangerous Control of Variables
In R, a single analysis can include variables from multiple data sets. That usually requires that the observations be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they were convinced had the exact same observations in precisely the same order. However, a quick check on the files showed that they usually did not match up! It’s always safer to merge by key variables (like ID) if possible. So by enabling such analyses R seems to be asking for disaster and I recommend merging files when possible by key variables before doing an analysis.
Why does R allow such a dangerous operation? Because it provides useful flexibility. For example, from one dataset, you might plot regression lines of variable x against variable y for each of three groups on the same plot. From another dataset you might get group labels to add. This lets you avoid a legend that makes your readers look back and forth between the legend and the lines. The label dataset would contain only three variables: the group labels and their x-y locations. That’s a dataset of only 3 observations so merging that with the main data set makes little sense.
Inconsistent Ways to Analyze Multiple Variables
One of the first functions beginners typically learn is summary(x). As you might guess, it gets summary statistics for the variable x. That’s simple enough. You might guess that to analyze two variables, you would just enter summary(x, y). However that’s wrong because many functions in R, including this one, accept only single objects. The solution is to put those two variables into a single object such as a data frame: summary(data.frame(x,y)). So the generalization you need to make is not from one variable to multiple variables, but rather from one object (a variable) to another object (a dataset).
If that were the whole story, it would not be that hard to learn. Unfortunately, R functions are quite inconsistent in both what objects they accept and how many. In contrast to the summary example above, R’s max function can accept any number of variables separated by commas. However, its cor function cannot; they must be in a data frame. R’s mean function accepts only a single variable and cannot directly handle multiple variables even if they are in a single data frame. These inconsistencies simply need to be memorized.
Overly Complicated Variable Naming and Renaming Process
People are often surprised when they learn how R names and renames its variables. Since R stores the names in a character vector, renaming even one variable means that you must first locate the name in that vector before you can put the new name in that position. That’s much more complicated than the simple newName=oldName form used by many other languages.
While this approach is more complicated, it offers great benefits. For example, you can easily copy all the names from one dataset into another. You can also use the full range of string manipulations (such as regular expressions) allowing you to use many different approaches to changing names. Those are capabilities that are either impossible or much more difficult to perform in other languages.
If the data are coming from a text file, I recommend simply adding the names to a header line at the top of the file. If you need to rename them later, I recommend the dplyr package’s rename function. You can convert to R’s built-in approach when you need more flexibility.
Poor Ability to Select Variables
Most data analysis packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z), that share a common prefix (e.g. varA, varB,…) or that contain numeric suffixes (e.g. x1-x25, not necessarily next to one another). R generally lacks a built-in ability to make these selections easily. While shortcuts for those aren’t built into R, it does offer far more flexibility in variable selection than other software when you use a bit of additional programming. However, those shortcuts and more are provided by the dplyr package’s “select” function.
Too Many Ways to Select Variables
If variable x is stored in mydata and you want to get the mean of x, most software only offers you one way to do that, such as “var x;" in SAS. In R, you can do it in many different ways:
summary(mydata$x) summary(mydata$"x") summary(mydata["x"]) summary(mydata[,"x"]) summary(mydata[["x"]]) summary(mydata) summary(mydata[,1]) summary(mydata[]) with(mydata, summary(x)) attach(mydata) summary(x) summary(subset(mydata, select=x)) library("dplyr") summary(select(mydata, x)) mydata %>% with(summary(x)) mydata %>% summary(.$x) mydata %$% summary(x)
To add to the complexity, if we simply asked for the mean instead of several summary statistics, several of those approaches would generate error messages because they select x in a way that the mean function will not accept.
To make matters even worse, the above examples work with data stored in a data frame (what most software calls a dataset) but not all work when the data are stored in a matrix.
Why are there so many ways to select variables? R has many data structures that give it great flexibility, and each can use slightly different approaches to variable selection. When you fully integrate matrix algebra language capabilities into the main language, rather than have it exist as a separate language such as SAS/IML, this too requires that you see more of what is often hidden, until you need it, in other software. The last few examples come from the dplyr package, which makes variable selection much easier, but of course that also means having to learn more.
Too Many Ways to Transform Variables
Analytics software typically offers one way to transform variables. SAS has its data step. SPSS has the COMPUTE statement and so on. R has several approaches, some of which are shown below. Here are some ways R can create a new variable named “mysum" by adding two variables, x and y:
mydata$mysum <- mydata$x + mydata$y mydata$mysum <- with(mydata, x + y) mydata["mysum"] <- mydata["x"] + mydata["y"] attach(mydata) mydata$mysum <- x + y detach(mydata) mydata <- within(mydata, mysum <- x + y ) mydata <- transform(mydata, mysum = x + y) library("dplyr") mydata <- mutate(mydata, mysum = x + y) mydata <- mydata %>% mutate(mysum = x + y)
Some are variations the ways to select variables, such as the use of the attach function which, for transformation purposes, also requires the use of the “mydata$" prefix (or an equivalent form) to ensure that the new variable gets stored in the original data frame. Leaving that out leads to a major source of confusion as beginners assume the new variable will go into the attached data frame (it won’t!) The use of the “within” function parallels the use of the similar “with” function for variable selection, but it allows for variable modification while “with” does not.
The cost of this situation is clear. The benefit comes from the integration of multiple types of commands (macro, matrix, etc.) and data structures.
Not All Functions Accept All Variables
In most analytics software, a variable is a variable, and all procedures accept them. In R however, a variable could be a vector, a factor, a member of a data frame or even a component of a complex structure in R called a list. For each function you have to learn what it will accept for processing. For example, most simple statistical functions for the mean, median, etc. will accept variables stored as vectors. They’ll also accept variables in datasets or lists, but only if you select them in such a way that they become vectors on the fly.
This complexity is again the unavoidable byproduct of R’s powerful set of data structures which includes vectors, factors, matrices, data frames, arrays and lists (more on this later). Despite adding complexity, it offers a wide array of advantages. For example, categorical variables that are stored as factors can be included in a regression model and R will automatically generate the dummy or indicator variables needed to make such a variable work well in the model.
Confusing Name References
Names must sometimes be enclosed in quotes, but at other times must not be. For example, to install a package and load it into memory you can use these inconsistent steps:
You can make those two commands consistent by adding quotes around “dplyr" in the second command, but not by removing them from the first. To remove the package from memory you use the following command, which refers to the package in yet another way (quotes optional):
Poor Ability to Select Data
The first task in any data analysis is selecting a data set to work with. Most other software have ways to specify the data set to use that is (1) easy, (2) safe and (3) consistent. R offers several ways to select a data set, but none that meets all three criteria. Referring to variables as mydata$myvar works in many situations, but it’s not easy as you end up typing “mydata" over and over. R has an attach function, but its use is quite tricky, giving beginners the impression that new variables will be stored there (by default, they won’t) and that R will pick variables from that dataset before it looks elsewhere (it won’t). Some functions offer a data argument, but not all do. Even when a function offers that argument, it only works if you specify a model formula too (e.g. paired tests don’t need formulas).
If this is so easy in other software but so confusing in R, what’s the point? Part of it is the price to pay for great flexibility. R is the only software I know of that allows you to include variables from multiple datasets in a single analysis. So you need ways to change datasets in the middle of an analysis. However, part of that may simply be design choices that could have been better in hindsight. For example, it could have been designed with a data argument in all functions, with a system-wide option to look for a default data set, as SAS has.
R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are so critical to applying the same function to multiple variables, this seems strange. R instead uses its “apply" family of functions. You tell R to apply the function to either rows or columns. It’s a mental adjustment to make, but the result is the same. The benefit that this offers is that it’s sometimes easier to write and understand apply functions.
Functions That Act Like Procedures
Many other packages, including SAS, SPSS and Stata have procedures or commands that do typical data analyses which go “down" through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions and many functions can do both. How does it get away with that? Functions may have a preference to go down rows or across columns but for many functions you can use the “apply" family of functions to force them to go in either direction. So it’s true that in R, functions act like both procedures and functions from other packages. If you’re coming to R from other software, that’s a wild new idea.
Inconsistent Function Names
All languages have their inconsistencies. For example, it took SPSS developers decades before they finally offered a syntax-checking text editor. I was told by an SPSS insider that they would have done it sooner if the language hadn’t contained so many inconsistencies. SAS has its share of inconsistencies as well, with OUTPUT statements for some procedures and OUT options on others. However, I suspect that R probably has far more inconsistencies than most since it lacks a consistent naming convention. You see names in: alllowercase, period.separated, underscore_separated, lowerCamelCase and UpperCamelCase. Some of the built-in examples include:
names, colnames row.names, rownames rowSums, rowsum rowMeans, (no parallel rowmean exists) browseURL, contrib.url, fixup.package.URLs package.contents, packageStatus getMethod, getS3method read.csv and write.csv, load and save, readRDS and saveRDS Sys.time, system.time
When you include add-on packages, you can come across some real “whoppers!" For example, R has a built-in reshape function, the Hmisc package has a reShape function (case matters), and there are both reshape and reshape2 packages that reshape data, but neither of them contain a function named “reshape"!
Odd Treatment of Missing Values
In all data analysis packages that I’m aware of, missing values are treated the same: they’re excluded automatically when (1) selecting observations and (2) performing analyses. When selecting observations, R actually inserts missing values! For example, say you have this data set:
Gender English Math male 85 82 male 72 87 75 81 female 77 78 female 98 91
If you select the males using mydata[mydata$gender == “male", ] R will return the top three lines, and substitute its missing value symbol, NA, in place of the values 75 and 81 for the third observation. Why create missing where there were none before? It’s as if the R designers considered missing values to be unusual and thought you needed to be warned of their existence. In my experience, missing values are so common that when I get a data set that appears to have none, I’m quite suspicious that someone has failed to set some special code to be missing. There’s a solution to this in R, which is to ask for which of the observations is the logic true with mydata[which(mydata$gender == “male"), ].
When performing more complex analyses, using what R calls “modeling functions," missing values are excluded automatically. However, when it comes to simple functions such as mean or median, it does the reverse by returning a missing value result if the variable contains even a single missing value. You get around that my specifying na.rm=TRUE on every function call, but why should you have to? While there are system-wide options you can set for many things such as width of output lines, there’s no option to avoid this annoyance. It’s not hard to create your own function that has missing values removed by default, but that seems like overkill for such a simple annoyance.
Neither of these conditions seems to offer any particular benefit to R. They’re minor inconveniences that R users learn to live with.
Odd Way of Counting Valid or Missing Values
The one function that could really benefit from excluding missing values, the length function, cannot exclude them! While most package include a function named something like n or nvalid, R’s approach to counting valid responses is to (1) check if each value is missing with the is.na function, then (2) use the not symbol “!" to find the non-missing. You have to know that (3) this generates a vector of true/false values that have numeric value of 1/0 respectively. Then you add them up (4) with the sum function. That’s an awful lot of complexity compared to n(x). However, it’s easy to define your own n function or you can use add-on packages that already contain one, such as the prettyR package’s n.valid function.
Too Many Data Structures
As previously mentioned, R has vectors, factors, matrices, arrays, data frames (datasets) and lists. And that’s just for starters! Modeling functions create many variations on these structures and they also create whole new ones. Users are free to create their own data structures, and some of these have become quite popular. Along with all these structures comes a set of conversion functions that switch an object’s structure from one type to another, when possible. Given that so many other analytics packages get by with just one structure, the dataset, why go to all this trouble? If you added the various data structures that exist in other package’s matrix languages, you would see a similar amount of complexity. Additional power requires additional complexity.
Two-dimensional objects easily become one-dimensional objects. For example, this way of selecting males uses the variable “Gender" as part of a two-dimensional data frame, so you need to have a comma follow the logical selection:
with(Talent[Talent$Gender == "Male", ], summary( data.frame(English, Reading) ) )
But in this very similar way of getting the same thing, Gender is selected as a one-dimensional vector and so adding the comma (which implies a second dimension) would generate an error message:
with(Talent, summary( data.frame( English[Gender == "Male"], Reading[Gender == "Male"] ) )
Complex By-Group Analysis
Most packages let you repeat any analysis simply by adding a command like “by group" to it. R’s built-in approach is far more complex. It requires you to create a macro-like function that does the analysis steps you need. Then you apply that function by group. Other languages let you avoid learning that type of programming until you’re doing more complex tasks. However, the deeper integration of such macro-like facilities in R means that the functions you write are much more integrated into the complete system.
R’s output is often quite sparse. For example, when doing cross-tabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason for this is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a cross-tabulation is as simple as barplot(table(a,b)). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However there are add-on packages that provide more comprehensive output that is essentially identical to that provided by other packages. Some of them are shown here.
The default output from SAS and SPSS is nicely formatted making it easy to read and paste your word processor as complete tables. All future output can be set to journal style (or many other styles) by simply setting an option. R not only lacks true tabular output by default, it does not even provide tabs between columns. So beginners commonly paste the output into their word processor and select a mono-spaced font to keep the columns aligned.
R does have the ability to get its output looking good, but it does so using additional packages such as compareGroups, knitr, pander, sjPlot, sweave, texreg, xtable, and others. In fact, its ability to display output in complex tables (e.g. showing multiple models) is better than any other package I’ve seen.
Complex & Inconsistent Output Management
Strongly related to by-group processing (above) is output management. In its simplest case, you might do one analysis and use output management to select what output to display. R reverses the usual process by showing you very little up front and letting you ask to display more after creating a model (similar to Stata’s approach.)
However, printed output from an analysis isn’t always the desired end result. Output often requires further processing. For example, at my university we routinely do the same salary regression model for each of over 250 departments. We don’t care about most of that output, we only want to see results for the few departments whose salaries seem to depend on gender or ethnicity. We’re hoping to find none of course! R has a special data structure called a “list" that’s ideal for storing such complex output. However, each type of model function creates a list that’s unique, with pieces of output stored inside the list in many types of structures, and labeled in inconsistent ways. For example, p-values from three different types of analyses might be labeled, “p", “pvalue", or “p.value". Luckily an add-on package named broom has commands that will convert the output you need into data frames, standardizing the names as it does so. From there you can do whatever additional analyses you need.
However, R’s complexity here is not without its advantages. For example, it would be relatively easy for me to do two regression models per group, save the output, and then do an additional set of per-group analyses that compared the two models statistically. That would be quite a challenge for most statistics packages (see the help file of the dplyr package’s “do” function for an example.)
Unclear Way to Clear Memory
Another unusually complex approach R takes is the way it clears its workspace in your computer’s memory. While a simple “clear" command would suffice, the R approach is to ask for a listing of all objects in memory and then delete that list: rm(list = ls() ). What knowledge is required to understand this command? You have to (1) know that “ls()" lists the objects in memory, (2) what a character vector is because (3) “ls()" will return the object names in a character vector, (4) that rm removes objects, (5) that the “list" argument is not really asking for information stored in an R list, but rather in a character vector. That’s a lot of things to know for such a basic command!
This approach does have its advantages though. The command that lists objects in memory has a powerful way to search for various patterns in the names of variables or data sets that you might like to delete.
Luckily, the popular RStudio front-end to R offers a broom icon that clears memory with a single click.
Lack of Graphical User Interface (GUI)
Like most other packages R’s full power is only accessible through programming. However unlike many others, it does not offer a standard GUI to help non-programmers do analyses. The two which are most like SPSS and Stata are R Commander and Deducer. While they offer enough analytic methods to make it through an undergraduate degree in statistics, they lack control when compared to a powerful GUI. Worse, beginners must initially see a programming environment and then figure out how to find, install, and activate either GUI. Given that GUIs are aimed at people with fewer computer skills, this is a problem.
Lack of Beneficial Constraints
To this point, I have listed the main aspects of R that confuse beginners. Many of them are the result of R’s power and flexibility. R tightly integrates commands for (1) data management, (2) data analysis and graphics, (3) macro facilities (4) output management and (5) matrix algebra capabilities. That combination, along with its rich set of data structures and the ease with which anyone can extend its capabilities, make R a potent tool for analytics.
However, the converse of that is provided by its competitors, such as SAS, SPSS and Stata. A notable advantage that they offer is “beneficial constraints." These packages use one kind of input, provide one kind of output, provide very limited ways to select variables, provide one way to rename variables, provide one way to do by-group analyses, and are limited in a variety of other ways. Their macro and matrix languages, and output management systems, are separated enough as to be almost invisible to beginners and even intermediate users. Those constraints limit what you can do and how you can do it, making the software much easier to learn and use, especially for those who only occasionally need the software. They provide power that’s perfectly adequate to solve any problem for which they have pre-written solutions. And the set of solutions they provide is rich. (Stata also provides an extensive set of user-written solutions.)
There are many aspects of R that make it hard for beginners to learn. Some of these are due to R’s unique nature, some are due to its power, flexibility and extensibility. Others are due to aspects of its design that don’t really offer benefits. However, R’s power, free price, and open source nature attract developers, and the wide array of add-on tools have resulted in software that is growing rapidly in popularity.
R has been my main analytics tool of choice for close to a decade now, and I have found its many benefits to be well worth its peccadilloes. If you currently program in SAS, SPSS or Stata and find its use comfortable, then I recommend giving R a try to see how you like it. I hope this article will help prepare you for some of the pitfalls that you’re likely to run into by explaining how some of them offer long-term advantages, and by offering a selection of add-on packages to help ease your transition. However, if you use one of those languages and view it as challenging, then learning R may not be for you. Don’t fret, you can enjoy the simplicity of other software, while calling R functions from that software as I describe here.
If you’re already an R user and I’ve missed any of your pet peeves, please list them in the comments section below.