by Robert A. Muenchen
R has a reputation of being hard to learn. Some of that is due to the fact that it is radically different from other analytics software. Some is an unavoidable byproduct of its extreme power and flexibility. And, as with any software, some is due to design decisions that, in hindsight, could have been better.
If you have experience with other data science tools, you may at first find R very alien. Training and documentation that leverage your existing knowledge, and that point out where your previous knowledge is likely to mislead you, can save much frustration. This is the approach I use in my books, R for SAS and SPSS Users and R for Stata Users, as well as in the workshops that are based on those books.
Below is a list of complaints about R that I commonly hear from people taking my R workshops. By listing these, I hope R beginners will be forewarned, will become aware that many of these problems come with benefits, and may consider the solutions offered by the add-on packages that I suggest. As many have said, R makes easy things hard, and hard things easy. However, add-on packages help make the easy things easy as well.
Too Many Graphical User Interfaces (GUIs)
Like most other packages, R's full power is only accessible through programming. However, unlike many others, it does not offer a standard GUI to help non-programmers do analyses. You can read about the many GUIs available in this comparison article, but having so many to choose from further complicates the learning process. In addition, the existence of multiple GUIs means that the volunteer effort spent on them is spread across many versions instead of just one.
Unhelpful Help
R’s help files are often thorough and usually contain many working examples. However, they’re definitely not written for beginners! My favorite example of this is the help file for one of the first commands that beginners learn: print. The SAS help file for its print procedure says that it “Prints observations in a SAS data set using some or all of the variables.” Clear enough. The R help file for its print function says, “print prints its argument and returns it invisibly (via invisible(x)). It is a generic function which means that new printing methods can be easily added for new classes.” The reader is left to wonder what “invisible” output looks like and what methods and classes are. The help files will tell you more about “methods” but not “classes”. You have to know to look for help on “class” to find that.
Another confusing aspect to R’s help files stems from R’s ability to add new features to existing functions as you load add-on packages. This means you can’t simply read a help file, understand it, and you’re done learning that function forever. However, it does mean that you have fewer commands to learn. For example, once you learn to use the predict function, when you load a new package, that function may gain new abilities to deal with model objects that are computed specifically within the new package.
So an R beginner has to learn much more than a SAS or SPSS beginner before he or she will find the help files very useful. However, there is a vast array of tutorials, workshops and books available, many of them free, to get beginners over this hump.
Too Many Commands
Other data science packages have relatively few analysis commands, but each of them has many options to control its output. R's approach is quite the opposite, which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, analysis of variance (ANOVA) table, and so on. However, when you create a model in R, one command (summary) will provide the parameter estimates while another (anova) provides the ANOVA table. There are other commands, such as coefficients, that display only that part of the model.
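For example, here is a minimal sketch using R's built-in mtcars data; the object name myModel is just for illustration:

myModel <- lm(mpg ~ wt + hp, data = mtcars)
summary(myModel)       # parameter estimates, R-squared, etc.
anova(myModel)         # the ANOVA table
coefficients(myModel)  # just the coefficients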
It is relatively easy to recognize the correct answer on a multiple-choice test, but much harder to recall it from scratch for an essay exam. While SAS/SPSS output may include much more than you wanted to see, it allows you to recognize the pieces of output you wanted, without having to recall the command to get it. R’s piecemeal approach to commands means you are more dependent upon recall, making it inherently harder to learn.
However, R’s focus is always on enabling each output to be used as input to further analysis. Its “get just what you ask for” approach makes this easy, and that attracts developers. People looking only to use methods programmed by others (the great majority) will then benefit from the vast array of packages that developers make available.
Misleading Function or Parameter Names
The most difficult time people have learning R is when functions don’t do the “obvious” thing. For example when sorting data, SAS, SPSS and Stata all use commands appropriately named “sort.” Turning to R, they look for such a command and, sure enough, there is one named exactly that. However, it does not sort data sets! Instead it sorts individual variables, which is often a very dangerous thing to do. In R, the order function sorts data sets and it does so in a somewhat convoluted way. However, the dplyr package has an arrange function that sorts data sets and it is quite easy to use.
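Here is a rough sketch of the difference, assuming a hypothetical data frame mydata containing a variable x:

sort(mydata$x)                       # sorts one variable, not the data set
mydata <- mydata[order(mydata$x), ]  # base R: sort the data set by x
library("dplyr")
mydata <- arrange(mydata, x)         # dplyr: the same thing, more readably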
Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed by R. Other packages require sorting before they can do three common tasks: (1) summarizing / aggregating data, (2) repeating an analysis for each group (“by” or “split-file” processing) and (3) merging files by key variables. R does not need the user to explicitly sort datasets before performing any of these tasks!
Another command that commonly confuses beginners is the simple “if” function. While it is used to recode variables (among other tasks) in other software, in R if controls the flow of commands, while ifelse performs tasks such as recoding.
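A small sketch of that distinction, assuming a hypothetical numeric variable score in mydata:

# ifelse() recodes values element by element; if() only controls program flow
mydata$result <- ifelse(mydata$score >= 60, "pass", "fail")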
Inconsistent Function Names
All languages have their inconsistencies. For example, it took SPSS developers decades before they finally offered a syntax-checking text editor. I was told by an SPSS insider that they would have done it sooner if the language hadn’t contained so many inconsistencies. SAS has its share of inconsistencies as well, with OUTPUT statements for some procedures and OUT options on others. However, I suspect that R probably has far more inconsistencies than most since it lacks a consistent naming convention. You see names in: alllowercase, period.separated, underscore_separated, lowerCamelCase and UpperCamelCase. Some of the built-in examples include:
names, colnames
row.names, rownames
rowSums, rowsum
rowMeans (no parallel rowmean exists)
browseURL, contrib.url, fixup.package.URLs
package.contents, packageStatus
getMethod, getS3method
read.csv and write.csv, load and save, readRDS and saveRDS
Sys.time, system.time
When you include add-on packages, you can come across some real “whoppers!” For example, R has a built-in reshape function, the Hmisc package has a reShape function (case matters), and there are both reshape and reshape2 packages that reshape data, but neither of them contain a function named “reshape”! The most popular R package for reshaping data is tidyr, with its gather and spread commands.
Inconsistent Syntax
Since everyone is free to add new capabilities to R, the resulting code across different R packages is often a bit of a mess. For example, the two blocks of code below do roughly the same thing using radically different syntaxes. This type of inconsistency is common in R, and there is no way to get around it, given that anyone can extend the language as they like.
library("Deducer")
descriptive.table(
  vars = d(mpg, hp),
  data = mtcars,
  func.names = c("Mean", "Median", "St. Deviation",
                 "Valid N", "25th Percentile", "75th Percentile"))
library("RcmdrMisc")
numSummary(
  data.frame(mtcars$mpg, mtcars$hp),
  statistics = c("mean", "sd", "quantiles"),
  quantiles = c(.25, .50, .75))
Dangerous Control of Variables
In R, a single analysis can include variables from multiple data sets. That usually requires that the observations be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they were convinced had the exact same observations in precisely the same order. However, a quick check on the files showed that they usually did not match up! It's always safer to merge by key variables (like ID) when you can. So by enabling such analyses, R seems to be asking for disaster, and I recommend merging files by key variables before doing an analysis whenever possible.
Why does R allow such a dangerous operation? Because it provides useful flexibility. For example, from one dataset, you might plot regression lines of variable x against variable y for each of three groups on the same plot. From another dataset you might get group labels to add directly on the plot. This lets you avoid a legend that makes your readers look back and forth between the legend and the lines. The label dataset would contain only three variables: the group labels and their x-y locations. That’s a dataset of only 3 observations so merging that with the main data set makes little sense.
Inconsistent Ways to Analyze Multiple Variables
One of the first functions beginners typically learn is summary(x). As you might guess, it gets summary statistics for the variable x. That’s simple enough. You might guess that to analyze two variables, you would just enter summary(x, y). However, many functions in R, including this one, accept only single objects. The solution is to put those two variables into a single object such as a data frame: summary(data.frame(x,y)). So the generalization you need to make is not from one variable to multiple variables, but rather from one object (a variable) to another object (a dataset).
If that were the whole story, it would not be that hard to learn. Unfortunately, R functions are quite inconsistent in both what objects they accept and how many. In contrast to the summary example above, R's max function can accept any number of variables separated by commas. However, its cor function cannot; the variables must be in a matrix or a data frame. R's mean function accepts only a single variable and cannot directly handle multiple variables even if they are in a single data frame. The popular graphing package ggplot2 does not accept variables unless they're combined into a data frame. These frustrating inconsistencies simply need to be memorized.
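The sketch below illustrates those inconsistencies, assuming x and y are hypothetical numeric vectors:

summary(data.frame(x, y))       # summary() needs the variables in one object
max(x, y)                       # max() takes any number of variables directly
cor(data.frame(x, y))           # cor() needs a matrix or data frame
mean(x)                         # mean() takes only one variable at a time...
sapply(data.frame(x, y), mean)  # ...so you apply it to each column instead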
Overly Complicated Variable Naming and Renaming Process
People are often surprised when they learn how R names and renames its variables. Since R stores the names in a character vector, renaming even one variable means that you must first locate the name in that vector before you can put the new name in that position. That’s much more complicated than the simple newName=oldName form used by many other languages.
While this approach is more complicated, it offers great benefits. For example, you can easily copy all the names from one dataset into another. You can also use the full range of string manipulations (such as regular expressions) allowing you to use many different approaches to changing names. Those are capabilities that are either impossible or much more difficult to perform in other software.
If the data are coming from a text file, I recommend simply adding the names to a header line at the top of the file. If you need to rename them later, I recommend the dplyr package’s rename function. You can switch to R’s built-in approach when you need more flexibility.
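Here is a brief sketch of both approaches; mydata, oldName, and newName are hypothetical:

# Base R: find the old name in the names vector, then overwrite it
names(mydata)[names(mydata) == "oldName"] <- "newName"

# dplyr: rename takes newName = oldName pairs
library("dplyr")
mydata <- rename(mydata, newName = oldName)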
Poor Ability to Select Variables
Most data science packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z), that share a common prefix (e.g. varA, varB,…) or that contain numeric suffixes (e.g. x1-x25, not necessarily next to one another). R generally lacks a built-in ability to make these selections easily, but the dplyr package’s select function is both easy and powerful.
R's built-in approach does offer a significant advantage: with a bit of additional programming, it allows far more flexible variable selection than other software. However, those shortcuts and more are provided by the dplyr package's select function.
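A few sketches of what select can do, using hypothetical variable names:

library("dplyr")
select(mydata, A:Z)                   # variables stored next to one another
select(mydata, starts_with("var"))    # a common prefix: varA, varB, ...
select(mydata, num_range("x", 1:25))  # numeric suffixes: x1 through x25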
Too Many Ways to Select Variables
If variable x is stored in mydata and you want to get the mean of x, most software only offers you one way to do that, such as “VAR x;” in SAS. In R, you can do it in many different ways:
summary(mydata$x)
summary(mydata$"x")
summary(mydata["x"])
summary(mydata[,"x"])
summary(mydata[["x"]])
summary(mydata[1])
summary(mydata[,1])
summary(mydata[[1]])
with(mydata, summary(x))
attach(mydata)
summary(x)
summary(subset(mydata, select=x))
library("dplyr")
summary(select(mydata, x))
mydata %>% with(summary(x))
mydata %>% summary(.$x)
mydata %$% summary(x)
To add to the complexity, if we simply asked for the mean instead of several summary statistics, several of those approaches would generate error messages because they select x in a way that the mean function will not accept.
To make matters even worse, the above examples work with data stored in a data frame (what most software calls a dataset), but not all of them work when the data are stored in a matrix.
Why are there so many ways to select variables? R has many data structures that give it great flexibility, and each can use slightly different approaches to variable selection. Fully integrating matrix algebra capabilities into the main language, rather than keeping them in a separate language such as SAS/IML, also requires that you see more of what other software hides until you need it. The last few examples above come from the dplyr package, which makes variable selection much easier, but of course that also means having to learn more.
Too Many Ways to Transform Variables
Data science software typically offers one way to transform variables: SAS has its data step, SPSS has the COMPUTE statement, and so on. R has several approaches. Here are some of the ways R can create a new variable named "mysum" by adding two variables, x and y:
mydata$mysum <- mydata$x + mydata$y
mydata$mysum <- with(mydata, x + y)
mydata["mysum"] <- mydata["x"] + mydata["y"]
attach(mydata)
  mydata$mysum <- x + y
detach(mydata)
mydata <- within(mydata, mysum <- x + y)
mydata <- transform(mydata, mysum = x + y)
library("dplyr")
mydata <- mutate(mydata, mysum = x + y)
mydata <- mydata %>% mutate(mysum = x + y)
Some are variations on the variable selection methods, such as the use of the attach function, which, for transformation purposes, also requires the "mydata$" prefix (or an equivalent form) to ensure that the new variable gets stored in the original data frame. Leaving that out is a major source of confusion, as beginners assume the new variable will go into the attached data frame (it won't!). The use of the "within" function parallels the use of the similar "with" function for variable selection, but it allows variable modification while "with" does not.
The cost of this situation is clear. The benefit comes from the integration of multiple types of commands (macro, matrix, etc.) and data structures.
Not All Functions Accept All Variables
In most data science software, a variable is a variable, and all procedures accept them. In R, however, a variable could be a vector, a factor, a member of a data frame, or even a component of a complex structure called a list. For each function you have to learn what it will accept for processing. For example, most simple statistical functions for the mean, median, etc. will accept variables stored as vectors. They'll also accept variables in datasets or lists, but only if you select them in such a way that they become vectors on the fly.
This complexity is again the unavoidable byproduct of R’s powerful set of data structures which includes vectors, factors, matrices, data frames, arrays and lists (more on this later). Despite adding complexity, it offers a wide array of advantages. For example, categorical variables that are stored as factors can be included in a regression model and R will automatically generate the dummy or indicator variables needed to make such a variable work well in the model.
Confusing Name References
Names must sometimes be enclosed in quotes, but at other times must not be. For example, to install a package and load it into memory you can use these inconsistent steps:
install.packages("dplyr") library(dplyr)
You can make those two commands consistent by adding quotes around “dplyr” in the second command, but not by removing them from the first. To remove the package from memory you use the following command, which refers to the package in yet another way (quotes optional):
detach(package:dplyr)
Poor Ability to Select Data Sets
The first task in any data analysis is selecting a data set to work with. Most other software has a way to specify the data set to use that is (1) easy, (2) safe and (3) consistent. R offers several ways to select a data set, but none that meets all three criteria. Referring to variables as mydata$myvar works in many situations, but it's not easy since you end up typing "mydata" over and over. R has an attach function, but its use is quite tricky, giving beginners the impression that new variables will be stored there (by default, they won't) and that R will pick variables from that data set before it looks elsewhere (it won't). Some functions offer a data argument, but not all do. Even when a function offers one, it only works if you also specify a model formula, and some analyses (e.g. paired tests) don't use formulas.
If this is so easy in other software but so confusing in R, what’s the point? Part of it is the price to pay for great flexibility. R is the only software I know of that allows you to include variables from multiple datasets in a single analysis. So you need ways to change datasets in the middle of an analysis. However, part of that may simply be design choices that could have been better in hindsight. For example, it could have been designed with a data argument in all functions, with a system-wide option to look for a default data set, as SAS has.
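The sketch below uses R's built-in sleep data set to show the data argument working with a formula, and the fallback when there is no formula:

# The data argument works when a model formula is supplied:
t.test(extra ~ group, data = sleep)
# A paired test has no formula, so data= is unavailable and you fall
# back on with() or the sleep$ style instead:
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))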
Loop-a-phobia
R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are how other languages apply the same function to multiple variables, this seems strange. R instead uses its "apply" family of functions: you tell R to apply the function to either rows or columns. It's a mental adjustment to make, but the result is the same, and apply calls are often easier to write and understand than the equivalent loop.
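For example, using the built-in mtcars data:

sapply(mtcars, mean)     # apply mean() to every variable (column)
apply(mtcars, 2, mean)   # the same: 2 means "work across columns"
apply(mtcars, 1, mean)   # 1 means "work across rows" (one value per car)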
Functions That Act Like Procedures
Many other packages, including SAS, SPSS, and Stata, have procedures or commands that do typical data analyses, going "down" through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions, and many of them can do both. How does it get away with that? Functions may have a preference to go down rows or across columns, but for many functions you can use the "apply" family of functions to force them to go in either direction. So it's true that in R, functions act like both the procedures and the functions of other packages. If you're coming to R from other software, that's a radically new approach.
Odd Treatment of Missing Values
In all data analysis packages that I’m aware of, missing values are treated the same: they’re excluded automatically when (1) selecting observations and (2) performing analyses. When selecting observations, R actually inserts missing values! For example, say you have this data set:
Gender  English  Math
male         85    82
male         72    87
             75    81
female       77    78
female       98    91
If you select the males using mydata[mydata$Gender == "male", ], R will return the top three lines, substituting its missing value symbol, NA, in place of the values 75 and 81 for the third observation. Why create missing values where there were none before? It's as if the R designers considered missing values to be unusual and thought you needed to be warned of their existence. In my experience, missing values are so common that when I get a data set that appears to have none, I'm quite suspicious that someone has failed to set some special code to be missing. There's a solution to this in R, which is to ask which of the observations make the logic true, with mydata[which(mydata$Gender == "male"), ]. The dplyr package also avoids this with the function filter(mydata, Gender == "male").
When performing more complex analyses, using what R calls "modeling functions," missing values are excluded automatically. However, when it comes to simple functions such as mean or median, R does the reverse, returning a missing value result if the variable contains even a single missing value. You get around that by specifying na.rm=TRUE on every function call, but why should you have to? While there are system-wide options you can set for many things, such as the width of output lines, there's no option to avoid this annoyance. It's not hard to create your own function that removes missing values by default, but that seems like overkill for such a simple annoyance.
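Such a home-made function might look like the following sketch; the name myMean is just an illustration:

mean(c(1, 2, NA))                # returns NA
mean(c(1, 2, NA), na.rm = TRUE)  # returns 1.5
myMean <- function(x, ...) mean(x, na.rm = TRUE, ...)  # removes NAs by default
myMean(c(1, 2, NA))              # returns 1.5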
Neither of these conditions seems to offer any particular benefit to R. They’re minor inconveniences that R users learn to live with.
Odd Way of Counting Valid or Missing Values
The one function that could really benefit from excluding missing values, the length function, cannot exclude them! While most packages include a function named something like n or nvalid, R's approach to counting valid responses is to (1) check whether each value is missing with the is.na function, then (2) use the not operator "!" to find the non-missing values. You have to know that (3) this generates a vector of TRUE/FALSE values with numeric values of 1/0, respectively, which you then (4) add up with the sum function. That's an awful lot of complexity compared to n(x). However, it's easy to define your own n function, or you can use add-on packages that already contain one, such as the prettyR package's n.valid function.
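Here is that idiom, along with a home-made n function of the kind described above (the name is just a suggestion):

x <- c(4, NA, 7, NA, 9)
sum(!is.na(x))                   # number of valid values: 3
sum(is.na(x))                    # number of missing values: 2
n <- function(x) sum(!is.na(x))  # wrap the idiom in your own n() function
n(x)                             # 3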
Too Many Data Structures
As previously mentioned, R has vectors, factors, matrices, arrays, data frames (datasets) and lists. And that’s just for starters! Modeling functions create many variations on these structures and they also create whole new ones. Users are free to create their own data structures, and some of these have become quite popular. Along with all these structures comes a set of conversion functions that switch an object’s structure from one type to another, when possible. Given that so many other analytics packages get by with just one structure, the dataset, why go to all this trouble? If you added the various data structures that exist in other packages’ matrix languages, you would see a similar amount of complexity. Additional power requires additional complexity.
Warped Dimensions
Two-dimensional objects easily become one-dimensional objects. For example, this way of selecting males uses the variable “Gender” as part of a two-dimensional data frame, so you need to have a comma follow the logical selection:
with(Talent[Talent$Gender == "Male", ],
  summary(data.frame(English, Reading)))
But in this very similar way of getting the same thing, Gender is selected as a one-dimensional vector and so adding the comma (which implies a second dimension) would generate an error message:
with(Talent,
  summary(data.frame(English[Gender == "Male"],
                     Reading[Gender == "Male"])))
Sparse Output
R’s output is often quite sparse. For example, when doing cross-tabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason for this is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a cross-tabulation is as simple as barplot(table(a,b)). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However there are add-on packages that provide more comprehensive output that is essentially identical to that provided by other packages. Some of them are shown here.
Unformatted Output
The default output from SAS and SPSS is nicely formatted, making it easy to read and to paste into your word processor as fully editable tables. All future output can be set to journal style (or many other styles) simply by setting an option. R not only lacks true tabular output by default, it does not even provide tabs between columns. So beginners commonly paste the output into their word processor and select a mono-spaced font to keep the columns aligned.
R does have the ability to get its output looking good, but it does so using additional packages such as compareGroups, knitr, pander, sjPlot, sweave, texreg, xtable, and others. In fact, its ability to display output in complex tables (e.g. showing multiple models) is better than any other package I’ve seen.
Complex By-Group Analysis
Most packages let you repeat any analysis simply by adding a command like “by group” to it. R’s built-in approach is far more complex. It requires you to create a macro-like function that does the analysis steps you need. Then you apply that function by group. Other languages let you avoid learning that type of programming until you’re doing more complex tasks. However, the deeper integration of such macro-like facilities in R means that the functions you write are much more integrated into the complete system.
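Here is a minimal sketch of both styles using the built-in mtcars data; myStats is just an illustrative name:

# Base R: write a function, then apply it to each group
myStats <- function(df) summary(df$mpg)
by(mtcars, mtcars$cyl, myStats)

# dplyr: group the data, then summarize, with no separate function needed
library("dplyr")
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))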
Complex & Inconsistent Output Management
Strongly related to by-group processing (above) is output management. Printed output from an analysis isn’t always the desired end result. Output often requires further processing. For example, at my university we routinely do the same salary regression model for each of over 250 departments. We don’t care about most of that output, we only want to see results for the few departments whose salaries seem to depend on gender or ethnicity. We’re hoping to find none of course!
Using base R, here are the steps required:
1. Create a function that does the regression.
2. Apply the function using one of: by, tapply, plyr’s dlply.
3. Save the output to a list.
4. Study the structure of that list to figure out where the needed values are stored.
5. Write an extractor function to pull out the needed values.
6. Apply that function using lapply or plyr’s ldply to create a useful data frame.
While these are the general steps, each type of model function creates a list that’s unique, with pieces of output stored inside the list in many types of structures, and labeled in inconsistent ways. For example, p-values from three different types of analysis might be labeled, “p”, “pvalue”, or “p.value”. Luckily an add-on package named broom has commands that will convert the output you need into data frames, standardizing the names as it does so.
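Here is a rough sketch of those steps, assuming a hypothetical data frame named salaries containing salary, gender, ethnicity, and dept (the names are illustrative only):

library("broom")
# Steps 1-3: a regression function, applied per department, saved to a list
models <- lapply(split(salaries, salaries$dept),
                 function(d) lm(salary ~ gender + ethnicity, data = d))
# Steps 4-6: broom's tidy() extracts each model's coefficients into a data
# frame with standardized names (term, estimate, p.value, ...)
results <- do.call(rbind, lapply(models, tidy))
subset(results, term != "(Intercept)" & p.value < 0.05)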
R’s complexity here is not without its advantages. For example, it would be relatively easy to do two regression models per group, save the output, and then do an additional set of per-group analysis that compared the two models statistically. That would be quite a challenge for most statistics packages (see the help file of the dplyr package’s “do” function for an example.)
Unclear Way to Clear Memory
Another unusually complex approach R takes is the way it clears its workspace in your computer's memory. While a simple "clear" command would suffice, the R approach is to ask for a listing of all objects in memory and then delete that list: rm(list = ls()). What knowledge is required to understand this command? You have to know (1) that ls() lists the objects in memory, (2) what a character vector is, because (3) ls() returns the object names in a character vector, (4) that rm removes objects, and (5) that the "list" argument is not really asking for information stored in an R list, but rather in a character vector. That's a lot of things to know for such a basic command!
This approach does have its advantages though. The command that lists objects in memory has a powerful way to search for various patterns in the names of variables or data sets that you might like to delete.
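For example, assuming you had temporary objects whose names start with "temp":

rm(list = ls())                   # clear every object from the workspace
rm(list = ls(pattern = "^temp"))  # or remove only objects whose names start with "temp"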
Luckily, the popular RStudio front-end to R offers a broom icon that clears memory with a single click.
Identity Crisis
All analytics software has names for its variables, but R is unique in that it also names its rows. This means you must learn how to manage row names as well as variable names. For example, when reading a comma-separated-values file, variable names often appear as the first row, and all analytics software can read those names. An identifier variable such as ID, Personnel_Num, etc. often appears in the first column of such files. Using R's read.csv() function, if that variable is not named, R will assume it's an ID variable and will convert its values into row names. However, if it is named – as all other software would require – then you must add an option to tell R to put its values, or the values of any other variable you choose, into the row names position. Once you do that, the name of the original ID variable vanishes. The benefit of this odd behavior is that when analyses or plots need to identify observations, R automatically knows where to get those names. This saves R from needing the SAS equivalent of the ID statement (e.g. in PROC CLUSTER).
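Here is a small sketch of the two situations, using a hypothetical file named mydata.csv whose first column is named ID:

mydata <- read.csv("mydata.csv")                    # ID remains a regular variable
mydata <- read.csv("mydata.csv", row.names = "ID")  # ID's values become the row names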
While this looks like a worthwhile trade-off, it is complicated by the fact that row names must be unique. That means you cannot maintain the original row names when you stack two files that have the same variables, as when you measured the same observations at two times, perhaps before and after some treatment. It also means that combining by-group output can be tricky, though the broom package takes that into account for you. The popular dplyr package replaces row names with character values of the consecutive integers 1, 2, 3…. My advice is to handle your own ID variables as standard variables and put them into the row names position only when an R function offers you some benefit in return. (For a dplyr example, see The Tidyverse Curse below.)
The Tidyverse Curse
There's a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows it is in the 99.99th percentile of downloads (as of 7/20/2019). As much of a blessing as these commands are, they're also a curse to beginners because they add more to learn. The main tidyverse packages (excluding ggplot2) contain over 700 functions, though I use "only" around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let's look at an example that displays the built-in mtcars data set using R's built-in print function:
> print(mtcars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...
We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I'm taking mtcars and sending it, using the pipe operator "%>%", into the as_tibble function to convert it to a special type of tidyverse data frame called a "tibble", which prints better. From there I send it to the print function (that's R's default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R's default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).
> library("dplyr")
> mtcars %>%
+   as_tibble() %>%
+   print()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:
> library("dplyr")
> mtcars %>%
+   as_tibble() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) :
  could not find function "rownames_to_column"
Oops, I got an error message: the function wasn't found. I had remembered the right command, and the dplyr step did cause the car names to vanish, but rownames_to_column lives in the tibble package, which I "forgot" to load. So let's load that too (dplyr is already loaded, but I'm listing it again here just to make each example stand alone):
> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_tibble() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
             rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
               <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.
In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them, because they're now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in row.names function or the tibble package's column_to_rownames function to restore the names to their previous position.
Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS' "ID rowname;" statement. That's less to learn.
This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.
Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
                                                          lyrics
                                                           <chr>
1 a long long time ago i can still remember how that music used
The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
... <truncated>
1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...
These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort. For a more detailed critique of the tidyverse, read the TidyverseSkeptic. A helpful set of common tasks done in both Base R and tidyverse commands is listed here.
Lack of Beneficial Constraints
To this point, I have listed the main aspects of R that confuse beginners. Many of them are the result of R's power and flexibility. R tightly integrates commands for (1) data management, (2) data analysis and graphics, (3) macro facilities, (4) output management and (5) matrix algebra capabilities. That combination, along with its rich set of data structures and the ease with which anyone can extend its capabilities, makes R a potent tool for data science.
However, the converse of that is provided by its competitors, such as SAS, SPSS and Stata. A notable advantage that they offer is “beneficial constraints.” These packages use one kind of input, provide one kind of output, provide very limited ways to select variables, provide one way to rename variables, provide one way to identify cases, provide one way to do by-group analyses, and are limited in a variety of other ways. Their macro and matrix languages, and output management systems, are separated enough as to be almost invisible to beginners and even intermediate users.
Those constraints limit what you can do and how you can do it, making the software much easier to learn and use, especially for those who only occasionally need the software. They provide power that’s perfectly adequate to solve any problem for which they have pre-written solutions. And the set of solutions they provide is rich. (Stata also provides an extensive set of user-written solutions.)
Conclusion
There are many aspects of R that make it hard for beginners to learn. Some of these are due to R's unique nature, some are due to its power, flexibility and extensibility. Others are due to aspects of its design that don't really offer benefits. However, R's power, free price, and open source nature attract developers, and the resulting wide array of add-on tools has produced software that is growing rapidly in popularity.
R has been my main data science tool of choice for over a decade now, and I have found its many benefits to be well worth its peccadilloes. If you currently program in SAS, SPSS, or Stata and find that comfortable, then I recommend giving R a try to see how you like it. I hope this article will help prepare you for some of the pitfalls that you're likely to run into by explaining how some of them offer long-term advantages, and by offering a selection of add-on packages to help ease your transition.
However, if you use one of those languages and view it as challenging, then learning R may not be for you. Don't fret; you can enjoy the simplicity of other software while calling R functions from that software, as I describe here. If you're using menus and dialogs in software like JMP or SPSS, you can still take advantage of R's power through one of its graphical user interfaces.
If you’re already an R user and I’ve missed any of your pet peeves, please list them in the comments section below.
If your organization is looking for training in the R language, you might consider my books, R for SAS and SPSS Users or R for Stata Users, or my on-site workshops.
Acknowledgements
Thanks to Patrick Burns, Tal Galili, Joshua M. Price, Drew Schmidt, and Kevin Wright for their suggestions that improved this article.
A very valuable article Bob, thanks so much!
I could go into many problems with other traditional analytics languages and even new analytics tools, but those are all created by software companies with centralized control and paid developers. Frankly, it’s amazing how well many of the libraries of R work together while maintaining much of the original flexibility!
I know that many of the current leaders in the R development world are attempting to address these issues with “modernized”, more broadly comprehensible libraries suitable for traditional R users and less technical types. I am grateful for the incredible effort invested by so many to make R what it is today – simply the best platform for addressing a wide range of real-world analytic issues in a deep, comprehensive manner.
Hi Stephen,
I’m glad you liked it. It is an amazing achievement that something extended by so many people works as well as it does. To counterbalance this I should write a condensed version of Chapter 1 of my books on “Why R is Awesome!”
Cheers,
Bob
Ironically, by addressing the frustrations of R in this article, you will help many more see that “R is Awesome!” You will also help improve the adoption rate and decrease the chances of incorrect analyses.
I hope so! More importantly, I hope it helps people choose the tool that’s right for them.
Thanks for the article. I think there is a missing closing parenthesis in
summary(mydata$"x"
Hi Beliavsky,
Nice catch, thanks!
Cheers,
Bob
I agree with your complaint about unformatted output. There are many aspects to that, one of which is the over-use of scientific format for numbers. I created a package called "lucid" that provides a function for cleaner printing of floating point numbers.
A short vignette provides examples:
http://cran.r-project.org/web/packages/lucid/lucid.pdf
Oops, this is the lucid vignette:
http://cran.r-project.org/web/packages/lucid/vignettes/lucid.pdf
Hi Kevin,
I really like your lucid package & strongly recommend it, especially for people who deal with very large and very small numbers.
Cheers,
Bob
Nice article! As a side note: For “mydata %$% summary(x)” to work, you’ll (currently) need to load the magrittr package (>= 1.5). With dplyr (0.3.0.2) alone, it won’t work.
Hi beginneR,
Since you know that, you clearly are not an R beginneR. Thanks for pointing it out. This is up-to-the-minute info!
Cheers,
Bob
Excellent article!
Hi Thomas,
Thanks for the kind feedback!
Cheers,
Bob
Do you have any recommendations for resources to learn R? In my field, it is valuable to learn R, but I am still stuck on Excel. I know I can pick it up, but need the right guide! That is, in addition to your work! I am looking for any resources possible. Thank you.
Hi Kevin,
I’ve listed my favorite references in the “blogroll” on the right side of the screen towards the top. Those sites are all excellent sources of R info.
Cheers,
Bob
Thank you. Enjoyed the article. Great work.
brilliant work. your writing is an inspiration. very nicely done
Hi Ajay,
Thanks! I just saw your second book is about to be published. Way to go! Keep up your good work at http://Decisionstats.com.
Cheers,
Bob
Hi Bob,
I really enjoyed the article and found it very informative – I was nodding my head and agreeing when I read it.
Cheers
Ken
Hi Ken,
I’m glad you found it informative. It’s the collective frustration of thousands of workshop attendees!
Cheers,
Bob
Excellent article. In fact, it provides a very good bird's eye view of R. I found that in order to be a good programmer in any language, you need to be critical of some features and see how you would have written those constructs. Coming from other programming languages, I found R a weird language to start with. But as and when you start focusing on the analytics portion, you will find that its advantages far outweigh the disadvantages. With support packages like shiny, the brilliant plyr, ggplot, etc., developing data products is becoming easy. With more packages to integrate with other back end languages (C, C++, etc.), including cloud systems and front end web based languages or other scripts, R can be a very good sandwich for analytics. I personally feel (not sure how possible it is) it should start focusing on how to utilize the set of features available in other languages and scripts, providing a pipe, rather than redoing everything on its own in trying to become an all-in-one analytics tool.
Thank you for your sharing your insights.
Hi Sriram,
Thanks for your comments. I also found R to be weird to start with. I would alternate between thinking it was horrible one day and then discovering a feature that was just brilliant the next. For people who are happy programming, it’s well worth the time it takes to get used to it. For people who prefer to point-and-click (many of you SPSS users know who I’m talking about) then R’s probably not a good choice.
Cheers,
Bob
Bob:
Good article — I have printed it to review again later.
The Coursera specialist stream seems like a good idea for those needing help with a beginning effort at programming.
Hi WillR,
Thanks for the tip!
Cheers,
Bob
Excellent article!
Random trivia, around the “Inconsistent Function Names” discussion – I think this comes from two problems. The first is we still have no style guide, a la pep8, and so naming is always going to be markedly inconsistent between libraries and even within core. The second is that the sheer amount of evolution R’s syntax has gone through creates a vast array of naming conventions within base R, each one adapted to avoid (now-historical) problems.
For example: mean(x, na.rm = FALSE). Why na.rm? Why use the full stop, rather than an underscore? Well, in S-PLUS (and early R), the underscore was used for assignation – and even though that’s no longer the case, a lot of R’s functionality dates from that period (or earlier, because it was ported from S). So that’s no longer a problem programmers have to solve, but it’s a problem a lot of code is written around – and people learn the styles of a language by looking at how the people before them implemented things, whether or not it makes things readable. And then some people use underscores to try to increase readability, and.. http://xkcd.com/927/
Hi Oliver,
Good point & hilarious cartoon!
Cheers,
Bob
Bob, thank you for all your efforts at instruction in R.
What would it take to write a package, perhaps called erroR, so when some call goes south, you could enter guideme(), the core function of erroR, and R would take the cryptic error message and explain something in human terms about what it is telling the user and what might be some solutions.
This would interest me quite a bit to help on.
Hi Rees,
I think it would be a herculean task. Often the messages make no sense unless you dig inside the function returning the message to see that it’s talking about some other function call within that function. You’d be quite the hero if you could pull it off though!
Cheers,
Bob
I agree with Stephen: if you read the comments in detail, it shows that in some (most..) aspects R is in fact awesome. However, the way the article is written rather suggests that it is awful, which I regret. If users of other tools are confused because their approaches are simply not needed any more (like the “sort” example), it’s not the fault of R. And I am so thankful that I can address my variable by name or column as it suits the setting best, and that I do not have to use an “ifelse” for recoding, and that it warns me about missing values instead of just letting them drop out more or less silently, just as a few examples.
Hi Stefanie,
I agree that R is awesome. I use it for all my own analysis. I hope that by pointing out the aspects of it that cause beginners frustration that they’ll be able to get past those areas more easily. On the other hand, if they view a simpler language as being difficult, then they’re probably not going to be happy switching to R.
Cheers,
Bob
Closures have been badly implemented. Here is a sample taken from https://class.coursera.org/rprog-014/quiz:
Consider the following function
f <- function(x) {
g <- function(y) {
y + z
}
z <- 4
x + g(x)
}
z <- 10
f(3)
I would have expected z in g to be 10 or even better to indicate undefined variable z. But it is 4; because variables are mutable. Does not look like lexical scoping to me.
Hi Joe,
Thanks for the example!
Cheers,
Bob
Why would you expect z to be 10 in g? g’s parent environment is the environment of f, in which z is defined to be 4. Could you explain further how this constitutes a “bad” implementation of closures?
Nice article! I am a beginner in R. I started learning R to do some statistical work in my research, but I found some inconsistencies in it which are different from other programming languages. One of them is giving names to data using the names() function. In most programming languages, assignment is targeted at a variable, but in R, to give names to data you use "names(var) <- c(some names)", which is very unusual and irrational compared to common programming languages.
Anyway, nice article!
Hi Yunarsoa,
That surprised me too when I started with R. However, when I need to copy a set of names from one data set to another, it’s fantastic!
names(NewDataSet) <- names(OldDataSet)
Cheers,
Bob
Great overview, I also nodded often.
But, how to deal with it?
I teach R for about 6 years now in a class compulsory for psychology students. This year I changed the paradigm.
a) Reproducible Research, e.g. Markdown documents within RStudio from lesson 1
b) dplyr and tidyr from lesson 1
c) Don’t be afraid to use functions (own functions)
Right now (lesson 9) we did something like:
presentAndPassed = 1, na.rm = TRUE)
Class %>%
select( starts_with("Points"), starts_with("week") ) %>%
transmute ( Mn = apply( ., mean, na.rm = TRUE ),
present = apply( ., presentAndPassed ) %>%
data.frame( Class, . ) -> Class
We do a lot of drawing, since this approach follows set theory. I was surprised that you could show code like the above and people could read it off and explain it like a story. And –
* you almost never have more than 2 closing parentheses
* you almost never have temporary objects
* you can develop/use the name of your main dataframe continuously since in a document to be knitted it will always develop top to bottom
However, sometimes we will have to leave this approach, the psych package, that we heavily use is not really compatible, thus we prepare the data outside and then jump in…
One remark at the end. At the beginning I taught R to former SPSS users, thus I had to mimic SPSS all the time to make it seem comparable. Now I use some operations right from the beginning, that are very cumbersome to do with SPSS like create random variables.
The development R takes is impressive.
All the best to all of you,
Walter.
Hi Walter,
Thanks for all your interesting comments!
Cheers,
Bob
Hi Bob,
Thanks for sharing your insights and for the wonderfully-written piece. You have a gift for simplifying an otherwise dry topic.
I’ve been using SAS for a few years now for statistical analyses in public health research (regressions, survival analyses, etc.) and want to explore R as an alternative.
Would you be kind enough to suggest a “step-by-step” plan for someone looking to learn R from scratch? What resources would you recommend? I learn best with ample examples and have found Ron Cody’s books really helpful when I was learning SAS. I don’t have time to enroll in any courses that require physically attending classes, unfortunately.
Thanks again!
Hi Random Dude,
Love your name! My book, R for SAS and SPSS Users, is aimed at people who know either SAS or SPSS (they’re similar in many ways) and who are starting to learn R. It starts at ground zero and presents things in easy-to-absorb steps. The code for almost everything is shown in both R and SAS so you can compare them. Much of the trouble I had learning R was due to the fact that I kept expecting it to work like SAS, so I start every topic by saying how SAS (and SPSS) does it, then move on to how R does it and why. I think you’d enjoy the perspective.
Much of the material is also covered in this course at DataCamp.com: https://www.datacamp.com/courses/r-for-sas-spss-and-stata-users-r-tutorial. I also teach that topic on visits to organizations that are migrating to R from SAS, SPSS, or Stata.
Cheers,
Bob
Thank you for this – it validates many of my opinions. I enjoy using R and find it to be very useful and unique. However, as someone with an advanced degree in human computer interaction, using R is often maddening. There are endless violations of consistency and other basic principles of user interaction. Python is an interesting contrast in being a language that strives to be as simple and consistent as possible while still being extremely powerful.
Hi Fred,
I think Python’s an excellent language. However, it’s way behind R in terms of contributed packages. The Julia language has much of Python’s simplicity and consistency, and is much faster. It will be interesting to see how they compete in the future.
Cheers,
Bob
Having used R daily for several years now, I can’t believe this is my first time coming across this excellent article. The points you lay out here are things an experienced user might easily forget when training new comers to the language.
I have to take issue with your remark in the “Odd Treatment of Missing Values” section, where you claim “Neither of these conditions seems to offer any particular benefit to R”. Personally, I’ve found the way that the summary and comparison functions such as `mean`, `==`, `sum`, `sd`, etc. draw attention to missing values to be very useful in complex data pipelines, where, say, your initial data set may be NA-free, but subsequent processed iterations of which could result in NAs. If your processing generates NAs, the behavior of these functions forces you to recognize the NAs and explicitly address them in the code. If I see NAs in my summarized output that I was not expecting, I know something went wrong and I need to back track. Furthermore, having explicit `na.rm=TRUE` arguments in your code signals to someone else reading it that you know about the possibility of missing values and have accounted for it, which saves them the trouble of verifying this.
In the row indexing case you describe, I think R's behavior is right if you think about it. Checking equality against an NA value should return NA. NA is missing data, so we cannot know whether equality does or does not hold. And row indexing a data.frame with NA ought to give you a row with NAs for all variables. You wouldn't want it to give you 0 rows. That isn't what you asked for. I think the confusion comes from the fact that people tend to use the equality operator as a singleton "element of" operator. But that's not what it is. When they get used to thinking about all atomic values as containers, then the correct way of indexing with missing values becomes obvious: `mydata[mydata$Gender %in% 'male', ]`
Hi Matthew,
Thanks for your well-argued comment! I agree that the way R does things makes sense; my main complaint is that SAS and SPSS set a standard (like it or not) that all other stat packages followed. Since R took a different route, it ends up being another thing to re-learn.
Cheers,
Bob
Hi,
This is undoubtedly an article worth reading. But, as surprising as it may sound, I feel sad now that I’ve finished it, especially because of one of its final conclusions: “However, if you use one of those languages and view it as challenging, then learning R may not be for you”. This is exactly what I’m currently experiencing during my six-month internship. Just to put things in perspective: one month ago, I had the choice between the internship I’m doing now, dealing with data mining techniques for sales and customer relations, and another one dealing with the construction of a database system. The first is based mainly on R, while the second guaranteed training in SAS. I decided to choose the first one for its rich statistical content, starting from the assumption that it would eventually be more rewarding. Now that I’ve done one month of it, it’s a pure catastrophe. I’ve had to learn R on my own, without the help of my tutor, and the result, as this article points out, is that it’s very hard to become a proficient user of this software. The problem is that there were too many expectations on this side, and in all likelihood I’m not going to be sufficiently ready. Now that I think back, I should have chosen the second one; it would have enabled me to become more at ease with SAS, which is quite a mandatory skill in the job market nowadays. I’m very pessimistic about my chances of getting a job with such an experience. I wish I’d read this article before choosing.
Thank you however for this reading,
Salim
Hi Salim,
Sorry you’re having such a hard time. SAS is a nice, comfortable, padded cell. With only one type of data structure (the data set), just a couple of ways to select variables, and procs that accept all variables regardless of how you selected them, it avoids many of R’s headaches. Many people don’t need the extra power that R provides. Every one of us is good at some things and not so good at others. R requires much more programming skill, which is one of the reasons people still pay SAS Institute so much money for something that could be free.
Good luck,
Bob
This is a great article. I expected to see a list of 3-5 items, followed by short descriptions. Having read other articles on your site, I should have known better!
You list many more issues than I had thought of, but I think something that isn’t as easy to nail down in a bullet is that how a user needs to “think” about R is quite different from how someone thinks about other analytical packages. I say this as someone with a decade of experience using Stata, SPSS, Excel, SAS, and now R. I don’t have a formal background in math or computer science, and I have come to see this as the biggest issue for me in learning R.
The problem isn’t so much that the syntax is substantially different, although it can be; it’s that (if you ask me) the way it’s thought about and organized is substantially different.
When I write a script to automate a task in, say, Stata, I think of it in terms of a series of commands. The idea of functions, vectors, expressions, arguments, evaluation…that never enters my head. I realize that some or all of that is actually happening in the background, but the documentation doesn’t present it that way, so I have never, in ten-plus years, thought about it that way.
Given that R is a language more so than other packages are, the “why” behind thinking about it this way makes sense, but those of us without the math/CSci background have never seen it presented that way. And because it’s free, there is no incentive for anyone to dumb it down enough for someone without an analytical background to jump in and use it. This has become more apparent recently as more and more people who use pay-to-play analytical tools for tasks with a difficulty of 4/10 start using R because of its cost and flexibility. It’s not just for high-level users anymore.
Great article, as I said. I just wanted to add that part, because I so often see people in forums talking about how R isn’t that hard to learn, and it turns out they often have the programming background that, as you point out at the beginning, means their thoughts and answers aren’t aimed at beginners.
Thank you again for the great article.
Hi Adam,
Thanks for your very interesting comments. You bring up a point that is similar to the difference between recognition and recall. It’s relatively easy to recognize the correct answer on a multiple-choice test, but much harder to recall it from scratch for an essay exam. Applying this to software, the menu-based packages depend almost exclusively on recognition, making them easy to use (though difficult to automate). When using commercial packages like SAS or Stata (or the SPSS syntax), you have to recall just a few commands, the equivalent of “do regression” for example, and each disgorges what R aficionados deride: everything related to the model, whether you wanted to see it or not. While it may include more than you wanted to see, it puts you directly into a situation where recognition is all you need. R’s piecemeal approach to commands means you’re much more dependent upon recall, making it inherently harder to learn. However, R’s focus is always on enabling further analysis with the results, so the benefit from this perspective is that each output is easier to use as input to the next step, something that benefits developers. People looking only to use methods programmed by others (the great majority) then benefit from the vast array of packages that developers make available.
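As a quick illustration of that piecemeal style (a minimal sketch using R’s built-in mtcars data, not an example from the article itself):

```
m <- lm(mpg ~ wt, data = mtcars)  # fit the model once

summary(m)  # one command for the parameter estimates...
anova(m)    # ...another for the ANOVA table...
coef(m)     # ...another for just the coefficients

# ...and each result is an object you can feed into the next step:
predict(m, newdata = data.frame(wt = 3))
```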
That’s an important point that I missed completely in the original version. I’ve just added it. Thanks for the inspiration!
Cheers,
Bob
I only started learning and using R six months ago, and even as a novice user I’ve noticed and have become concerned with what feels like a splitting language, as you mentioned above.
Could you speak more to this in future posts? Or if you know of other places where this discussion has happened?
Hi Austin,
You’re probably the first person to read that section since I wrote it the day before yesterday and haven’t blogged about it yet. That’ll be my next step and I’m sure there will be plenty of discussion about it. I haven’t seen any Internet discussion of the issue, I’ve just been talking about it with my colleague, Josh Price.
Cheers,
Bob
Hi Bob,
I can’t believe that in the 4+ years that I’ve been learning R I haven’t come across this page before. It makes me feel so much less “dumb”! I’ve been taking a data science course on Coursera that uses R and is taught by 3 different professors. Each has their own favorite set of tools to do the same data munging, which has led me to wonder whether I am learning data science or whether I need to learn the many flavors of R in order to learn concepts in data science. This post helps me understand my sense of being lost in the course whenever a new concept doesn’t build on a previously presented R toolset but instead introduces an entirely different toolset that is the favorite of that instructor.
Thanks again and I will bookmark this page for the future.
Cheers,
Paul
Hi Paul,
That sounds like quite a challenge! I’ll bet the data.table package is used by one of those professors. I thought about mentioning that one, since it too offers wonderful data wrangling abilities. Its functions work much more like extensions to base R than like a whole new language.
I’ve learned a LOT of data science tools throughout my 35-year career, and most of them seemed quite easy to pick up. Learning R, however, was full of frustrations. Now that I’m comfortable with it, I really like it, but learning it was tough. I frequently see people writing that learning R is easy. I can only assume that 1) they don’t know it well enough yet to be aware of all its inconsistencies, 2) they know it well but are repressing their early traumatic memories (kidding, but only slightly), or 3) they never learned a tool that is easier to use.
Cheers,
Bob
Thank you for this. I’m new to programming, not just R, and this definitely makes me feel like my confusion is somewhat justified.
Do you happen to have any recommendations for tutorials or other resources that could help me get a better handle on the data structures and variable types? Maybe it’s because I’ve only used data frames so far, but I still don’t feel like I have a good handle on the differences between data frames, data tables, tibbles, matrices, etc. (and when you might need or want to use one over another). And as for variable types, I understand numeric (int or dbl), and I think I get factors, but I don’t know when I might want a number to be a factor or how best to work with it once it is. Lastly, the “simplified” examples in vignettes or other help are often written as vectors (and I understand why), but I don’t always understand how that translates to data frames and how I should manipulate variables within them to achieve results like the examples.
As you mentioned in your article, one of the challenges for a new R user like me is trying to get the most out of the help files. I’d really appreciate any recommendations for resources that use more natural language to teach how these things work.
TIA,
Ken
Hi Ken,
I can’t help but be biased on this question, but my books, R for SAS and SPSS Users and R for Stata Users, start at the very beginning and build slowly until you have all the fundamentals of base R down pat. For example, many books show how to recode a variable, but don’t extend that to copying the originals, giving them new names, then recoding them and (optionally) adding them back to the original data set. I try to provide all the steps that you need in everyday use. Despite their names, you don’t need to know any other packages to read them; each section just starts by describing very briefly how the other languages work so you’re more aware of how R differs.
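In base R, that full workflow might look something like this (a minimal sketch; `mydata`, `q1`, and the code of 9 for “no answer” are hypothetical, not taken from the book):

```
mydata <- data.frame(q1 = c(1, 2, 9, 3))  # made-up data where 9 means "no answer"

q1.recoded <- mydata$q1             # copy the original under a new name
q1.recoded[q1.recoded == 9] <- NA   # recode it: treat 9 as missing
mydata$q1.recoded <- q1.recoded     # optionally add it back to the original data set
```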
As far as video training goes, DataCamp.com has a wonderful set of workshops that cover both base R and the Tidyverse perspective. Their cost is quite low, I think $9/$25 for students/non-students.
Cheers,
Bob
Thank you, Bob! I appreciate the response.
I did not know about your book (I only just found your site today), or at least I would likely have passed over it because of the SAS/SPSS references (as I mentioned, I’m new to programming, so my intent to focus on R alone would have led me away). I am getting mixed signals between the reviews on your site and those on Amazon regarding how well this book suits a new programmer, as opposed to someone coming from SAS or SPSS. I am concerned that the comparisons will end up causing more confusion.
Do you have any thoughts on that?
Hi Ken,
The section on Renaming variables is a good example. I spend 6 pages on the various ways to rename variables (and sometimes observations) in R, and the only things I say about SAS and SPSS are at the beginning of those pages:
10.6 Renaming Variables (and Observations)
In SAS and SPSS, you do not know where variable names are stored or how. You just know they are in the data set somewhere. Renaming is simply a matter of matching the new name to the old name with a RENAME statement. [End of SAS/SPSS discussion!] In R however, both row and column names are stored in attributes – essentially character vectors – within data objects. In essence, they are just another form of data that you can manipulate…
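To make that concrete, here is a minimal sketch of what “just another form of data” means in practice (`mydata` and its columns are hypothetical examples, not the book’s):

```
mydata <- data.frame(gender = c("m", "f"), score = c(10, 20))

names(mydata)                                         # a plain character vector
names(mydata)[names(mydata) == "gender"] <- "Gender"  # edit it to rename a column
rownames(mydata) <- c("id1", "id2")                   # row names work the same way
```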
I could almost have named the book, “Intro to R with some comparisons to all other data science packages”. Each chapter does end with the code for R, SAS, and SPSS but the discussion is very focused on R. The SAS/SPSS book is more up to date than the Stata one, but otherwise they’re extremely similar.
Cheers,
Bob
Very helpful, thank you again!
Thanks for this post! I reference your blog a lot in what I write. You have some excellent commentary on R that I can’t find elsewhere. I really appreciate your well-researched and well-worded posts. Thank you!
Hi Monika,
Thanks for writing. You motivate me to keep going!
Cheers,
Bob
Thank you so much for your article. I tend to think R is some strange perversion. People who are very good at SPSS and SAS at the command-language level would seem to get very frustrated by R. I programmed a lot at one time; I used punch cards, actually. I have been using SPSS and SAS for 35 years. I don’t have the time to waste working through all of the R commands when I know there are so many shortcuts through the SPSS package, which is extremely modifiable. And I can rely on others to develop R plugins for me to use. Can you convince a person like me that I should use R, and convince me it isn’t just so I can be a computer bro?
SPSS Slim Shady,
If you’re happy with SPSS, and don’t mind paying for it, stick with it! Switching to R would be a major effort for you since it’s so radically different. I love the freedom to earn consulting & training income using R (two kids in college); my university license for SPSS forbids such use. I could get a commercial license for SPSS, but they’re pretty expensive. I also prefer the R language now, but it took me years of using it to get as fast as I am with SAS & SPSS.
You might want to download a copy of KNIME to try out. It’s like a free and open source version of SPSS Modeler. Powerful, easy to learn, and fun to play with.
Ah, the good ol’ punch cards days! Looking back I’m not quite sure why I enjoyed computing so much back then. It was such a pain!
Cheers,
Bob
I use R just for occasional analysis, so I’ve had to learn the idiosyncrasies the hard way. Your write-up would have been a big help! Learning Lisp strengthened my grasp of the functional paradigm enough that I realized writing helper functions to abstract away the idiosyncrasies was a big help. I also came from APL, where vector operations were a given, so that background made the idea of apply functions easy. My point is that beginners need to understand that R is a functional language with vectors as major concepts to embrace. Thanks for the educational insights!
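For readers who haven’t seen that trick, a minimal sketch of what such a helper might look like (the function name and data are made-up examples, not Steve’s code):

```
mean_narm <- function(x) mean(x, na.rm = TRUE)  # helper that always drops NAs

mydata <- data.frame(a = c(1, 2, NA), b = c(4, NA, 6))
sapply(mydata, mean_narm)  # apply-style: one call summarizes every column
```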
Hi Steve,
APL was one of my favorites years ago. I had to stick a character generator chip in my PC to see the odd symbols!
Cheers,
Bob
I’ve been learning R over the past 7 months on my own, and it’s not like any program I’ve used before. I’ve put in too many hours and sleepless nights learning it, stumbling over basic concepts while paradoxically being able to run more complex pieces of code, all in an effort to learn while meeting analysis deadlines.
I’ve been particularly overwhelmed with getting some code to work tonight, so I decided to google how R compares to other stat programs, and came across this article. What a great read! I am bookmarking it because of some really helpful code.
I definitely can relate to these points! It’s really nice to read that others are having similar experiences. I didn’t realize R was considered “harder” to learn than some other software, or that it’s really fundamentally a language in itself! I thought that because other stat programs have a command line, writing syntax was effectively like writing code in R. Now I’m realizing that is not the case at all.
Knowing that other people are also struggling, or find some aspects of R downright confusing, is actually quite validating.
Thank you for the post; it’s helped me feel less overwhelmed by the looming analysis deadlines!
Hi Chrisy,
I’m glad you found the article helpful. I think many SAS or SPSS users know fewer than 20 commands; that’s enough to get basic analyses done. But as you have no doubt found, it takes far more than that to use R. If the tidyverse had completely replaced the older commands, that would have greatly simplified things, but you have to know the older commands too. R Markdown has added yet another layer of complexity. I love R, but it sure can drive me crazy at times!
Cheers,
Bob
This site rocks! But a textbook I found very helpful is the excellent “R in Action: Data Analysis and Graphics with R” by Robert I. Kabacoff. Work through that for a structured approach, then buy Bob’s book.