Why R is Hard to Learn

[An updated version of this article is here]

The open source R software for analytics has a reputation for being hard to learn. It certainly can be, especially for people who are already familiar with similar packages such as SAS, SPSS or Stata. Training and documentation that leverages their existing knowledge, and points out where that knowledge is likely to mislead them, can save much frustration. This is the approach used in my books, R for SAS and SPSS Users and R for Stata Users, as well as the workshops that are based on them.

Here is a list of complaints about R that I commonly hear from people learning it. In the comments section below, I’d like to hear about things that drive you crazy about R.

Misleading Function or Parameter Names (data=, sort, if)

The most difficult time people have learning R is when functions don’t do the “obvious” thing. For example, when sorting data, SAS, SPSS and Stata users all use commands appropriately named “sort.” Turning to R, they look for such a command and, sure enough, there’s one named exactly that. However, it does not sort data sets! Instead it sorts individual variables, which is often a very dangerous thing to do. In R, the “order” function sorts data sets, and it does so in a somewhat convoluted way. However, there are add-on packages with sorting functions that work just as SAS/SPSS/Stata users would expect.
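To see the difference concretely, here is a minimal sketch using an invented data frame named mydata:

    mydata <- data.frame(id = c(3, 1, 2), x = c(30, 10, 20))

    sort(mydata$id)              # sorts one variable only: 1 2 3
    mydata[order(mydata$id), ]   # sorts the entire data frame by id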

Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed by R. Other packages require sorting before they can do three common tasks:

  1. Summarizing / aggregating data
  2. Repeating an analysis for each group (“by” or “split file” processing)
  3. Merging files by key variables

R does not need to sort files before any of these tasks! So while sorting is a very helpful thing to be able to do for other reasons, R does not require it for these common situations.
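Here is a brief sketch of all three tasks running on unsorted, made-up data:

    mydata <- data.frame(id    = c(2, 3, 1),
                         group = c("b", "a", "b"),
                         x     = c(4, 6, 2))
    other  <- data.frame(id = c(1, 2, 3), y = c(10, 20, 30))

    aggregate(x ~ group, data = mydata, FUN = mean)  # 1. summarizing / aggregating
    by(mydata$x, mydata$group, mean)                 # 2. "by" processing per group
    merge(mydata, other, by = "id")                  # 3. merging by key variables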

Nonstandard Output

R’s output is often quite sparse. For example, when doing crosstabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason for this is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a crosstabulation is as simple as barplot( table(a,b) ). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However there are add-on packages that provide more comprehensive output that is essentially identical to that provided by other packages.
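For example, here is a sketch of how the bare counts from table feed further functions (a and b are invented factors):

    a <- factor(c("x", "x", "y", "y", "y"))
    b <- factor(c("u", "v", "u", "v", "v"))

    counts <- table(a, b)   # counts only
    prop.table(counts)      # cell proportions, computed from the same object
    barplot(counts)         # a bar plot, straight from the counts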

Too Many Commands

Other statistics packages have relatively few analysis commands, but each of them has many options to control its output. R’s approach is quite the opposite, which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, ANOVA table, and so on. However, when you create a model in R, one command (summary) will provide the parameter estimates while another (anova) provides the ANOVA table. There is even a command, “coefficients”, that gets only that part of the model. So there are more commands to learn, but fewer options are needed for each.

R’s commands are also consistent, working across all the modeling types that they might apply to. For example the “predict” function works the same way for all types of models that might make predictions.
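A minimal sketch using the mtcars data that ships with R shows the piecemeal style:

    fit <- lm(mpg ~ wt, data = mtcars)

    summary(fit)                                # parameter estimates and fit statistics
    anova(fit)                                  # the ANOVA table
    coefficients(fit)                           # just the coefficients
    predict(fit, newdata = data.frame(wt = 3))  # same call for any model type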

Sloppy Control of Variables

When I learned R, it came as quite a shock that in a single analysis you can include variables from multiple data sets. That usually requires that the observations be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they thought had observations in the same order, but did not! It’s always safer to merge by key variables (like ID) if possible. So by enabling such analyses, R seems to be asking for disaster. I still recommend merging files by key variables, when possible, before doing an analysis.

So why does R allow this “sloppiness”? It does so because it provides very useful flexibility. For example, you might plot regression lines of variable X against variable Y for each of three groups on the same plot. Then you can add group labels directly onto the graph. This lets you avoid a legend that makes your readers look back and forth between the legend and the lines. The label data would contain only three variables: the group labels and the coordinates at which you wish them to appear. That’s a data set of only 3 observations, so merging it with the main data set makes little sense.
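Here is a sketch of that idea with invented data: the points come from the main data set, while the labels come from a separate three-row data set that is never merged in:

    main <- data.frame(x = rep(1:5, 3),
                       y = c(1:5, 2 * (1:5), 3 * (1:5)),
                       g = rep(c("A", "B", "C"), each = 5))
    plot(main$x, main$y)

    labels <- data.frame(g = c("A", "B", "C"),
                         x = c(5, 5, 5),
                         y = c(4.5, 9.5, 14.5))
    text(labels$x, labels$y, as.character(labels$g))  # labels directly on the graph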

Loop-a-phobia

R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are so critical to applying the same function to multiple variables, this seems strange. R instead uses the “apply” family of functions. You tell R to apply the function to either rows or columns. It’s a mental adjustment to make, but the result is the same.
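For example, here is a sketch (with an invented data frame) of a loop and its loop-free equivalent:

    mydata <- data.frame(x = 1:5, y = 6:10)

    means <- numeric(ncol(mydata))   # the loop version
    for (i in seq_along(mydata)) means[i] <- mean(mydata[[i]])

    sapply(mydata, mean)             # the loop-free version, same result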

Functions That Act Like Procedures

Many other packages, including SAS, SPSS and Stata, have procedures or commands that do typical data analyses, going “down” through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions, and those functions can do both. How does it get away with that? Functions may have a preference to go down rows or across columns, but for many functions you can use the “apply” family of functions to force them to go in either direction. So it’s true that in R, functions act like both procedures and functions. Coming from other software, that’s a wild new idea.
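A sketch with an invented scores data frame makes the point: the same mean function can go down the data like a procedure or across it like a function:

    scores <- data.frame(test1 = c(80, 90, 70), test2 = c(60, 100, 80))

    apply(scores, 2, mean)   # down the observations: one mean per variable
    apply(scores, 1, mean)   # across the variables: one mean per observation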

Naming and Renaming Variables is Way Too Complicated

Often when people learn how R names and renames its variables they, well, freak out. There are many ways to name and rename variables because R stores the names as a character variable. Think of all the ways you know how to fiddle with character variables and you’ll realize that if you could use them all to name or rename variables, you have way more flexibility than the other data analysis packages. However, how long did it take you to learn all those tricks? Probably quite a while! So until someone needs that much flexibility, I recommend simply using R to read variable names from the same source as you read the data. When you need to rename them, use an add-on package that will let you do so in a style that is similar to SAS, SPSS or Stata. An example is here. You can convert to R’s built-in approach when you need more flexibility.
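As a sketch of the built-in approach (mydata is invented), note that the names really are just a character vector:

    mydata <- data.frame(q1 = 1:3, q2 = 4:6)

    names(mydata)                                  # "q1" "q2"
    names(mydata)[names(mydata) == "q1"] <- "age"  # rename one variable
    names(mydata) <- toupper(names(mydata))        # or transform them all at once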

Inability to Analyze Multiple Variables

One of the first functions beginners typically learn is mean(X). As you might guess, it gets the mean of the X variable’s values. That’s simple enough. It also seems likely that to get the mean of two variables, you would just enter mean(X, Y). However that’s wrong because functions in R typically accept only single objects. The solution is to put those two variables into a single object such as a data frame: mean( data.frame(x,y) ). So the generalization you need to make isn’t from one variable to multiple variables, but rather from one object (a variable) to another (a data set). Since other software packages are not object oriented, this is a mental adjustment people have to make when coming to R from other packages. (Note to R gurus: I could have used colMeans but it does not make this example as clear.)
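Here is a minimal sketch of that one-object rule. (Note that mean on a data frame has since been deprecated in R, so sapply is shown as the safer equivalent.)

    x <- c(1, 2, 3)
    y <- c(4, 5, 6)

    mean(x)                         # one variable: fine
    sapply(data.frame(x, y), mean)  # several variables, bundled into one object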

Poor Ability to Select Variable Sets

Most data analysis packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z). R generally lacks this useful ability. It does have a “subset” function that allows the form A:Z, but that form works only in that function. There are various work-arounds for this problem, but most seem rather convoluted compared to other software. Nothing’s perfect!
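For example, here is a sketch of the subset form with an invented data frame:

    mydata <- data.frame(a = 1, b = 2, c = 3, z = 26)

    subset(mydata, select = a:c)   # contiguous variables a through c
    mydata[ , c("a", "z")]         # the general base R form, by name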

Too Much Complexity

People complain that R has too much complexity overall compared to other software. This comes from the fact that you can start learning software like SAS and SPSS with relatively few commands: the basic ones to read and analyze data. However, when you start to become more productive you then have to learn whole new languages! To help reduce repetition in your programs you’ll need to learn the macro language. To use the output from one procedure in another, you’ll need to learn an output management system like SAS ODS or SPSS OMS. To add new capabilities you need to learn a matrix language like SAS IML, SPSS Matrix or Stata Mata. Each of these languages has its own commands and rules. There are also steps for transferring data or parameters from one language to another. R has no need for that added complexity because it integrates all these capabilities into R itself. So it’s true that beginners have to see more complexity in R. However, as they learn more about R, they begin to realize that there is actually less complexity and more power in R!

Lack of Graphical User Interface (GUI)

Like most other packages, R’s full power is only accessible through programming. However, unlike the others, it does not offer a standard GUI to help non-programmers do analyses. The two GUIs that are most like SAS, SPSS and Stata are R Commander and Deducer. While they offer enough analytic methods to make it through an undergraduate degree in statistics, they lack control when compared to a powerful GUI such as those used by SPSS or JMP. Worse, beginners must initially see a programming environment and then figure out how to find, install, and activate either GUI. Given that GUIs are aimed at people with fewer computer skills, this is a problem.

Conclusion

Most of the issues described above are misunderstandings caused by expecting R to work like other software that the person already knows. What examples like this have you come across?

Acknowledgements

Thanks to Patrick Burns and Tal Galili for their suggestions that improved this post.

47 thoughts on “Why R is Hard to Learn”

  1. Function names in R are an inconsistent mess:
    row.names, rownames
    rownames, rowMeans, rowSums, rowsum
    browseURL, contrib.url, fixup.package.URLs
    package.contents, packageStatus
    mahalanobis, TukeyHSD
    getMethod, getS3method
    read.csv and write.csv
    load and save
    readRDS and saveRDS
    Sys.time, system.time (Help pages do not even cross-reference!)
    cumsum, colSums

    aggregate(.., FUN, …)
    acast(…, fun.aggregate, …)

    1. My personal favorite along these lines is the built-in reshape function, the Hmisc reShape function and the reshape package. They all do similar things of course and the add-ons improve upon the built-in function. So it’s too bad there’s confusion, but the reshape package provides awesome reshaping and aggregating capabilities that are better than any other package that I’m aware of. Without the freedom for people to contribute their own variations, we wouldn’t have that power in R. The R Core guys could make things a bit more consistent for the built-in functions though.

  2. You danced around the motivating difficulty (or benefit, depending on one’s point of view): R is both a stat pack command language and an application programming language, with a single (?) syntax. For most working stat users, the command approach of BMD/SAS/PSTAT/SPSS/etc. is more natural and sufficient. If they had wanted to be programmers, they would have signed up for a CS curriculum. For those coming from a primary occupation as coder, R induces a gag reflex (claims of OO-ness and functional-ness are mostly hollow). I started as an econometrician and spent the last decade in coding, and I certainly see the oddities of the language. That R is the product of statisticians, rather than coders directed by statisticians (SAS/SPSS at least), is one of its hallmark criticisms.

    I dove into R a couple of years ago, and I’ve collected lots of texts on R and various distinct parts of it, and none are structured around this dichotomy (Maindonald gets the closest). I’ve not read your crossover texts though, since my SAS/SPSS days are too long ago to matter.

    Were I to write a book “Intro to R” (I’m not!), it would skip the usual order where we get programming syntax for the first 1/3 to 1/2 of the text, then some stats examples. Go straight for the “cookbook” of each major analysis area, thence on to the “programming R” bits. This would be perfect for e-books: the cookbook would be specialized to biostats or econometricians or psychometricians, et al; followed by the programming bits.

    1. Here’s the first line of the help file for the simplest R command: print.

      “print prints its argument and returns it invisibly (via invisible(x)).”

      When I first read that, I was totally lost. Does it print? Or is the output invisible? You really need some training in R before the help files make sense.

      1. Well, but it’s inevitable that you must have some background knowledge to understand a technical document. The function of help files in R is to provide a “comprehensive” description, but this does not always mean “comprehensible” (easily understandable). You can’t be a complete beginner if you are looking up what `print` does — after all, it is in most cases used implicitly. You don’t do print(3+2), but just 3+2. The help files for par or points are difficult but very useful; these are definitely not the places to start learning how to do graphics with R.

        For beginners, starting with the help files is pretty hopeless, but there are plenty of tutorials and books available for beginners.

  3. Excellent!
    I have a hunch that more and more R newbies are not coming from SAS, Stata, or SPSS, but from Excel. Have you found this to be true in your trainings? If so, what hiccups do they most commonly face?

  4. Factors tend to be confusing (relabeling, reordering, converting to numbers, characters, etc.) and it is not clear why some “numeric” vectors become factors without warning. Also, converting a list to a data.frame or matrix can be cumbersome when there are more than two list levels. Likewise, extracting a given element of a given list level for summary purposes is sometimes almost impossible.

    1. Factors and their value labels can indeed be confusing. However I’ve also helped some very confused SAS users trying to figure out the relationship between PROC FORMAT and FORMAT, especially when creating a common format library that they’re sharing among data sets. And there’s also the fun of receiving a SAS data set that has custom formats assigned but not the formats themselves. By comparison, SPSS’s labeling system is quite simple and clear.

  5. Your example for getting means of a data frame object is deprecated:


    > x <- rnorm(10); y <- rnorm(10)
    > mean( data.frame(x,y) )
    x y
    0.3689014 -0.2491133
    Warning message:
    mean() is deprecated.
    Use colMeans() or sapply(*, mean) instead.

    The problem is that a data-frame can have non-numeric data for which ‘mean’ is undefined. Since you’re taking a mean, your data is presumably numeric, in which case you can make it a matrix and use apply, or just call apply on the dataframe which will take care of that for you.


    > apply( data.frame(x,y),2,mean )
    x y
    0.3689014 -0.2491133

    ‘summary’ works pretty well on general data frames, handling numeric and non-numeric types gracefully.

    1. I mentioned that colMeans was more appropriate for the data frame but it doesn’t work on a vector. Your suggestion of using summary is a good one. I wish I had written it using that instead!

  6. I think there is one thing you have missed that I still have a problem with, though I am quite experienced with R: finding out how to do something, and then, when you have found out what to use, facing help that is incomprehensible, with difficult examples.

    I often “know” I can do something in R but I cannot remember the package/function name. After fruitless tries I consult the forums to get some ideas. I go to the help and look for a quick example but the example offered is often designed to show off the amazing complexity of the function. To understand the example I have to learn a whole new subject!

    First example in the help should be an example for dummies.

    1. I agree. I know this sounds self-serving but I keep R for SAS and SPSS Users on my desk all the time. I have to use it to remind myself of how to do various tasks in all three languages. I’m driven crazy by using all three, often being able to recall the name of some command in one language when I need that ability in the other. The index includes the command name for all three languages for every entry. That’s why the index in the second edition got so long! For example I might need to use the “source” function in R but all I can remember is that SAS calls it “include”. Or vice versa.

    2. @JS: When you can’t remember the package or function you need, you could also check one of the R reference cards. This one is my favorite (short but well-organized list of many useful functions):
      http://statmaster.sdu.dk/bent/courses/ST501-2011/Rcard.pdf
      And I agree that many of the help files need more basic examples!

      @r4stats: Thanks for all the great resources here and in your useR talk last week! I think they’ll be very useful as I gently try to “convert” some SAS users at my workplace 🙂

  7. I personally agree with the statements here. I am an R beginner myself (I used to use Clementine, but the macro language there is not to my taste).
    I just want to point out that R is sometimes inconsistent with the parameter values you pass. Especially, suppressing certain functionality is treated differently by different functions. Sometimes it requires NA or NULL, sometimes “”, “o” or “n”. I always need to look up the help to be sure I’m using the right value. Examples are lines() and title(), with NULL and “o” respectively; I always forget about the “o”. Maybe there is some strategy behind it, but I haven’t found a proper source that explains it.

  8. I’d like to add a frustration: each package can contain functions with the same name, and the older one gets “masked”, leaving the new one intact. So the user has to not only 1) figure out what package he/she needs, but also 2) figure out when in a program to turn off a package in order to use another.

    As a SAS guy trying to learn R, the “logic” of R frustrates me to no end. The so-called flexibility that is often mentioned as one of R’s strong points is just chaos in my opinion. Hopefully working through your book will help my understanding, but as of yet, trying to keep from doing things wrong in R has me focused too much on the language and not focused enough on doing theory-correct work!

    1. I can certainly relate to that frustration. I experienced plenty of it when I started learning R. I even started compiling a list of SAS and R concepts that a person would need to know to be productive. I was convinced that the list for R would be twice as long as for SAS. Soon, both lists were quite long and of similar length. It made me realize that I had forgotten just how long it had taken me to get good at SAS. However I do think that you can get somewhat productive knowing a smaller amount of SAS. Getting fully productive is a major effort with both.

  9. When I tried R and sought help from a user group, I learned that you have to pretend to be a programming/math geek. If you just want to solve a problem in, say, social science, using survey data to find the answer and R to do the stats, beware. The welcome is rough, and newcomers with naive questions get grilled. Compare this to the SPSS list, where people who have not even opened a manual get polite and helpful answers, and you see the difference.

    1. Dear Professor Mean,

      I love that name! I wholeheartedly agree. R users need to learn how to pick pieces out of objects, whether they be variables from a data set or coefficients from a model. This can be mind-boggling, especially when you’ve saved a set of models done by some grouping factor, which adds another layer to your output list. The first time I had to grab bits by group, it took me quite a while to get it right.

      Cheers,
      Bob

  11. Very late to the party, but I’m wondering if dplyr fixes some of the core language problems here. Certainly helps with ordering, subsetting and selecting data.

  12. The party is likely over but thought I should add a few of observations of my own to the discussion:

    – R expects a lot from new users, namely the ability to think from scratch about the logical flow of a statistical analysis, to implement that analysis step by step, and to know exactly what kind of R output should be reported, how it should be extracted from R and how it should be interpreted. This requires a complete reorganization of thinking processes and a level of confidence and comfort with statistics that many new users may lack.

    – R is always one step removed from the data one is working with. We need a better command than the View() command to view the data spreadsheet in a way that is standard in SPSS, say. It also makes it more difficult to add labels to variables and to incorporate those labels into plots and tables. (Hmisc has a label() function that is helpful. But, since R was developed by statisticians, shouldn’t base R come with such a function?)

    – R is like a language with innumerable dialects (i.e., packages). Many dialects sound more like completely novel languages (e.g., ggplot2 versus lattice) and require users to do a lot of learning upfront in order to become fluent. The fact that dialects sometimes conflict with each other doesn’t help.

    – R is also like a quilt which includes a series of often disconnected patches. This can result in many unanticipated side effects when writing R code, some of which could take hours or days to resolve. Packages may not communicate well with each other, functions may not react as expected when embedded in other functions, etc.

    – After all these years of R development efforts, I believe the lm() function still reports the R-squared for a simple linear regression as being “multiple R-squared”, which can be confusing for new users. I am not aware of other ready examples where the statistical message conveyed by R in its output is misleading, but there may be other similar examples out there.

    – Speaking of the lm() function and its relatives like glm(), I think it would be nice if users could see a message posted immediately after the summary of a model explaining that, for qualitative variables, by default R treats the first level of each variable as a reference level and then compares all other levels against it in terms of the mean response, controlling for all other explanatory variables in the model.

    – R users should have the ability to rank R packages according to their usefulness, ease of documentation and value provided. This would enable other users to discard packages which are not very user friendly and opt instead for packages which have been well received by the R user community.

    The list goes on, but I should stop here.

    1. Hi Isabella,

      You’ve added some good ones! I’ve used a lot of stat packages, and R is the only one I’ve seen that lacks variable labels. As you point out, some packages include them, but I agree that the feature should be in the main package. The lack of a good data editor I also find odd. The fix() function does it, but it’s awfully primitive. Regarding the selection of quality packages, crantastic.org can help. Unfortunately it’s not used by too many people. The site Rdocumentation.org lists the top 10 packages on the right side of the page. I’ve asked its developers to extend that so people could regularly see the top 100 or so most popular packages. That would be a big help.

      Cheers,
      Bob

    2. Great comments. I’d like to add a few suggested fixes.

      1) There’s no real “R way” of doing things. We’re more like Perl’s TMTOWTDI (there’s more than one way to do it). Packages like dplyr are fixing that. With dplyr, you have an “R way” to do something. That’s a great start.
      2) For statistical research, I think R is great. Statisticians know how to interpret the results once they learn enough. For programmers who are learning statistics, like me, more user-friendly results would help.
      3) I’d like to use R for more than simply analyzing data. I’d like to use it in production, inside my apps. While there are ways to do this, OpenCPU for example, there isn’t a great workflow in R for creating production applications. This is where I’d like to see frameworks written in R that are extensible and easily testable. Testing data analysis isn’t easy, and we need help with that.
      4) Finally, I think of Ruby. Without Ruby on Rails, few would use Ruby. I’d like to see some of the major use cases treat R the way Rails treats Ruby. Create a DSL for data management (we have that with dplyr), data analysis and data presentation. We’ve made great strides with Shiny, RMarkdown and the like, but we’re not quite there yet.

  13. I don’t see the huge divide between “programming” and “computing” that others see. Of course, the student programming that I see is all problem solving, not rote exercises. Students can fail at the programming in many different ways (poorly chosen data structures, poorly chosen program organization, poorly chosen algorithms, incorrect coding of algorithms, bad documentation, non-idiomatic use of the programming language, …). I see very high correlations between these different failure modes—I don’t have large enough samples to do factor analysis, but I suspect that I would see primarily one factor for programming skill, with a minor second factor for completeness of documentation. I doubt very much that I’d be able to associate one factor with “programming” and another with “computing”.

  14. Then, lucky me R was my first programming language! That happened because I completely understood the first R code snippet I saw before knowing R ever existed.

    I came to programming from Physics, Mathematics and Statistics, and I find nothing strange – or out of place – with R! Well, maybe the environment stack, but one can live without it unless they decide to take a developer dive into R.

    By-the-way, regarding the “mean” example shown above:

    mean(X, Y) could be very well interpreted as (X+Y)/2. Nothing tells me in this formula that I should calculate the individual means of X and Y! Nothing.

    This confusion is hard to make if X and Y are known to form some sort of container, like a data.frame or matrix for example.

    Also about selecting variables in a data.table:

    it makes full sense to state the selection criterion. Selecting columns A:D from the data.frame df may overlook that.

    Instead,

    characterVars <- colnames(df)[sapply(df, class) == 'character']

    shows explicitly that selection of these variables was based on variable's type (here "character").

    I consider dplyr and tidyverse (with its pipe operator and its “mutate” and “transmute” weirdos) attempts at a hostile takeover by the “slick” and somehow vapid (compared to R) Python.

    My suggestion is to stick with data.table instead. There is way more to gain.

    Great article, thank you!

    1. Hi MoreJobsinR,

      While I don’t use data.table, I’ve heard nothing but good things about it. It’s fast and it does feel much more like base R than do the tidyverse commands.

      Having come from years of SAS use, the tidyverse feels right to me.

      Cheers,
      Bob

    1. Hi Dragos,

      Yep, R is used for much more than stats. But being a statistician, it’s my main interest.

      Cheers,
      Bob
