Why R is Hard to Learn

The open source R software for analytics has a reputation for being hard to learn. It certainly can be, especially for people who are already familiar with similar packages such as SAS, SPSS or Stata. Training and documentation that leverage their existing knowledge and point out where that knowledge is likely to mislead them can save much frustration. This is the approach used in my books, R for SAS and SPSS Users and R for Stata Users, as well as the workshops that are based on them.

Here is a list of complaints about R that I commonly hear from people learning it. In the comments section below, I’d like to hear about things that drive you crazy about R.

Misleading Function or Parameter Names (data=, sort, if)

The most difficult time people have learning R is when functions don’t do the “obvious” thing. For example when sorting data, SAS, SPSS and Stata users all use commands appropriately named “sort.” Turning to R they look for such a command and, sure enough, there’s one named exactly that. However, it does not sort data sets! Instead it sorts individual variables, which is often a very dangerous thing to do. In R, the “order” function sorts data sets and it does so in a somewhat convoluted way. However there are add-on packages that have sorting functions that work just as SAS/SPSS/Stata users would expect.
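
Here is a minimal sketch of the difference. The data frame and its names (mydata, id, score) are made up for illustration:

```r
# Hypothetical example data; the names are for illustration only.
mydata <- data.frame(id = c(3, 1, 2), score = c(90, 70, 80))

# sort() works on a single variable (a vector) and returns only that vector.
sorted_scores <- sort(mydata$score)

# Putting a sorted vector back into the data frame would be dangerous:
# the scores would no longer line up with their ids.

# order() returns the row positions that put the data in order. Using them
# as a row index sorts the whole data set at once.
sorted_data <- mydata[order(mydata$id), ]
print(sorted_data)
```

The call to order() only computes row positions; the square brackets do the actual rearranging, which is the "somewhat convoluted" part.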

Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed by R. Other packages require sorting before they can do three common tasks:

  1. Summarizing / aggregating data
  2. Repeating an analysis for each group (“by” or “split file” processing)
  3. Merging files by key variables

R does not need to sort files before any of these tasks! So while sorting is a very helpful thing to be able to do for other reasons, R does not require it for these common situations.
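
As a rough sketch, here are the three tasks run on deliberately unsorted data using built-in functions. The data sets and variable names are hypothetical:

```r
# Hypothetical unsorted data; all names are for illustration only.
scores <- data.frame(id    = c(4, 2, 3, 1),
                     group = c("b", "a", "b", "a"),
                     score = c(10, 20, 30, 40))
demo   <- data.frame(id = c(1, 3, 2, 4), age = c(31, 45, 28, 52))

# 1. Summarize by group; no prior sort is needed.
group_means <- aggregate(score ~ group, data = scores, FUN = mean)

# 2. Repeat an analysis for each group.
group_summaries <- by(scores$score, scores$group, summary)

# 3. Merge by a key variable; rows are matched on id, not on position.
merged <- merge(scores, demo, by = "id")

print(group_means)
print(merged)
```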

Nonstandard Output

R’s output is often quite sparse. For example, when doing crosstabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason for this is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a crosstabulation is as simple as barplot( table(a,b) ). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However there are add-on packages that provide more comprehensive output that is essentially identical to that provided by other packages.
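
The following sketch shows how the pieces can be built up one at a time with built-in functions. The vectors a and b hold made-up values:

```r
# Hypothetical categorical variables, for illustration only.
a <- c("x", "x", "y", "y", "y")
b <- c("m", "f", "m", "m", "f")

# table() gives counts only.
counts <- table(a, b)

# Because the result is an ordinary object, other functions can take it
# as input to add what other packages print by default.
cell_pct    <- prop.table(counts)              # cell proportions
row_pct     <- prop.table(counts, margin = 1)  # row proportions
with_totals <- addmargins(counts)              # marginal counts

print(with_totals)
```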

Too Many Commands

Other statistics packages have relatively few analysis commands, but each of them has many options to control its output. R’s approach is quite the opposite, which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, ANOVA table, and so on. However, when you create a model in R, one command (summary) will provide the parameter estimates while another (anova) provides the ANOVA table. There is even a command “coefficients” that gets only that part of the model. So there are more commands to learn but fewer options are needed for each.

R’s commands are also consistent, working across all the modeling types that they might apply to. For example the “predict” function works the same way for all types of models that might make predictions.
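
A short sketch of this style, using the mtcars data set that ships with R. The choice of variables is arbitrary and only for illustration:

```r
# Fit the model once and save it as an object.
fit <- lm(mpg ~ wt, data = mtcars)

# Then ask for each piece of output with a separate function.
model_summary <- summary(fit)       # parameter estimates, R-squared, etc.
anova_table   <- anova(fit)         # ANOVA table
estimates     <- coefficients(fit)  # just the coefficients

# predict() is called the same way for a different type of model.
new_car       <- data.frame(wt = 3)
predicted_mpg <- predict(fit, newdata = new_car)

logit_fit    <- glm(am ~ wt, data = mtcars, family = binomial)
predicted_am <- predict(logit_fit, newdata = new_car, type = "response")

print(estimates)
print(predicted_mpg)
```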

Sloppy Control of Variables

When I learned R, it came as quite a shock that in a single analysis you can include variables from multiple data sets. That usually requires that the observations be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they thought had observations in the same order, but were not! It’s always safer to merge by key variables (like ID) if possible. So by enabling such analyses R seems to be asking for disaster. I still recommend merging files when possible by key variables before doing an analysis.
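
To make the danger concrete, here is a small sketch with two hypothetical data sets that hold the same four people in different orders:

```r
# Hypothetical data sets; the second lists the same ids in reverse order.
first  <- data.frame(id = 1:4, x = c(1, 2, 3, 4))
second <- data.frame(id = c(4, 3, 2, 1), y = c(8, 6, 4, 2))

# R will combine variables from both data sets, matching them by position.
# Here the rows are misaligned, so the answer is wrong.
cor_by_position <- cor(first$x, second$y)

# Merging by the key variable first lines the observations up correctly.
both       <- merge(first, second, by = "id")
cor_by_key <- cor(both$x, both$y)

print(c(by_position = cor_by_position, by_key = cor_by_key))
```

In this made-up case the misaligned version reverses the sign of the correlation, and R gives no warning.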

So why does R allow this “sloppiness”? It does so because it provides very useful flexibility. For example, you might plot regression lines of variable X against variable Y for each of three groups on the same plot. Then you can add group labels directly onto the graph. This lets you avoid a legend that makes your readers look back and forth between the legend and lines. The label data would contain only three variables: the group labels and the coordinates at which you wish them to appear. That’s a data set of only 3 observations so merging that with the main data set makes little sense.

Loop-a-phobia

R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are so critical to applying the same function to multiple variables, this seems strange. R instead uses the “apply” family of functions. You tell R to apply the function to either rows or columns. It’s a mental adjustment to make, but the result is the same.
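
As a sketch, here is the same task done both ways on a made-up data frame: once with a loop, and once with sapply, a member of the apply family that works through the variables of a data frame:

```r
# Hypothetical data; the names are for illustration only.
scores <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 5, 6), q3 = c(7, 8, 9))

# The loop approach: step through the variables one at a time.
loop_means <- numeric(ncol(scores))
names(loop_means) <- names(scores)
for (v in names(scores)) {
  loop_means[v] <- mean(scores[[v]])
}

# The apply-family approach: apply mean to every variable in one call.
apply_means <- sapply(scores, mean)

print(loop_means)
print(apply_means)
```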

Functions That Act Like Procedures

Many other packages, including SAS, SPSS and Stata, have procedures or commands that do typical data analyses which go “down” through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions and those functions can do both. How does it get away with that? Functions may have a preference to go down rows or across columns but for many functions you can use the “apply” family of functions to force them to go in either direction. So it’s true that in R, functions act like procedures and functions. Coming from other software, that’s a wild new idea.
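
A minimal sketch of sending the same function in either direction with apply. The data frame is hypothetical:

```r
# Hypothetical test scores: one row per person, one column per question.
answers <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 5, 6), q3 = c(7, 8, 9))

# Margin 2 goes "down" each column, like a procedure: one mean per variable.
variable_means <- apply(answers, 2, mean)

# Margin 1 goes across each row, like a function in other packages:
# one mean per observation.
person_means <- apply(answers, 1, mean)

print(variable_means)
print(person_means)
```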

Naming and Renaming Variables is Way Too Complicated

Often when people learn how R names and renames its variables they, well, freak out. There are many ways to name and rename variables because R stores the names as a character variable. Think of all the ways you know how to fiddle with character variables and you’ll realize that if you could use them all to name or rename variables, you have way more flexibility than the other data analysis packages. However, how long did it take you to learn all those tricks? Probably quite a while! So until someone needs that much flexibility, I recommend simply using R to read variable names from the same source as you read the data. When you need to rename them, use an add-on package that will let you do so in a style that is similar to SAS, SPSS or Stata. An example is here. You can convert to R’s built-in approach when you need more flexibility.
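
For reference, here is a sketch of a few built-in ways to rename, all of which work because the names are just a character vector. The data and the new names are invented for illustration:

```r
# Hypothetical data; the names are for illustration only.
mydata <- data.frame(q1 = 1:3, q2 = 4:6, q3 = 7:9)

# The variable names are stored as an ordinary character vector.
old_names <- names(mydata)

# Rename by matching the old name.
names(mydata)[names(mydata) == "q2"] <- "satisfaction"

# Rename by position.
names(mydata)[3] <- "loyalty"

# Any character function can be used on the names, e.g. upper-casing them.
names(mydata) <- toupper(names(mydata))

print(names(mydata))
```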

Inability to Analyze Multiple Variables

One of the first functions beginners typically learn is mean(X). As you might guess, it gets the mean of the X variable’s values. That’s simple enough. It also seems likely that to get the mean of two variables, you would just enter mean(X, Y). However that’s wrong because functions in R typically accept only single objects. The solution is to put those two variables into a single object such as a data frame: mean( data.frame(x,y) ). So the generalization you need to make isn’t from one variable to multiple variables, but rather from one object (a variable) to another (a data set). Since other software packages are not object oriented, this is a mental adjustment people have to make when coming to R from other packages. (Note to R gurus: I could have used colMeans but it does not make this example as clear.)
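
One caution: in current versions of R, calling mean on a data frame no longer returns the column means (it returns NA with a warning). The sketch below shows two alternatives that still follow the "one object" idea, using made-up values:

```r
# Hypothetical variables, for illustration only.
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Combine the two variables into a single object.
both <- data.frame(x, y)

# colMeans() takes the whole data frame and returns one mean per variable.
means_a <- colMeans(both)

# sapply() applies mean() to each variable in the data frame in turn.
means_b <- sapply(both, mean)

print(means_a)
print(means_b)
```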

Poor Ability to Select Variable Sets

Most data analysis packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z). R generally lacks this useful ability. It does have a “subset” function that allows the form A:Z, but that form works only in that function. There are various workarounds for this problem, but most do seem rather convoluted compared to other software. Nothing’s perfect!
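
Here is a sketch of the subset form next to one of the workarounds, which looks up the column positions by name. The data frame and the helper names (first, last) are hypothetical:

```r
# Hypothetical data; the names are for illustration only.
mydata <- data.frame(id = 1:2, a = 3:4, b = 5:6, c = 7:8, d = 9:10)

# Inside subset(), the form a:c selects adjacent variables by name.
picked_subset <- subset(mydata, select = a:c)

# Elsewhere, one workaround is to find the positions of the first and last
# variables and then select that range of column numbers.
first <- which(names(mydata) == "a")
last  <- which(names(mydata) == "c")
picked_index <- mydata[, first:last]

print(names(picked_subset))
print(names(picked_index))
```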

Too Much Complexity

People complain that R has too much complexity overall compared to other software. This comes from the fact that you can start learning software like SAS and SPSS with relatively few commands: the basic ones to read and analyze data. However when you start to become more productive you then have to learn whole new languages! To help reduce repetition in your programs you’ll need to learn the macro language. To use the output from one procedure in another, you’ll need to learn an output management system like SAS ODS or SPSS OMS. To add new capabilities you need to learn a matrix language like SAS IML, SPSS Matrix or Stata Mata. Each of these languages has its own commands and rules. There are also steps for transferring data or parameters from one language to another. R has no need for that added complexity because it integrates all these capabilities into R itself. So it’s true that beginners have to see more complexity in R. However, as they learn more about R, they begin to realize that there is actually less complexity and more power in R!

Lack of Graphical User Interface (GUI)

Like most other packages, R’s full power is only accessible through programming. However, unlike the others, it does not offer a standard GUI to help non-programmers do analyses. The two which are most like SAS, SPSS and Stata are R Commander and Deducer. While they offer enough analytic methods to make it through an undergraduate degree in statistics, they lack control when compared to a powerful GUI such as those used by SPSS or JMP. Worse, beginners must initially see a programming environment and then figure out how to find, install, and activate either GUI. Given that GUIs are aimed at people with fewer computer skills, this is a problem.

Conclusion

Most of the issues described above are misunderstandings caused by expecting R to work like other software that the person already knows. What examples like this have you come across?

Acknowledgements

Thanks to Patrick Burns and Tal Galili for their suggestions that improved this post.

About Bob Muenchen

I help researchers analyze their data and write books about research computing.

36 Responses to Why R is Hard to Learn

  1. Kevin Wright says:

    Function names in R are an inconsistent mess:
    row.names, rownames
    rownames, rowMeans, rowSums, rowsum
    browseURL, contrib.url, fixup.package.URLs
    package.contents, packageStatus
    mahalanobis, TukeyHSD
    getMethod, getS3method
    read.csv and write.csv
    load and save
    readRDS and saveRDS
    Sys.time, system.time (Help pages do not even crossreference!)
    cumsum, colSums

    aggregate(.., FUN, …)
    acast(…, fun.aggregate, …)

    • Bob Muenchen says:

      My personal favorite along these lines is the built-in reshape function, the Hmisc reShape function and the reshape package. They all do similar things of course and the add-ons improve upon the built-in function. So it’s too bad there’s confusion, but the reshape package provides awesome reshaping and aggregating capabilities that are better than any other package that I’m aware of. Without the freedom for people to contribute their own variations, we wouldn’t have that power in R. The R Core guys could make things a bit more consistent for the built-in functions though.

  2. Robert Young says:

    You danced around the motivating difficulty (or benefit, depends on one’s point of view): R is both a stat pack command language and an application programming language, with a single (?) syntax. For most working stat users, the command approach of BMD/SAS/PSTAT/SPSS/etc. is more natural and sufficient. If they had wanted to be programmers, they would have signed up for a CS curriculum. For those coming from a primary occupation as coder, R induces a gag reflex (claims of OO-ness and functional-ness are mostly hollow). I started as an econometrician and spent the last decade in coding, and I certainly see the oddities as language. That R is the product of stats rather than coders (SAS/SPSS at least) directed by stats is one of its hallmark criticisms.

    I dove into R a couple of years ago, and I’ve collected lots of texts on R and various distinct parts of it, and none are structured around this dichotomy (Maindonald gets the closest). I’ve not read your crossover texts though, since my SAS/SPSS days are too long ago to matter.

    Were I to write a book “Intro to R” (I’m not!), it would skip the usual order where we get programming syntax for the first 1/3 to 1/2 of the text, then some stats examples. Go straight for the “cookbook” of each major analysis area, thence on to the “programming R” bits. This would be perfect for e-books: the cookbook would be specialized to biostats or econometricians or psychometricians, et al; followed by the programming bits.

  3. peterflom says:

    These are some of the reasons. Two more: Incomprehensible error messages and help files

    • Bob Muenchen says:

      Here’s the first line of the help file for the simplest R command: print.

      “print prints its argument and returns it invisibly (via invisible(x)).”

      When I first read that, I was totally lost. Does it print? Or is the output invisible? You really need some training in R before the help files make sense.

      • Kenn says:

        well but it’s inevitable that you must have some background knowledge to understand a technical document. The function of help files in R is to provide a “comprehensive” description but this does not always mean “comprehensible” (easily understandable). You can’t be a complete beginner if you are looking what `print` does — after all, it is in most cases used implicitly. You don’t do print(3+2) but just 3+2. The help files for par or points are difficult but very useful but these are definitely not the places to start learning how to do graphics with R.

        For beginners, starting with help files is pretty hopeless, but there are plenty of tutorials and books available for beginners.

  4. Tom says:

    Excellent!
    I have a hunch that more and more R newbies are not coming from SAS, STATA, or SPSS, but from Excel. Have you found this to be true in your trainings? If so, what hiccups do they most commonly face?

  5. nelson says:

    Factors tend to be confusing (relabeling, reordering, converting to numbers, characters, etc.) and it is not clear why some “numeric” vectors become factors without warning. Also, converting a list to a data.frame or matrix can be cumbersome when there are more than two list levels. Likewise, extracting a given element of a given list level for summary purposes is sometimes almost impossible.

    • Bob Muenchen says:

      Factors and their value labels can indeed be confusing. However I’ve also helped some very confused SAS users trying to figure out the relationship between PROC FORMAT and FORMAT, especially when creating a common format library that they’re sharing among data sets. And there’s also the fun of receiving a SAS data set that has custom formats assigned but not the formats themselves. By comparison, SPSS’s labeling system is quite simple and clear.

  6. efrique says:

    Your example for getting means of a data frame object is deprecated:


    > x <- rnorm(10); y <- rnorm(10)
    > mean( data.frame(x,y) )
    x y
    0.3689014 -0.2491133
    Warning message:
    mean() is deprecated.
    Use colMeans() or sapply(*, mean) instead.

    The problem is that a data-frame can have non-numeric data for which ‘mean’ is undefined. Since you’re taking a mean, your data is presumably numeric, in which case you can make it a matrix and use apply, or just call apply on the dataframe which will take care of that for you.


    > apply( data.frame(x,y),2,mean )
    x y
    0.3689014 -0.2491133

    ‘summary’ works pretty well on general data frames, handling numeric and non-numeric types gracefully.

  7. JS says:

    I think there is one thing you have missed out that I still have a problem with though I am quite experienced with R. That is finding out how to do something and when you have found out what to use the help is incomprehensible with difficult examples.

    I often “know” I can do something in R but I cannot remember the package/function name. After fruitless tries I consult the forums to get some ideas. I go to the help and look for a quick example but the example offered is often designed to show off the amazing complexity of the function. To understand the example I have to learn a whole new subject!

    First example in the help should be an example for dummies.

    • Bob Muenchen says:

      I agree. I know this sounds self-serving but I keep R for SAS and SPSS Users on my desk all the time. I have to use it to remind myself of how to do various tasks in all three languages. I’m driven crazy by using all three, often being able to recall the name of some command in one language when I need that ability in the other. The index includes the command name for all three languages for every entry. That’s why the index in the second edition got so long! For example I might need to use the “source” function in R but all I can remember is that SAS calls it “include”. Or vice versa.

    • @JS: When you can’t remember the package or function you need, you could also check one of the R reference cards. This one is my favorite (short but well-organized list of many useful functions):

      http://statmaster.sdu.dk/bent/courses/ST501-2011/Rcard.pdf

      And I agree that many of the help files need more basic examples!

      @r4stats: Thanks for all the great resources here and in your useR talk last week! I think they’ll be very useful as I gently try to “convert” some SAS users at my workplace :)

  8. Sebastian says:

    I personally agree with the statements here. I am a R beginner myself (used to use Clementine but the macro language there is not my taste).
    I just want to point out that R is sometimes inconsistent with the parameter values you are passing. Especially suppressing certain functionality is treated differently for different functions. Sometimes it’s required to use NA, NULL, sometimes “”, “o” or “n”. I always need to look up the help to be sure to use the right value. An example is lines() and title() with NULL and “o”; I always forget about the “o”. Maybe there is some strategy behind it but I haven’t found a proper source which explains it.

  9. hlson says:

    Reblogged this on Tự do.

  10. Randy Zwitch says:

    I’d like to add a frustration: that each package can contain functions with the same name, and the older one gets “masked”, leaving the new one intact. So the user has to not only 1) figure out what package he/she needs, but also 2) when in a program to turn off a package in order to use another.

    As a SAS guy trying to learn R, the “logic” of R frustrates me to no end. The so-called flexibility that is often mentioned as one of R’s strong points is just chaos in my opinion. Hopefully working through your book will help my understanding, but as of yet, trying to keep from doing things wrong in R has me focused too much on the language and not focused enough on doing theory-correct work!

    • Bob Muenchen says:

      I can certainly relate to that frustration. I experienced plenty of it when I started learning R. I even started compiling a list of SAS and R concepts that a person would need to know to be productive. I was convinced that the list for R would be twice as long as for SAS. Soon, both lists were quite long and of similar length. It made me realize that I had forgotten just how long it had taken me to get good at SAS. However I do think that you can get somewhat productive knowing a smaller amount of SAS. Getting fully productive is a major effort with both.

  11. ftreu says:

    When I tried R and sought help from a user group I learned that you have to pretend to be a programming/math geek. If you just want to solve a problem in, say, social science, use survey data to find the answer and R to do the stats, beware. The welcome is rough, and newcomers with naive questions get grilled. Compare this to the SPSS list, where people that have not even opened a manual get polite and helpful answers, and you see the difference.

  18. I’m a bit late to this discussion, but another big hurdle for R is understanding how to get a “piece” of something bigger. Sometimes you use [ . . . ], and sometimes [ [ . . . ] ] and sometimes $ and sometimes @.

    • Bob Muenchen says:

      Dear Professor Mean,

      I love that name! I wholeheartedly agree. R users need to learn how to pick pieces out of objects, whether they be variables from a data set or coefficients from a model. This can be mind-boggling, especially when you’ve saved a set of models done by some grouping factor, which adds another layer to your output list. The first time I had to grab bits by group, it took me quite a while to get it right.

      Cheers,
      Bob

  21. Leo Godin says:

    Very late to the party, but I’m wondering if dplyr fixes some of the core language problems here. Certainly helps with ordering, subsetting and selecting data.

  22. Kimman says:

    I would say that “R is easy to learn but difficult to master” or “..but it takes time to master.”
