*by Robert A. Muenchen*

I’m slowly gathering all the data needed to update my ongoing article, The Popularity of Data Analysis Software. The section below is the latest installment.

**Growth in Capability**

The capability of all the software in this article has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.

Figure 10 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.995). Early version numbers of R increased by 0.10, while more recent ones increased by 0.01. To make the x-axis consistent, the graph simply displays the numerical order in which the versions were released. The right-most point is for version 2.15.2, the last version released in 2012.
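For readers who want to see what a quadratic fit of this kind involves, here is a minimal sketch in R. The package counts are made-up placeholder values, not the CRAN data behind Figure 10, and the names `release_index` and `n_packages` are chosen only for illustration.

```r
# Hypothetical data: release order (1, 2, 3, ...) and invented package counts.
# These are NOT the real CRAN numbers plotted in Figure 10.
release_index <- 1:12
n_packages <- c(110, 130, 170, 220, 280, 360, 450, 560, 680, 810, 960, 1120)

# Fit a second-degree polynomial: count = b0 + b1 * index + b2 * index^2
quad_fit <- lm(n_packages ~ release_index + I(release_index^2))

# R-squared measures how closely the parabola tracks the counts
r_squared <- summary(quad_fit)$r.squared
print(round(r_squared, 3))

# Plot the points with the fitted curve overlaid
plot(release_index, n_packages, xlab = "Release order", ylab = "Packages")
lines(release_index, fitted(quad_fit))
```

Using the release order rather than the version number as the predictor keeps the spacing on the x-axis even, which is the same choice described above.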

As rapid as this growth has been, the data in Figure 10 represents only the main CRAN repository. R does have eight other software repositories, such as the one at http://www.bioconductor.org/, that are not included in this graph. A program run on 3/19/2013 counted 6,275 R packages at all major repositories, 4,315 of which were at CRAN. So about 31% of all packages live outside CRAN, and the growth curve for the software at all repositories would be roughly 45% higher on the y-axis than the one shown in Figure 10. As with any analysis software, individuals also maintain their own separate collections, typically available on their web sites.
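The relationship between the two counts can be checked with a few lines of R. The sketch below separates two figures that are easy to confuse: the share of all packages hosted outside CRAN, and the amount by which the all-repository total exceeds the CRAN count. The variable names are illustrative.

```r
# Package counts reported for 3/19/2013
all_repos <- 6275   # all major repositories
cran_only <- 4315   # CRAN alone

# Share of all packages that are hosted outside CRAN
non_cran_share <- (all_repos - cran_only) / all_repos

# How much taller the all-repository total is than the CRAN count
height_ratio <- all_repos / cran_only

print(round(100 * non_cran_share, 1))       # 31.2
print(round(100 * (height_ratio - 1), 1))   # 45.4
```

The second figure is the one that applies to the height of the curve, assuming the non-CRAN share has been similar across releases, which the counts above cannot confirm on their own.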

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In its most recent version, 9.3, SAS offers 100 programming statements, 258 procedures (Base, STAT, ETS, Graph, HP Forecasting, Macro, OR, QC), 520 SAS functions and call routines, and 314 IML statements, functions and subroutines, for a total of 1,192 items that are roughly equivalent to R functions. R packages contain a median of 5 functions (Rasmus Bååth, 12/2012 personal communication). Multiplying the 6,275 packages at all major repositories by that median gives approximately 31,375 R functions compared to SAS’ 1,192. *In fact, during 2012 alone, R added more functions/procs than SAS Institute has provided in its entire history!* That’s 701 packages, counting only CRAN, or around 3,505 new functions in 2012.
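The tallies in the paragraph above can be reproduced directly. This is only a restatement of the arithmetic in R, with illustrative variable names; it relies on the same packages-times-median shortcut used in the text.

```r
# SAS 9.3 items as counted in the text
sas_items <- c(statements = 100, procedures = 258,
               functions_and_call_routines = 520, iml_items = 314)
sas_total <- sum(sas_items)                     # 1192

# R estimate: number of packages times the median functions per package
median_functions <- 5
r_functions_all  <- 6275 * median_functions     # 31375, all repositories
r_functions_2012 <- 701 * median_functions      # 3505, CRAN additions in 2012

print(c(sas_total, r_functions_all, r_functions_2012))
print(r_functions_2012 > sas_total)             # TRUE
```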

Of course these R functions and SAS procedures / functions are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, giving them potentially more output per command. However, R functions can nest inside one another, creating nearly infinite combinations of output. While the comparison is not perfect, it is certainly an eye opener.

Stay tuned for future updates which will include what employers are now advertising for and recent trends in academic use of analytic software.

SAS was state of the art in 1979. It is still state of the art – for 1979. The biggest problem I have with SAS is the licence. I often have to run large simulations in the cloud, and I just can’t scale SAS that way.

I don’t know if SAS is available at all on clouds like Amazon’s. I do know that for HPC clusters SAS was unaffordable in academia until recently when they re-established their unlimited-copies license. So academia can now afford it but I suspect commercially it’s still charged on a per-core basis. That gets expensive fast on a cluster.

Dear Bob

I am beginning to use SAS and R. Now I want to write a program for my data using the Random Forest method, but I don’t know how to use R packages to input data for this method.

Please, can you help me?

with best regards

Awaz

Hi Awaz,

I don’t have time to provide one-on-one consulting but you’re likely to find help at sites like these:

All the official R discussion lists: http://www.r-project.org/mail.html

Just about RStudio: http://support.rstudio.org/

R Programming: http://stackoverflow.com

Statistics in R: http://stats.stackexchange.com/ (probably the best for your question).

Cheers,

Bob

Of those 31,000 R functions, there is bound to be a lot of duplication.

On the other hand, there are some fairly basic mixed models that you can’t fit in R that you could fit in PROC MIXED 15+ years ago. (Toeplitz, anyone?) With the work on distributed mixed models that SAS is currently developing, R is falling even farther behind in this area. I think this demonstrates that good, comprehensive, robust mixed modeling is very hard, and needs full-time support, not part-time volunteers.

There are indeed functions in R that are duplicates. My books show how to do most analyses twice: once using R’s built-in functions, whose output is sparse and easily used in further analysis, and again using add-ons that mimic the more comprehensive output offered by SAS, SPSS or Stata. That said, the biggest part of my SAS count includes IML commands that are duplicates of SAS functions such as mean, median, min, max, trig, etc.

I think the main reason R has so many more functions is that people have written packages for many niches, such as those for the analysis of data from bioinformatics, chemistry and physics. For example, since bioinformaticians use R heavily, there are currently 1,297 packages devoted just to their type of data.

You make a good point that even given R’s huge number of functions, SAS does some things that R doesn’t do, including a Toeplitz covariance matrix in a mixed model. However, I’m under the impression that in the area of large mixed models, SAS’ new HPMIXED procedure is just now catching up with R. An example analysis done in R’s lme4 package had 5,212,017 observations. The model had nearly 2 million levels of random effects (teachers & students) and 29 fixed-effects parameters.

Both SAS’s HPMIXED and lme4 are far behind ASREML. More than a decade. In 1999, ASREML could fit large, complex-variance models that are still impossible in SAS or R.

Here is one example with a complex variance structure:

http://stats.stackexchange.com/questions/18709/lme4-or-other-open-source-r-package-code-equivalent-to-asreml-r

Here is another example using a very simple model with what these days would be considered a fairly modest dataset (24,000 rows).

https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20090806/c933f190/attachment.pl

This was before 64-bit R, so I tried the example again.

R> library(asreml)

R> system.time(m1 […]

R> system.time(m0 <- lmer(y~1+L+(1|H), data=dat))

ASREML used less than 2 seconds. I killed lmer after half an hour.

To Paul C: I posit that I know more about mixed models than you. English too. I've presented some evidence. Show me yours. :-) I did not buy SAS. Cheers.

Hi Kevin,

Very interesting! I hadn’t heard of ASREML.

Thanks,

Bob

Honestly I’m pretty sure you don’t know nothing about R. Because I’m working on mixed models since many time ago and there aren’t any SAS procedures that you cannot replicate in R. In fact, I think you’re only trying to justify you stupid expensive buy (if you bought your SAS software). I’m so sorry, SAS is becoming a dead software, exactly as SPSS is going to be nothing.

This is a useful blog; it would be a shame to see it degenerate into abuse.

Paul C., looking at help(corClasses) in the nlme package, I see many structures, but not Toeplitz. How do you handle this in R?

If R packages have a median of 5 functions, it does not follow that you can multiply by the number of packages to get the total. I suspect the total is higher; however, R functions can be fairly trivial and are usually much less complex than SAS procedures.

The mean number of functions in a package is 20.2 as of December, 2012. R functions can be trivial, but I did include 518 SAS functions and quite a few IML ones, so half the SAS count is similarly trivial. I agree that most SAS procs are much more complex than R modeling functions, but R’s ability to nest functions helps make up for that. It’s a far from perfect comparison, but I found the scale of it quite interesting.
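The median-versus-mean point raised above is easy to see with a toy example. The sketch below uses an invented, right-skewed vector of function counts (not real package data) and then applies both multipliers to the figures quoted in this thread. Pairing the March 2013 package count with the December 2012 mean is itself only a rough assumption.

```r
# Hypothetical, right-skewed function counts for seven packages
funcs_per_package <- c(1, 2, 3, 5, 8, 20, 102)

n <- length(funcs_per_package)
true_total <- sum(funcs_per_package)            # 141
via_median <- n * median(funcs_per_package)     # 7 * 5 = 35, far too low
via_mean   <- n * mean(funcs_per_package)       # recovers the true total

print(c(true_total, via_median, via_mean))

# The same two multipliers applied to the counts quoted in this thread
packages <- 6275
estimate_median <- packages * 5       # 31375
estimate_mean   <- packages * 20.2    # about 126755
print(c(estimate_median, estimate_mean))
```

Only the mean times the number of packages reproduces a total exactly; the median gives a deliberately conservative figure when a few large packages dominate.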

Some concerns about the data and assumptions:

1) Duplicate R functions as well as packages (including deprecated packages on non-CRAN repos).

2) Ignoring options within SAS procs (as separate functions) versus parameters in R functions. Are you counting them (like proc sort vs. proc sort nodupe)? The SAS language has a lot more parameters per proc than R functions do (not really a good thing, but it exists).

3) Ignoring usage and redundancy within SAS procedures as well. proc freq and proc means are counted as 2 procs, and proc report, proc summary and proc univariate as 3 procs. There are multiple ways to do the same thing within R (and in SAS to a lesser extent).

4) Why so biased against the SAS language? Being pro-R doesn’t mean being anti-SAS.

5) SAS is doing rather well financially for such a slow (and apparently obsolete) language.

Probably the reason SAS is doing well is because it is ‘entrenched’ (for lack of a better word). I don’t think this post is anti-SAS. If anything, I see it as pro-R. I guess SAS is a big target, so comparison against SAS is probably what happens naturally.

yes. I think you are right.

1) Good point, but as I mentioned above, there are quite a few duplicate functions between SAS and IML too. Note that the growth I mentioned in 2012 is CRAN only, so it excludes any deprecated packages on non-CRAN repositories.

2) I’m not counting options on SAS procs or arguments on R functions. If you can think of a way to do so, I’d love to!

3) Excellent point. However, I suspect there is more duplication within R than within SAS.

4) That’s funny, when I predicted that R use would NOT exceed that of SAS or SPSS in the foreseeable future, even within the more R-oriented academy, people wrote to ask me why I was so biased against R! (See http://r4stats.com/2012/05/09/beginning-of-the-end/). SAS is top quality software and whenever my clients prefer it, I’m quite happy to use SAS.

5) Yes, indeed. I don’t think SAS honchos Goodnight and Sall are losing much sleep worrying about R. Their annual SASware Ballot has kept them focused on customer needs. Other than the disastrous 9.2 installation headaches (mostly resolved in 9.3) and their ongoing GUI chaos, I think they’ve done a good job.

It is hardly fair to characterise people like Pinheiro and Bates as part-time volunteers. They are serious researchers who have made significant developments in methodology and who choose to implement their techniques in R because that is the best way of getting them used (and their papers cited!). That is one of the ways high quality R packages get developed.

I probably am anti-SAS – at least anti the SAS licence. It is simply more money than I think is reasonable given the ageing design of the language. I have to buy SAS licences for my business because I have some clients who insist on it. But the licence is very restrictive, and it really annoys me that I have to pay out so much cash for a product that has such ugly syntax and such limited capability to move beyond a selection of prepackaged techniques somebody else thinks I should be using.

Sure I could use IML – but at an additional cost and there is so much more for me to modify and tweak to my needs in R.

One of the reasons SAS does so well is because of the widespread assumption in the pharmaceutical industry that the FDA mandates its use.

R’s requirement that all data structures are in memory (no longer quite true) used to be a serious limitation for large scale data mining. We used to rely on SAS for some problems precisely because of that. It is much less of an issue given modern computer architecture.

How did you count the functions in R and the procs in SAS to begin with?

Bob’s Reply:

Ajay added this in a reply to a reply so I can’t reply to that so I’m answering it by appending this to his question. I used an R program to count R packages, then multiplied by the median number of functions in a package, 5. The mean number is 20.2 if you care to use that. The SAS commands I counted by going to each SAS manual online and copying the names of each proc or function and pasting them into an Excel spreadsheet. Since they’re usually listed all in a column, that’s not hard. -Bob Muenchen
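The counting program itself is not shown here, so the following is only a sketch of one way such a count could be made, not the program that was actually used. It assumes each repository listing is a matrix with a "Package" column, the shape returned by `available.packages()`. The helper name `count_unique_packages` and the two tiny listings are made up for illustration.

```r
# Count distinct package names across several repository listings.
# Each listing is a matrix with a "Package" column, the shape that
# available.packages() returns.
count_unique_packages <- function(listings) {
  all_names <- unlist(lapply(listings, function(m) m[, "Package"]))
  length(unique(all_names))
}

# Offline example with two tiny made-up listings that share one package
repo_a <- cbind(Package = c("alpha", "beta", "gamma"))
repo_b <- cbind(Package = c("gamma", "delta"))
n_unique <- count_unique_packages(list(repo_a, repo_b))
print(n_unique)   # 4
```

With network access, a live listing could be fetched with `available.packages(repos = "https://cran.r-project.org")`, which returns one row per package currently available for the running version of R. Counting unique names avoids double-counting a package that appears in more than one repository.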

Wouldn’t the time period between major releases also be important in determining how the growth rate in package contributions has changed? Is it similar or has it changed over time?

New versions tend to come out once around April or May and again in October or November. This is complicated by the fact that the number of packages is overwritten for a 0.01 release. For example, if you look now, you will only see the number of packages for 2.15.3 (a 2013 release), which overwrote the number for 2.15.2 (the last release in 2012). Since version numbers also change by variable units, I used an index of simply 1, 2, 3… where each unit represents six months of time.

Sorry, be aware. SAS is going to be dead. Even if you think different, SAS is going to be dead. Rest in peace f#$@%&g expensive software, rest in peace dear SAS

Your cogent argument has convinced me. Thanks for laying it out so clearly.

Bob,

Can you point me to the lme4 example with 5 million observations and a random effect with over 2 million levels? The examples I found for lme4 are generally on a smaller scale (still thousands of levels).

Thanks

That example is cited in Douglas Bates’ workshop notes:

http://lme4.r-forge.r-project.org/slides/2011-03-16-Amsterdam/1Simple.pdf (search for Doran).

I wrote to him and he said the data were not publicly available, but that people interested in getting large data sets to try out should contact the R-SIG-Mixed-Models mailing list:

https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models.

Could you point me to the methodology you followed to get the number of available packages for each R release? I have been looking here: https://svn.r-project.org/R/branches/ thinking that there might be clues.

Hi aa,

That’s the correct location. It’s pretty tedious digging around in there though!

Cheers,

Bob
