*by Bob Muenchen*

In my ongoing quest to *analyze the world of analytics*, I’ve updated the Growth in Capability section of The Popularity of Data Analysis Software. To save you the trouble of foraging through that tome, I’ve pasted it below.

**Growth in Capability**

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/, and I collected the data for later versions following his method.

Figure 9 shows the number of R packages on CRAN for the last version released in each year. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995). The right-most point is for version 3.1.2, the last version released in late 2014.

To put this astonishing growth in perspective, let us compare it to the dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2014, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2014 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course, while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4, and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Figure 9. A program run on 5/22/2015 counted 8,954 R packages at all major repositories, 6,663 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates of CRAN packages that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 34.4% higher on the y-axis than the one shown in Figure 9. Therefore, the estimated total growth in R *functions* for 2014 was 27,642 * 1.344, or about 37,151.
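The adjustment from CRAN-only counts to all repositories is simple arithmetic, and a short Python sketch makes it easy to check. The function names here are hypothetical, the package counts are the ones quoted above, and the starting figure of 27,642 is the CRAN-only estimate of functions added in 2014 given earlier in this section.

```python
# Package counts quoted in the text (program run on 5/22/2015).
ALL_REPOS_PACKAGES = 8954   # all major repositories, GitHub excluded
CRAN_PACKAGES = 6663        # CRAN only


def repo_scale_factor(all_repos: int, cran: int) -> float:
    """Ratio of packages at all repositories to packages at CRAN alone."""
    return all_repos / cran


def scale_to_all_repos(cran_only_count: float, factor: float) -> float:
    """Scale a CRAN-only count up to an all-repository estimate."""
    return cran_only_count * factor


factor = repo_scale_factor(ALL_REPOS_PACKAGES, CRAN_PACKAGES)
percent_higher = (factor - 1) * 100
print(f"All repositories hold about {percent_higher:.1f}% more packages than CRAN")

# CRAN-only estimate of functions added in 2014, as given earlier.
cran_functions_2014 = 27642
# The factor is rounded to three decimals to match the 1.344 used in the text.
estimate = scale_to_all_repos(cran_functions_2014, round(factor, 3))
print(f"Estimated functions added in 2014, all repositories: {estimate:,.0f}")
```

The scaling assumes that packages outside CRAN contain, on average, as many functions as CRAN packages do, which the counts themselves cannot confirm.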

As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 20.37 functions per package. Since a program run on 5/22/2015 counted 8,954 R packages at all major repositories except GitHub, on that date there were approximately 182,393 total functions in R. In total, R has *over 150 times* as many commands as SAS.
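For readers who want to check these figures, here is a rough Python sketch of the estimate. The helper name is hypothetical, and the inputs are simply the numbers quoted in this section: 20.37 functions per package, 8,954 packages, 1,357 packages added on CRAN in 2014, and roughly 1,200 SAS commands.

```python
AVG_FUNCTIONS_PER_PACKAGE = 20.37   # average reported by Rdocumentation
PACKAGES_ALL_REPOS = 8954           # counted on 5/22/2015, GitHub excluded
CRAN_PACKAGES_ADDED_2014 = 1357     # CRAN only
SAS_COMMANDS = 1200                 # approximate count for SAS 9.3


def estimate_functions(packages: int, avg_per_package: float) -> int:
    """Estimate a function count as packages times average functions per package."""
    return round(packages * avg_per_package)


total_functions = estimate_functions(PACKAGES_ALL_REPOS, AVG_FUNCTIONS_PER_PACKAGE)
added_2014 = estimate_functions(CRAN_PACKAGES_ADDED_2014, AVG_FUNCTIONS_PER_PACKAGE)
ratio_to_sas = total_functions / SAS_COMMANDS

print(f"Estimated total R functions: {total_functions:,}")
print(f"Estimated functions added on CRAN in 2014: {added_2014:,}")
print(f"R functions per SAS command: about {ratio_to_sas:.0f}")
```

Because the SAS count is itself approximate, the ratio is best read as an order of magnitude rather than a precise figure.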

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do R training on-site.

That is because R is so pedantic.

SAS is far more versatile, powerful and especially creative.

That is why it is on the increase in both the academic and commercial world, while R has plateaued or even decreased due to competition from other products such as Perl and Python.

Deny it, if you can!

Enjoy the Holiday weekend.

Mark Ezzo

Columbus Consulting Corporation

O: 610-666-1492

C: 267-261-5560

Hi Mark,

I agree that SAS is wonderfully versatile and powerful software. I especially like the latest release of SAS Studio. I wish RStudio offered as many features! SAS Institute’s own data show that overall SAS revenue is growing. However, in the main Popularity article I provide data that show usage of R is growing rapidly in academia (Figure 2e) while the academic use of SAS is declining. I make it clear how I collect all my data to enable people to disagree using facts rather than opinion. Where is your data? I hope you enjoy the holiday weekend too; three days!

Cheers,

Bob

SAS has not expanded its footprint in either the academic or commercial markets. Its revenue growth is virtually flat, below expectations of its own management.

The functional comparison is misleading. Many R functions are redundant; where SAS has a single function for mixed linear models, R has many. There are virtues in that — if you are really into the details of mixed linear models, the differences may be important — but for many users it’s just noise.

Regards,

Thomas

Hi Thomas,

That’s an excellent point. Certainly no single person could ever master the vast array of functions that R offers, nor would one person ever need to. The same is true of SAS. SAS offers enough capability to meet the needs of the vast majority of researchers. In addition, when SAS Institute bothers to include a method of analysis, you know it has gone through a vetting process that indicates that it is a method that is important to a wide audience. New R functions come out at such an amazing pace that it’s hard to know which ones to learn and which are unlikely to become widely used. But if you happen to need one of those niche functions, R is the tool that’s much more likely to have it.

Cheers,

Bob

The informed user needs to remember that ALL the functions/PROCS in SAS are vetted by professional statisticians and numerical analysts; this is quality software. Many (most?) of the packages in R are written by persons with no knowledge of numerical analysis — many are written by self-taught programmers. Caveat Emptor! (Before anyone flames me, reread what I have written: I refer to the packages, not to the base-core.) Question: how do you distinguish between the packages written by people who know numerical analysis and the packages written by people who have never heard of Thisted or Kennedy and Gentle (standard texts on statistical computing)?

Hi B.D.,

You raise a critically important point. The main R download is thoroughly validated as described here: http://www.r-project.org/doc/R-SDLC.pdf. That document lists the packages, and R’s help() function will tell you which package any particular function is in. In version 3.2.0, the validation process covers 3,672 functions that are roughly the equivalent to:

Base SAS, GRAPH, STAT, ETS, IML, and some of Enterprise Miner. From that set it’s missing Structural Equation Modeling, Multiple Imputation, and its various graphical user interfaces such as Enterprise Guide and SAS Studio.

For SPSS users, it’s roughly the equivalent to IBM SPSS Base, Statistics, Advanced Statistics, Regression, Forecasting, Decision Trees, Neural Networks and Bootstrapping. Missing from that set is the SPSS graphical user interface.

So those commands you can count on for accuracy. Many other packages are based on books or journal articles that have passed the peer review process. When that’s the case, the functions are likely reliable. In fact, SAS and SPSS programmers probably followed those same books and journal articles aiming to get the same answers. However, many more come from sources of unknown accuracy and I recommend investigating them carefully before using them, just as you would a SAS or SPSS macro that you found on someone’s web site.

Cheers,

Bob

Bob,

SAS programmers (I know this for a fact) and I suspect SPSS programmers as well do NOT program from books and articles. For example, books typically present the least squares estimator as b = (X’X)^-1X’y but it should not be programmed this way. For further details see McCullough and Vinod, “The Reliability of Econometric Software,” Journal of Economic Literature, 1999.
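To illustrate the point, here is a minimal Python sketch that solves a least squares problem without ever forming (X’X)^-1. It uses a modified Gram-Schmidt QR factorization as one stable alternative; this is not a claim about what SAS or SPSS actually do internally, and both function names are hypothetical. R’s own lm() also takes a QR-based route. The textbook formula is included only for contrast, restricted to a two-column X.

```python
import math


def qr_least_squares(X, y):
    """Solve min ||X b - y|| via modified Gram-Schmidt QR, with no explicit inverse."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    Q = []                                # orthonormal columns, built one at a time
    R = [[0.0] * p for _ in range(p)]     # upper-triangular factor
    for j in range(p):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along each earlier orthonormal column.
            R[k][j] = sum(Q[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(n)]
        R[j][j] = math.sqrt(sum(vi * vi for vi in v))
        Q.append([vi / R[j][j] for vi in v])
    # Solve R b = Q'y by back-substitution.
    qty = [sum(Q[j][i] * y[i] for i in range(n)) for j in range(p)]
    b = [0.0] * p
    for j in reversed(range(p)):
        s = sum(R[j][k] * b[k] for k in range(j + 1, p))
        b[j] = (qty[j] - s) / R[j][j]
    return b


def normal_equations_2col(X, y):
    """Textbook b = (X'X)^-1 X'y for a two-column X, shown only for contrast."""
    a = sum(r[0] * r[0] for r in X)
    c = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    u = sum(r[0] * yi for r, yi in zip(X, y))
    v = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - c * c                   # this subtraction loses digits when X is ill-conditioned
    return [(d * u - c * v) / det, (a * v - c * u) / det]


# Small, well-conditioned example: y = 1 + 2t exactly.
X = [[1.0, float(t)] for t in range(5)]
y = [1.0 + 2.0 * t for t in range(5)]

b_qr = qr_least_squares(X, y)
b_ne = normal_equations_2col(X, y)
print("QR:", b_qr)
print("Normal equations:", b_ne)
```

On well-conditioned data like this, the two approaches agree. The difference appears when the columns of X are nearly collinear, because forming X’X roughly squares the condition number of the problem.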

Peer review is no assurance of accuracy or correctness, and peer review does not end debate. That an article is peer-reviewed only means that someone (the referees) thinks the debate can begin.

Regards,

Bruce

Bruce,

That’s a good example of a formula that an author may use in the text because it’s easy for readers to understand. But I would hope that if a professor knew enough to develop a new statistical method and program it into an R package, that s/he would know that that formula is terrible for use in programming. I assumed that SAS or SPSS programmers would study the code in the R package, not blindly follow formulas that are likely to lead to trouble. The most recent reference of which I am aware for the accuracy of R, SAS, etc. is this article, which builds on your earlier work: A comparative study of the reliability of nine statistical software packages, Keeling & Pavur, Computational Statistics & Data Analysis, Volume 51, Issue 8, 1 May 2007, Pages 3811–3831.

Best regards,

Bob

I think that having 150 times the number of commands, or having more than 5,000 packages, is fine for a geek mentality, but when trying to make inroads into corporate or structured environments like clinical, where metadata takes center stage, it creates a major maintenance problem. I think a better course is to tout the virtues of some of the better packages (e.g., dplyr, ggplot) and create an 80/20 environment that is manageable.

Also this whole business of the internet top xx ranking of everything really disguises the true importance of data science and analytics.

Just my opinion.

Hi Ralph,

It does create a maintenance problem! The folks at RStudio address that with their packrat package (http://rstudio.github.io/packrat/) and Revolution Analytics (now Microsoft) addresses it with their Managed R Archive Network. Neither is a perfect solution, but they’re way better than the lack of any solution that preceded their work.

With regard to the true importance of data science and analytics, once someone appreciates that, the next thing they want to know is which tools work well. That’s a really multidimensional question, but finding out which have large market share and which are growing or declining in a particular market is one way to address it. That’s what much of my work is aimed at.

Cheers,

Bob

I have been having problems working on nonlinear models such as Wood, Wilmink, and Dijkstra in R for modelling lactation curves. How does R handle this?

Hi Oludayo,

I don’t have time to answer such specific questions, but some great question & answer web sites are listed on the right-hand side of my web site under “Q & A Sites”.

Best regards,

Bob