In my ongoing quest to *analyze the world of analytics*, I’ve updated the Growth in Capability section of The Popularity of Data Analysis Software. To save you the trouble of foraging through that tome, I’ve pasted it below.

**Growth in Capability**

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/, and I collected the data for later versions following his method.

Figure 9 shows the number of R packages on CRAN for the last version released in each year. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995). The right-most point is for version 3.1.2, the last version released in late 2014.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version, 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2014, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2014 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in

Figure 9. A program run on 5/22/2015 counted 8,954 R packages at all major repositories, 6,663 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates to CRAN that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 34.4% higher on the y-axis than the one shown in Figure 9. Therefore, the estimated total growth in R *functions* for 2014 was 28,260 * 1.344 or 37981.

As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 20.37 functions per package. Since a program run on 5/22/2015 counted 8,954 R packages at all major repositories except GitHub, on that date there were approximately 182,393 total functions in R. In total, R has *over 150 times* as many commands as SAS.

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do R training on-site.

]]>

Analytics tools take significant effort to master, so once learned people tend to stick with them for much of their careers. This makes the tools used in academia of particular interest in the study of future trends of market share. I’ve been tracking The Popularity of Data Analysis Software regularly since 2010, and I thanks to an astute reader, I now have a greatly improved estimate of Stata’s academic growth. Peter Hedström, Director of the Institute for Analytical Sociology at Linköping University, wrote to me convinced that I was underestimating Stata’s role by a wide margin, and he was right.

Two things make Stata’s popularity difficult to guage: 1) Stata means “been” in Italian, and 2) it’s a common name for the authors of scholarly papers and those they cite. Peter came up with the simple, but very effective, idea of adding Statacorp’s headquarter, College Station, Texas, to the search. That helped us find far more Stata articles while blocking the irrelevant ones. Here’s the search string we came up with:

("Stata" "College Station") OR "StataCorp" OR "Stata Corp" OR "Stata Journal" OR "Stata Press" OR "Stata command" OR "Stata module"

The blank between Stata and College Station is an implied logical “and”. This string found 20% more articles than my previous one. This success motivated me to try and improve some of my other search strings. R and SAS are both difficult to search for due to how often those letters stand for other things. I was able to improve my R search string by 15% using this:

"r-project.org" OR "R development core team" OR "lme4" OR "bioconductor" OR "RColorBrewer" OR "the R software" OR "the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR "knitr" OR "RODBC" OR "stringr" OR "mass package"

Despite hours of effort, I was unable to improve on the simple SAS search string of “SAS Institute.” Google Scholar’s logic seems to fall apart since “SAS Institute” OR “SAS procedure” finds fewer articles! If anyone can figure that out, please let me know in the comments section below. As usual, the steps I use to document all searches are detailed here.

The improved search strings have affected all the graphs in the Scholarly Articles section of The Popularity of Data Analysis Software. At the request of numerous readers, I’ve also added a log-scale plot there which shows the six most popular classic statistics packages:

If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop,

R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do training on-site but I’m often booked about 8 weeks out.

I invite you to follow me on this blog and on Twitter.

]]>

It would be useful to have growth trend graphs for each of the analytics packages I track, but collecting such data is too time consuming since it must be re-collected every year (since search algorithms change). What I’ve done instead is collect data only for the past two complete years, 2013 and 2014. Figure 2e shows the percent change from 2013 to 2014, with the “hot” packages whose use is growing shown in red. Those whose use is declining or “cooling” are shown in blue. Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 100 articles in 2013. Going from one to five articles may represent 500% growth, but it’s not of much interest.

The three fastest growing packages are all free and open source: Python, R and KNIME. All three saw more than 25% growth. Note that the Python figures are strictly for analytics use as defined here. At the other end of the scale are SPSS and SAS, both of which declined in use by around 25%. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

Three of the packages whose use is growing implement the powerful and easy-to-use workflow or flowchart user interface: KNIME, RapidMiner and SPSS Modeler. As useful as that approach is, it’s not sufficient for success as we see with SAS Enterprise Miner, whose use declined nearly 15%.

It will be particularly interesting to see what the future holds for KNIME and RapidMiner. The companies were two of only four chosen by the Gartner Group as having both a complete vision of the future and the ability to execute that vision (Fig. 7a). Until recently, both were free and open source. RapidMiner then started charging for its current version, leaving its older version as the only free one. Recent offers to make it free for academic use don’t include use on projects with grant funding, so I expect KNIME’s higher rate of growth to remain faster than RapidMiner’s. However, in absolute terms, scholarly use of RapidMiner is currently almost twice that of KNIME, as shown in Fig. 2b.

]]>

So what happened? We’re looking back across many years, so while it’s possible that SPSS suddenly became much more popular in 2014, that could not account for lifting the whole trend line. It’s possible Google Scholar improved its algorithm to find articles that existed previously. It’s also possible that new journal archives have opened themselves up to being indexed by Google. Why would it affect SPSS more than SAS or R? SPSS is menu-driven so it’s easy to install with its menus and dialog boxes translated into many languages. Since SAS and R are much more frequently used via their English-based languages, they may not be as popular in non-English speaking countries. Therefore, one might see a disproportionate impact on SPSS by new non-English archives becoming available. If you have an alternate hypothesis, please leave it in the comments below.

The remainder of this post is the complete updated section on this topic from The Popularity of Data Analysis Software:

**Scholarly Articles**

The more popular a software package is, the more likely it will appear in scholarly publications as a topic or as a tool of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect and will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.

Figure 2a shows the number of articles found for each software package for the most recent complete year, 2014. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. The software from Java through Statgraphics show a slow decline in usage from highest to lowest. Note that the general purpose software C, C++, C#, MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest.

From RapidMiner on down, the counts appear to be zero. That’s not the case, the counts are just very low compared to the more popular packages, used in tens of thousands articles. Figure 2b shows the software only for those packages that have fewer than 825 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. RapidMiner, KNIME, SPSS Modeler and SAS Enterprise Miner are packages that all use the powerful and easy-to-use workflow interface, but their use has not yet caught on among scholars. BMDP is one of the oldest packages in existence. Its use has been declining for many years, but it’s still hanging in there. The software in the bottom half of this figure contain the newcomers, with the notable exception of Megaputer, whose Polyanalyst software has been around for many years now.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2c I’ve plotted the same scholarly-use data for 1995 through 2014, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown. This is likely due to the fact that those two leaders faced increasing competition from many more software packages than can be shown in this type of graph (such as those shown in Figure 2a).

Since SAS and SPSS dominate the vertical space in Figure 2c by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, with the result shown in Figure 2d. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. If the current trends continue, the use of R may pass that of SPSS and SAS by the end of 2016. Note that current trends have shifted before as discussed here.

Stata has moved into fourth place, crossing above Statistica in 2014. The growth in the use of Stata is more rapid than all the classic statistics packages except for R. The use of Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another. [Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”]

I’ll announce future update on Twitter, where you can follow me as @BobMuenchen.

]]>

**Survey Link: www.rexeranalytics.com/Data-Miner-Survey-2015-Intro.html**

**Access Code: ****R8M4E2**** **

Survey results will be unveiled at the Fall-2015 Boston Predictive Analytics World event.

Rexer Analytics has been conducting the Data Miner Survey since 2007. Each survey explores the analytic behaviors, views and preferences of data miners and analytic professionals. Over 1,200 people from around the globe participated in the 2013 survey. Summary reports (40 page PDFs) from previous surveys are available FREE to everyone who requests them by emailing DataMinerSurvey@RexerAnalytics.com. Also, highlights of earlier Data Miner Surveys are available at www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html, including best practices shared by respondents on analytic success measurement, overcoming data mining challenges, and other topics. The FREE Summary Report for this 2015 Data Miner Survey will be available to everyone Fall-2015.

Please help spread the word.

*Rexer Analytics is a consulting firm focused on providing data mining and analytic CRM solutions. Recent solutions include customer loyalty analyses, customer segmentation, predictive modeling to predict customer attrition and to target direct marketing, fraud detection, sales forecasting, market basket analyses, and complex survey research. More information is available at www.RexerAnalytics.com or by calling +1 617-233-8185.*

]]>

Each year, the Gartner Group, “the world’s leading information technology research and advisory company”, collects data in a survey of the customers of 42 business intelligence firms. They recently released the data on the customers’ plans to discontinue use of their current software in one to three years. The results are shown in the figure below. Over 16% of the SAS Institute customers surveyed reported considering discontinuing their use of the software, the highest of any of the vendors shown. It will be interesting to see if this will actually lead to an eventual decline in revenue. Although I have helped quite a few organizations migrate from SAS to R, I would be surprised to see SAS Institute’s revenue decline. They offer excellent software and service which I still use, though not anywhere near as much as R.

The full Gartner report is available here.

]]>

My next two live webinars done in partnership with Revolution Analytics are in January:

R for SAS, SPSS and Stata Users

Managing Data with R (updated to include dplyr, broom, tidyr, etc.)

Course outlines and registration for both is here.

My R for SAS, SPSS and Stata Users workshop is also now available as a self-paced interactive video workshop at DataCamp.com.

I do site visits in partnership with RStudio.com, whose software I recommend and use in every form of my training. If your company does its training through Xerox Learning Services, I also partner with them. For further details or to arrange a site visit, you can reach me at muenchen.bob@gmail.com.

]]>

]]>

Here is my latest update to *The Popularity of Data Analysis Software*. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are:

- As I covered in my talk at the UseR 2014 meeting, it is very likely that during the summer of 2014, R became the most widely used analytics software for scholarly articles, ending a spectacular 16-year run by SPSS.
- Stata has probably passed Statistica in scholarly use, and its rapid rate of growth parallels that of R.

If you’d like to be alerted to future updates on this topic, you can follow me on Twitter, @BobMuenchen.

**Scholarly Articles**

The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a good leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect and will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, I recollect the data for all years following the protocol described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/.

Figure 2a shows the number of articles found for each software package for all the years that Google Scholar can search. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. Note that the general purpose software MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest. Neither C nor C++ are included here because it’s very difficult to focus the search compared to the search for jobs above, whose job descriptions commonly include a clear target of skills in “C/C++” and “C or C++”.

From RapidMiner on down, the counts appear to be zero. That’s not the case, but relative to the others, it might as well be.

Figure 2b shows the number of articles for the most popular six classic statistics packages from 1995 through 2013 (the last complete year of data this graph was made). As in Figure 2a, SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.

Since SAS and SPSS dominate the vertical space in Figure 2a by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP in Figure 2c. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. In fact, extending the downward trend of SPSS and the upward trend of R make it likely that sometime during the summer of 2014 R became the most dominant package for analytics used in scholarly publications. Due to the lag caused by the publication process, getting articles online, indexing them, etc. we won’t be able to verify that this has happened until well into 2015 (correction: this said 2014 when originally posted).

After R, Statistica is in fourth place and growing, but at a much lower rate. Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”

Extrapolating from the trend lines, it is likely that the use of Stata among academics passed that of Statistica fairly early in 2014. The remaining three packages, Minitab, Systat and JMP are all growing but at a much lower rate than either R or Stata.

]]>

**Growth in Capability**

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.

Figure 8 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.998). The right-most point is for version 3.0.2, the last version released in 2013.

As rapid as this growth has been, these data represent only the main CRAN repository. R does have eight other software repositories, such as the one at http://www.bioconductor.org/ that are not included in this graph. A program run on 4/7/2014 counted 7,364 R packages at all major repositories, 5,323 of which were at CRAN. So the growth curve for the software at all repositories would be roughly 38% higher on the y-axis than the one shown in Figure 8. As with any analysis software, individuals also maintain their own separate collections typically available on their web sites.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version, 9.3, SAS contains around 1,200 commands that are roughly equivalent to R functions (procs, functions etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). R packages contain a median of 5 functions (Rasmus Bååth, 12/2012 personal communication). Therefore R has approximately 36,820 functions compared to SAS’s 1,200. *In fact, during 2013 alone, R added more functions/procs than SAS Institute has written in its entire history!* That’s 835 packages, counting only CRAN, or around 4,175 functions. Of course these are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do. However, R functions can nest inside one another, creating nearly infinite combinations. Also, SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would be happy to list it here. While the comparison is not perfect, it does provide an interesting perspective on the size and growth rate of R.

]]>