Only last August I wrote that among scholars, the use of R had probably exceeded that of SPSS to become their most widely used software for analytics. That forecast was based on Google Scholar searches focused on one year at a time, from 1995 through 2014. Each year from 2010 through 2014, I re-collected that entire data set just in case Google changed the search algorithm enough to affect the overall pattern. The data stayed roughly the same for those four years, but Google Scholar now finds almost twice as many articles for SPSS (at its peak year of 2008) than it found last year and 12% more articles for SAS. Changes in search results for articles that used R varied slightly with fewer in the early years and more in the latter ones. So R did not become the most widely used analytics software among academics in 2014. It’s unlikely to become so for another two years, unless present trends change.
So what happened? We’re looking back across many years, so while it’s possible that SPSS suddenly became much more popular in 2014, that could not account for lifting the whole trend line. It’s possible Google Scholar improved its algorithm to find articles that existed previously. It’s also possible that new journal archives have opened themselves up to being indexed by Google. Why would it affect SPSS more than SAS or R? SPSS is menu-driven so it’s easy to install with its menus and dialog boxes translated into many languages. Since SAS and R are much more frequently used via their English-based languages, they may not be as popular in non-English speaking countries. Therefore, one might see a disproportionate impact on SPSS by new non-English archives becoming available. If you have an alternate hypothesis, please leave it in the comments below.
The remainder of this post is the complete updated section on this topic from The Popularity of Data Analysis Software:
The more popular a software package is, the more likely it will appear in scholarly publications as a topic or as a tool of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect and will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.
Figure 2a shows the number of articles found for each software package for the most recent complete year, 2014. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. The software from Java through Statgraphics show a slow decline in usage from highest to lowest. Note that the general purpose software C, C++, C#, MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest.
From RapidMiner on down, the counts appear to be zero. That’s not the case, the counts are just very low compared to the more popular packages, used in tens of thousands articles. Figure 2b shows the software only for those packages that have fewer than 825 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. RapidMiner, KNIME, SPSS Modeler and SAS Enterprise Miner are packages that all use the powerful and easy-to-use workflow interface, but their use has not yet caught on among scholars. BMDP is one of the oldest packages in existence. Its use has been declining for many years, but it’s still hanging in there. The software in the bottom half of this figure contain the newcomers, with the notable exception of Megaputer, whose Polyanalyst software has been around for many years now.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2c I’ve plotted the same scholarly-use data for 1995 through 2014, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown. This is likely due to the fact that those two leaders faced increasing competition from many more software packages than can be shown in this type of graph (such as those shown in Figure 2a).
Since SAS and SPSS dominate the vertical space in Figure 2c by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, with the result shown in Figure 2d. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. If the current trends continue, the use of R may pass that of SPSS and SAS by the end of 2016. Note that current trends have shifted before as discussed here.
Stata has moved into fourth place, crossing above Statistica in 2014. The growth in the use of Stata is more rapid than all the classic statistics packages except for R. The use of Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another. [Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”]
I’ll announce future update on Twitter, where you can follow me as @BobMuenchen.