R Passes SPSS in Scholarly Use, Stata Growing Rapidly

by Robert A. Muenchen

[Since this was originally published in 2014, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]

Here is my latest update to The Popularity of Data Analysis Software. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are:

  • As I covered in my talk at the UseR 2014 meeting, it is very likely that during the summer of 2014, R became the most widely used analytics software for scholarly articles, ending a spectacular 16-year run by SPSS.
  • Stata has probably passed Statistica in scholarly use, and its rapid rate of growth parallels that of R.

If you’d like to be alerted to future updates on this topic, you can follow me on Twitter, @BobMuenchen.

Scholarly Articles

The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a good leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, I re-collect the data for all years, following the protocol described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/.

Figure 2a shows the number of articles found for each software package for all the years that Google Scholar can search. SPSS is by far the dominant package, likely due to its balance between power and ease of use. SAS has around half as many, followed by MATLAB and R. Note that the general-purpose software MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest. Neither C nor C++ is included here because it’s very difficult to focus the search compared to the search for jobs above, whose job descriptions commonly include a clear target of skills in “C/C++” and “C or C++”.

From RapidMiner on down, the counts appear to be zero. That’s not the case, but relative to the others, they might as well be.

Figure 2a. Number of scholarly articles found for each software.

Figure 2b shows the number of articles for the six most popular classic statistics packages from 1995 through 2013 (the last complete year of data when this graph was made). As in Figure 2a, SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.

Figure 2b. Number of scholarly articles found for the top six classic statistics packages.

Since SAS and SPSS dominate the vertical space in Figure 2b by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, in Figure 2c. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. In fact, extending the downward trend of SPSS and the upward trend of R makes it likely that sometime during the summer of 2014 R became the dominant package for analytics used in scholarly publications. Due to the lag caused by the publication process, getting articles online, indexing them, etc., we won’t be able to verify that this has happened until well into 2015 (correction: this said 2014 when originally posted).

After R, Statistica is in fourth place and growing, but at a much lower rate. Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”

Extrapolating from the trend lines, it is likely that the use of Stata among academics passed that of Statistica fairly early in 2014. The remaining three packages, Minitab, Systat and JMP, are all growing but at a much lower rate than either R or Stata.

Figure 2c. Number of scholarly articles that reference each software by year, after removing the top two, SPSS and SAS, and adding the next two most popular, Systat and JMP.
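
The trend extrapolations described above (R overtaking SPSS, Stata overtaking Statistica) can be sketched in a few lines of Python. This is only an illustration of the general idea, assuming a simple straight-line fit to each package's recent yearly counts; it is not necessarily how the estimates in this article were produced. The function names and the article counts in the example are hypothetical placeholders, not the data behind the figures.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover_year(years, series_a, series_b):
    """Year at which the two fitted trend lines meet, or None if parallel."""
    slope_a, intercept_a = fit_line(years, series_a)
    slope_b, intercept_b = fit_line(years, series_b)
    if slope_a == slope_b:
        return None
    # Solve slope_a * x + intercept_a == slope_b * x + intercept_b for x.
    return (intercept_b - intercept_a) / (slope_a - slope_b)


# Placeholder counts (in thousands of articles), invented for illustration only.
years = [2010, 2011, 2012, 2013]
declining_package = [100, 80, 60, 40]
rising_package = [4, 12, 20, 28]

print(round(crossover_year(years, declining_package, rising_package), 1))
```

With these made-up numbers the two lines meet partway through the year after the last data point. A straight-line fit is the simplest possible choice; real counts may curve, and the most recent years are undercounted until indexing catches up, so any such estimate should be read as rough.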

 

27 thoughts on “R Passes SPSS in Scholarly Use, Stata Growing Rapidly”

  1. Very interesting. The series for SPSS looks very weird though. What could explain the huge increase in such a short space and then a rapid fall? This would seem to suggest that the total number of articles using stat software rose and fell from 2007. Figure 2c looks plausible, 2a does not.

    1. Hi Kevin,

      I think what’s going on with SPSS is that they did a fantastic user interface that is very easy to learn and use. Back when they did that, they had very little competition, mostly those packages shown in 2b. A huge proportion of the packages shown in 2a did not exist 10 years ago when SPSS use was still growing. One thing that makes 2a hard to believe is that SPSS was present and dominant for many more years than most of the other packages have existed. Figure 2b looks like the overall use of statistics packages is down, when in reality I’m sure it has increased. However, the increase is spread across so many more packages now.

      Cheers,
      Bob

  2. It is almost certain that R will continue its rise in the scholarly articles. It may be because of the popularity of R that SAS has decided to release University Edition for free. Whether this free version of SAS changes things remains to be seen. But setting up this free version of SAS is not a cup of tea. SAS demands a very high-level computer configuration to run on a PC. So this may be a major disadvantage of the free edition of SAS, as on the other side installation of R is very smooth. Also, this free edition does not contain all components of the original SAS.

    Only time will tell who will dominate the market ultimately.

  3. I’ve been considering building a “quant’s machine”, i5-4XXX with 32 gig of memory, and maybe a video card. I’d use the Intel chipset video first to see. I’ve got enough decent SSD already. Mobo, cpu, memory, psu, and case for such a machine can be had for under $1K. While SAS/SPSS can do Big Data (what inferential stat cares about that, anyway???) more easily since it processes in RBAR fashion, for even largish datasets, such a machine is sufficient.

    1. Hi Robert,

      With 32 GB of RAM you should be able to analyze many millions of records. However, I’d still be happier if R could analyze billions. I have quite a few colleagues who need to analyze samples of just a few thousand cases, but they have billions of records to sift through to get it. They have to use other tools to do that before turning to R for analysis.

      Cheers,
      Bob

  4. From personal experience at our university, Stata has become quite aggressive in its sales pitch, portraying open source (especially R) negatively and playing on the fears of staff/faculty who are not well educated in programming languages or statistical software, and like to have their hand held.

    1. That’s interesting since Stata is one of the few companies I’ve never gotten a sales contact from & I do the research software licensing for the University of Tennessee at all campuses. When I started tracking this data, I expected Stata would be the software hit hardest by competition from R. Most Stata users I know write programs (unlike SPSS point-and-clickers) and Stata is very similar to R in its extensibility. Instead perhaps that extensibility has helped its growth as it has R’s. I do find the Stata language much easier to learn.

      Cheers,
      Bob

  5. Interesting! And a bit disappointing, as I thought that R was catching up faster. What I find really weird is how far from a zero sum game the competition is. The total number of citations had a huge peak. Why? We certainly aren’t doing less data analysis. Was it the rise of Excel that meant data analysis was done by unmentioned software? Or something else that I’m missing?

    1. Hi Alan,

      Keep in mind that Fig 2b only shows six of the thirty packages, and there are probably twenty more that either don’t have enough sales volume for me to bother tracking or which don’t meet the criteria I state at the beginning of the main article (e.g. only works with vendor’s own database). I’m sure if we added them all up we’d see a trend of more analysis across time. I don’t have sales figures, but I’m not aware of a single company whose sales have declined over this period now that business intelligence is acknowledged as a key factor to success.

      Cheers,
      Bob

  6. For years, SPSS was strongly involved in the academic market, making student versions of SPSS available via textbook bundles for an additional $25 or so beyond the cost of the textbook alone. After IBM bought SPSS, it steered away from the academic market, stopped the inexpensive student version bundling, and went after the business market, even renaming the product PASW (Predictive Analytics SoftWare) for a while. I also found SPSS got more difficult to use for more advanced statistics, or as I like to put it, it made easy things easy, and hard things hard. R, on the other hand, does require some familiarity with a command line interface and simple programming, so in contrast to SPSS, the easy things are harder, but the harder things are easier. (In my view, writing a few lines of code is easier for more advanced work than clicking a bunch of different boxes on different pop-ups and remembering exactly how to do so.)

    Jim Lacey

    1. Hi Jim,

      Yes, IBM’s pricing has been very aggressive, increasing our cost by around 15% per year. The EDUCAUSE Software Licensing Issues Constituent Group Listserv has had many reports of pricing problems. IBM does have a program that makes some of their software free for teaching purposes, but it’s not very practical. Each professor has to contact them, get the software and distribute it to their students. That results in a crazy amount of work for a major university.

      I definitely agree with the oft-made comment about R that it makes easy things hard and hard things easy. The addition of the new dplyr package helps make some of those easy things easy again. I’ve just revamped my workshops extensively to include it.

      Cheers,
      Bob

  7. Using the second graph and adding up all the packages/scripting languages for 2007, one gets around 360,000 (225K for SPSS, 75K for SAS and another 4 at 15K each); doing the same for 2014, one gets around 165,000 (75K SPSS, 30K SAS, 25K R and another 3 at 35K combined).

    The main conclusion of the second graph would appear to be many more articles mentioned specific scripting languages or stats packages in 2007 than in 2014; unless there has been a precipitous drop in the use of stats scripting languages and software in general.

    Which I doubt.

    1. Hi Don,

      If I had the time to collect annual data on all 30 packages in the first graph, and then plotted them on the second, I think we’d see that there has been a transition from the older packages to the newer ones.

      Cheers,
      Bob

  8. Hi guys
    You are all experienced in SPSS, but I am new and a beginner in this subject. I read your whole conversation regarding SPSS; it’s really a healthy discussion… thank you for sharing.
