by Robert A. Muenchen
How can we measure the popularity or market share of analytic software? One way is to see what people are discussing. I’m in the process of updating my annual article, The Popularity of Data Analysis Software. Below is the newly updated Internet Discussion section. Don’t bother to read the rest of the main article yet unless you’re in a hurry; I’ve been collecting data for several of the other, more interesting plots and will have more to report in following posts. As always, I’m very interested in getting feedback. If you know of other discussion forums that I can collect data on without too much effort, please let me know.
Internet Discussion
There are some stable and objective measures regarding analytic software. Schwartz (2009) suggested estimating relative popularity by plotting the amount of email discussion devoted to each package. The most widely used packages all have discussion lists, or “listservs”, devoted to them. The less popular ones either do not have such discussions or, like the lists for Minitab or S-PLUS, may have only a dozen or so emails per year. Some software packages have multiple discussion lists. For example, there are 21 devoted to using R for various focused areas such as graphics, mapping, ecology, epidemiology, etc. (http://www.r-project.org/mail.html). A broader list, including a version of R-Help in Spanish, lists 49 discussions (https://stat.ethz.ch/mailman/listinfo).
Figure 1a shows the level of activity on only the main discussion listserv for each package (i.e. forums, news groups and Google groups are excluded). Each point represents the sum of the 12 monthly counts that occurred in that year. This plot contains data through the end of 2012. If you read this article in previous years, this plot used to display the mean number of emails per month rather than the sum. Therefore the scale of the y-axis is different but the relative locations of the points are virtually identical. I made this change to enable a better comparison to discussion forums (e.g. Fig. 1b).
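The switch from monthly means to yearly sums only rescales the y-axis: a year's sum is 12 times its monthly mean, so the ratio between any two packages is unchanged. A minimal sketch in Python illustrates this. The package names and monthly counts below are made up for illustration and are not the data behind Fig. 1a.

```python
# Hypothetical monthly email counts for two made-up lists (NOT real data).
monthly_counts = {
    "package_a": [100, 120, 90, 110, 105, 95, 130, 125, 115, 100, 110, 120],
    "package_b": [50, 60, 45, 55, 52, 48, 65, 62, 58, 50, 55, 60],
}


def yearly_sum(counts):
    """Total number of emails over the 12 months."""
    return sum(counts)


def monthly_mean(counts):
    """Mean number of emails per month."""
    return sum(counts) / len(counts)


sums = {name: yearly_sum(c) for name, c in monthly_counts.items()}
means = {name: monthly_mean(c) for name, c in monthly_counts.items()}

# The ratio between packages is the same on either scale,
# because each sum is simply 12 times the corresponding mean.
ratio_from_sums = sums["package_a"] / sums["package_b"]
ratio_from_means = means["package_a"] / means["package_b"]

print(sums)
print(means)
print(ratio_from_sums, ratio_from_means)
```

With 12 complete months per year, the two versions of the plot differ only by a constant factor on the y-axis.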
We can see that discussion of R has grown the most rapidly and, for the past few years, R is the most discussed software by an almost two-to-one margin. In recent years, it is followed by Stata, SAS and SPSS, respectively. Stata showed steady discussion growth until it passed SAS in 2010. SAS saw rapid growth in its discussion until 2006, when it leveled off and then declined. That decline coincided with the strong growth of both R and Stata, offering competition to SAS. SPSS held steady at a low rate across the time frame, which may be attributable to its great ease of use relative to the other packages. With both the interface and the documentation aimed at people who prefer GUIs over programming, there’s less need to ask how to do variations on an analysis. In fact, there’s less ability to do such variations. As a result, I doubt SPSS’ low showing in this graph is indicative of its popularity or market share.
It would be interesting to see what topics were most discussed on each list. The only such analysis of which I am aware was done by Arthur Tabachneck (2010) for the SAS list. The most popular topic in 2009 turned out to be…R! You can read his full analysis here under slides from the 2010 session.
In the last year or two, R and Stata joined SAS in the decline in listserv discussion. Given the sharp increase in the popularity of business analytics, Big Data, and so on, it is unlikely that people are using or talking about these tools less. Instead, alternative forums of discussion have appeared. The site Stack Overflow (http://stackoverflow.com) covers a wide range of programming and statistical topics, while its sister site, Cross Validated (http://stats.stackexchange.com/), focuses only on statistical analysis. A third site, Talk Stats (http://www.talkstats.com), also focuses on statistical analysis. At all three sites, users tag their topics, making it particularly easy to focus searches. Figure 1b shows the software people are discussing there.
We can see that the discussion of R is dramatically higher than the other packages, which don’t differ very much. Much of this difference is due to the influence of Stack Overflow, reflecting the vastly greater popularity of R as a programming language. However, even removing that effect, it is easy to see that R still dominates the discussions on the more statistically-oriented forums. This data is cumulative, but it would be very interesting to see how it grew by year. Without access to such data, at least we have the data in Fig. 1a to give us a feel for history.
Other popular discussion forum sites are LinkedIn.com and Quora.com. Neither of these sites makes it easy to count the number of posts, but they do display the number of people who have joined discussion groups (Figure 1c).
In Figure 1c we get a better view of corporate software use. I do not know the ratio of corporate to academic use of LinkedIn, but the academics I know (quite a few) use it very little. In this world, SAS is the leader with R close behind. It’s interesting to see SPSS with a 50% lead over Stata; it was also slightly higher in Fig. 1b. Remember that these are people who have joined a group, not necessarily people who are talking, as in the previous two figures. Still, group membership should be a reasonable proxy for popularity or market share.
In the coming weeks, I’ll be updating the data on which software scholars are using, the growth of R packages and what skills employers are seeking in their new hires.
Copyright 2013, Robert A. Muenchen
You should be able to get time series for Stack Overflow and Cross Validated tags by asking their Meta communities or their admins. Here’s one graph showing R against SAS (tool cited here). R currently has 3,000+ questions on Cross Validated, whereas SAS and Stata have fewer than 200 (and SPSS has a bit more than 300, which is interesting).
Fr.,
That’s great! I’ll look into adding that.
Thanks,
Bob
hello Bob,
I am a researcher in need of a small help in R.
If a graph is given as input in R, is there any tool or method by which I can get its slope? Let’s take ECG signals as input: I need to get the slope and sum it up.
Can you help with this question, or can you give me some contacts of people who can answer it?
Hope to receive your help.
Hi Terry,
Here are some digitizer programs that can convert a graph to data; you can then use various R tools to get the formula of the graph.
http://plotdigitizer.sourceforge.net/
http://www.digitizeit.de/
I hope that does what you need.
Cheers,
Bob
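Once a digitizer has turned the graph into (x, y) pairs, the slope between neighbouring points is a finite difference. The following is a minimal sketch in Python rather than R (in base R, `diff(y) / diff(x)` gives the same segment slopes). The sample values are hypothetical, not a real ECG trace, and the sketch assumes that "summing the slope" means adding up the per-segment slopes. Because signed slopes can cancel, the sum of absolute slopes is shown as well.

```python
def segment_slopes(xs, ys):
    """Return the slope of each straight segment between consecutive points."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two points to compute a slope")
    slopes = []
    for i in range(len(xs) - 1):
        dx = xs[i + 1] - xs[i]
        if dx == 0:
            raise ValueError("consecutive x values must differ")
        slopes.append((ys[i + 1] - ys[i]) / dx)
    return slopes


# Hypothetical digitized points: time in seconds, signal amplitude.
time_s = [0.0, 0.5, 1.0, 1.5, 2.0]
signal = [0.0, 1.0, 0.5, 0.5, 2.0]

slopes = segment_slopes(time_s, signal)
total_slope = sum(slopes)                      # signed slopes may cancel
total_abs_slope = sum(abs(s) for s in slopes)  # total steepness, sign ignored

print(slopes)
print(total_slope, total_abs_slope)
```

Which of the two totals is appropriate depends on what the summed slope is meant to capture, which the question does not specify.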
Definitely an interesting read. It would be interesting to see a comparison between R and Python though as I personally think these two are in direct competition (and exchange) in many different companies.
Hi Eric,
That’s covered in the main article, third section, entitled, “Language Popularity Measures”. That section is up-to-date as of February 2013 (as I write this in 2/13).
Cheers,
Bob
In my personal experience, R is exploding in popularity, especially since big players like the FDA have mentioned it by name as an allowable system for submitting data analyses. This actually coincides with a decrease I have been seeing in SAS, because a lot of the people who used to tout SAS said it was required for certain research areas. Now that more statistical packages are becoming viable, the high cost of SAS is holding back its adoption. SPSS maintaining a near-constant rate of adoption makes sense to me, since it is primarily pointed at academic researchers who like to look at their data in spreadsheet form.
Now that we have more and more programmers out there around the world, R is making more sense to a lot of people. As far as I am aware, it is the only free/low-cost statistics solution that is robust enough to support high-risk work and accepted enough to not throw up red flags in a research article.
Hi Dinre,
I agree. When the FDA accepts something, it’s a huge stamp of approval with implications far outside of the pharmaceutical market.
Cheers,
Bob
Why did traffic decline from 2010 to 2012?
Hi Ajay,
Nice hearing from you! Your book, R for Business Analytics, just arrived & it’s in my “to read” pile. I’ve read one good review of it already. Congratulations & thanks for mentioning me in the acknowledgements.
I believe the traffic on the main listservs (Fig. 1a) declined from 2010 to 2012 only because traffic moved to more user-friendly forums. I just added a graph (Fig. 1c – a new 1c) in the main article (http://bit.ly/statpop) that actually shows the amount of discussion on Stack Overflow skyrocketing in the last few years.
Cheers,
Bob
Hi, Interesting analysis, as always. I agree: the decrease in listserv traffic is probably attributable to movement to other forums. For SAS, over the past few years a lot of activity has shifted toward https://communities.sas.com/community/support-communities and http://www.sascommunity.org/wiki/Main_Page. Obviously, sites dedicated to SAS won’t help with your efforts to compare the popularity of different packages. But I would guess that if you combined posts on those sites with SAS listserv posts, the trend in SAS discussion level would look quite different.
Bob
What’s your review of the book I wrote?
Ajay
Hi Ajay,
Your book, R for Business Analytics, was swiped by my two graduate research assistants the moment it arrived & I haven’t seen it since. I need to go pry it out of their hands. I hope it’s selling well!
Cheers,
Bob