by Robert A. Muenchen
Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alpine, Alteryx, Angoss, C / C++ / C#, BMDP, FICO, IBM SPSS Statistics, IBM SPSS Modeler, InfoCentricity Xeno, Java, JMP, Lavastorm, Mathworks’ MATLAB, Megaputer’s PolyAnalyst, Minitab, NCSS, Python, R, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., SAP KXEN, TIBCO Spotfire, Stata, Statistica, Systat, WEKA / Pentaho.
I don’t attempt to differentiate among variants of languages such as R vs. Revolution R Enterprise, or SAS vs. the World Programming System (WPS) or Carolina, except when it is particularly easy, such as when comparing the companies’ PageRank figures.
Excluded from the list are products that focus on report writing (e.g. Cognos), or are tied to a specific database (e.g. Microsoft, Oracle, SAP’s HANA), specific hardware (e.g. Teradata, IBM PureData) or a specific application field.
The first section on jobs is currently the newest and covers the most extensive set of software. I’ll add more software to the later sections as soon as I can and announce the changes on Twitter where you can follow me as @BobMuenchen.
When choosing a tool for data analysis, now more commonly referred to as analytics, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste into a word processor vs. LaTeX integration)? Does it handle large enough data sets? Do your colleagues use it so you can easily share data and programs? Can you afford it?
There are many ways to measure popularity or market share and each has its advantages and disadvantages. Here they are, in approximate order of usefulness:
- Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
- Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
- Books – the number of books that include a software’s name in their titles is particularly useful information since writing one requires significant effort and publishers do their own study of market share before taking the risk of publishing. However, it can be difficult to construct searches that find books on general-purpose languages that also focus only on analytics.
- Website Popularity – the PageRank measure is objective data, and for sites that clearly focus on analytics, it’s unbiased and especially useful for weeding out the weaker software. However, so much market consolidation has occurred that now focused analytic tools like SPSS are listed under corporations with much broader interests (IBM in that case). In addition, for general-purpose software like Java, many of the sites that discuss programming and point to http://www.java.com have nothing to do with its use for analytics.
- Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. Unfortunately, this measure is very hard to collect except where sites exist to maintain such lists.
- Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling” in which the survey taker tries to widely distribute the link and then vendors vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the information content is often low, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires but a few mouse clicks, rather than the commitment required to place an advertisement or publish an article.
- Discussion Forum Activity – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others. While talk may be cheap, it’s still a good indicator of popularity.
- Programming Activity – some software development is focused into repositories such as GitHub. That allows people to count the number of lines of programming code written for each project in a given time period. This is an excellent measure of popularity since writing programs or changing them requires substantial commitment. However, very popular commercial software may not have much user development activity.
- Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
- IT Research Firm Reports – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software is now and where they’re headed.
- Sales or Download Measures – the commercial analytics field has undergone a major merger and acquisition phase so that now it is hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role and even the few packages that offer download figures are dicey at best.
- Competition Use – organizations that sponsor analytic competitions occasionally report what the winners tend to use. Unfortunately this information is only sporadically available.
- Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t track this measure and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.
One of the best ways to measure the popularity or market share of software for analytics is to count the number of job advertisements for each. Indeed.com is the biggest job site in the U.S. making its sample the best around. As their CEO and co-founder Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” To demonstrate just how dominant its lead is, a search for SPSS (on 2/19/14) showed more than ten times as many jobs on Indeed.com as on its well-known competitor, Monster.com. Indeed.com also has superb search capabilities and it even includes a tool for tracking long-term trends.
Searching for analytics jobs using Indeed.com can be easy, but it can also be very tricky. For much of the analytics software, it required only a simple search on the software’s name. However, for software that’s hard to locate (e.g. R) or that is general purpose (e.g. Java) it required complex searches and/or some rather tricky calculations which are described here. All of the graphs in this section use those procedures to make the required queries.
Figure 1a shows that Java is in the lead followed by SAS. Python and C/C++/C# are roughly tied for third place. The tie between C and Python is not surprising as many advertisements for analytics jobs that use programming mention both together. (The C variants are combined in a single search since job advertisements usually seek any of them.)
R resides in an interestingly large gap between the other domain-specific languages, SAS and SPSS. R has not only caught up with SPSS, but surpassed it with around 50% more job postings. MATLAB has many similarities to R so it’s interesting to see that it has only around half the job postings. Note that these are specific to analytics and MATLAB has many engineering jobs that are not counted in this total.
Much of the software had fewer than 250 jobs. When displayed on the same graph as the industry leaders, their job counts appeared to be zero. Therefore I have plotted them separately in Figure 1b. FICO comes out as the leader of this group, followed by Enterprise Miner. Statistica and Alteryx are close to tied at around 55 jobs. From RapidMiner on down, the decline in jobs is fairly smooth. Megaputer’s PolyAnalyst job count is actually zero.
It’s important to note that the values shown in Figures 1a and 1b are single points in time. The number of jobs for the more popular software does not change much from day to day. Therefore the relative rankings of the software shown in Figure 1a are unlikely to change much over the coming year. The less popular packages shown in Figure 1b have such low job counts that their ranking is likely to shift from month to month. In addition, each software has an overall trend that shows how the demand for jobs changes across the years. You can plot such trends using Indeed.com’s Job Trends tool. However, as before, focusing just on analytics jobs requires carefully constructed queries, and comparing two trends at a time means both queries have to fit within the query length limit allowed by Indeed.com. Those details are described here.
I’m particularly interested in trends involving R, so let’s look at a couple of comparisons. Figure 1c compares the number of analytics jobs available for R and SPSS across time. Analytics jobs for SPSS have not changed much over the years, while those for R have been steadily increasing. The jobs for R finally crossed over and exceeded those for SPSS toward the middle of 2012.
We know from Figure 1a that SAS is still far ahead of R in analytics job postings. How far does R have to go to catch up with SAS? Figure 1d provides one perspective. It would be nice to have the data to forecast when R’s growth curve will catch up with SAS’s, but Indeed.com does not provide the raw data. However, we can use the approximate slope of each line to get a rough estimate. If jobs for SAS stay level and those for R continue to grow linearly as they have since January 2010, then R will catch up in 3.35 years. If instead the decline in demand for SAS jobs that started in January of 2012 continues, then R will catch up in 1.87 years.
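The slope-based estimate described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: since Indeed.com does not provide the raw data, the levels and slopes below are hypothetical placeholders rather than values read from Figure 1d, and the function name `years_to_catch_up` is my own.

```python
def years_to_catch_up(leader_level, leader_slope, chaser_level, chaser_slope):
    """Years until two straight trend lines cross.

    Levels are the current values of each trend; slopes are change per year.
    Returns None if the gap is not shrinking under these assumptions.
    """
    gap = leader_level - chaser_level
    if gap <= 0:
        return 0.0  # the chaser is already level with or ahead of the leader
    closing_rate = chaser_slope - leader_slope
    if closing_rate <= 0:
        return None  # the chaser never catches up
    return gap / closing_rate


# Hypothetical numbers for illustration only (not taken from Figure 1d).
flat_leader = years_to_catch_up(leader_level=10.0, leader_slope=0.0,
                                chaser_level=4.0, chaser_slope=2.0)
declining_leader = years_to_catch_up(leader_level=10.0, leader_slope=-1.0,
                                     chaser_level=4.0, chaser_slope=2.0)
print(flat_leader, declining_leader)
```

With the same gap and the same growth for the chaser, a declining leader always gives the shorter estimate, which is the pattern in the two figures quoted above.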
A debate has been taking place on the Internet regarding the relative place of Python and R. Ironically, this debate about software to do data analytics has involved very little actual data. However it is possible now to at least study the job trends. Figure 1a showed us that Python is well out in front of R, at least on that single day the searches were run. What has the data looked like over time? The answer is in Figure 1e.
Note that Python appears to have less of an advantage in Figure 1e than it had in Figure 1a. The final point on the trend graph was collected only a few days after the queries used in Figure 1a were run, and the data changed very little in the meantime. The difference is due to the fact that Indeed.com has a limit on query length. Here is the query used for Figure 1e; it contains fewer analytic terms than the one used for Figure 1a.
R and ("big data" or "statistical analysis" or "data mining" or "data analytics" or "machine learning" or "quantitative analysis" or "business analytics" or "statistical software" or "predictive modeling") !"R D" !"A R" !"H R" !"R N" !toys !kids !" R Walgreen" !walmart !"HVAC R" !"R Bard" , python and ("big data" or "statistical analysis" or "data mining" or "data analytics" or "machine learning" or "quantitative analysis" or "business analytics" or "statistical software" or "predictive modeling")
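Because the same list of analytic terms appears in both halves of that query, it can be convenient to assemble the string programmatically. The following is a minimal sketch, not the procedure actually used for the figures: the helper `build_query` and the variable names are hypothetical, and the syntax (quoted phrases joined with "or", exclusions prefixed with "!", two trends separated by a comma) is simply copied from the query above.

```python
# Analytic terms shared by both halves of the Figure 1e query.
ANALYTIC_TERMS = [
    "big data", "statistical analysis", "data mining", "data analytics",
    "machine learning", "quantitative analysis", "business analytics",
    "statistical software", "predictive modeling",
]

# Exclusions for the R search, copied verbatim from the query above.
R_EXCLUSIONS = [
    '"R D"', '"A R"', '"H R"', '"R N"', "toys", "kids",
    '" R Walgreen"', "walmart", '"HVAC R"', '"R Bard"',
]


def build_query(name, terms, exclusions=()):
    """Return: name and ("term1" or "term2" ...) !exclusion1 !exclusion2 ..."""
    any_term = " or ".join(f'"{term}"' for term in terms)
    query = f"{name} and ({any_term})"
    if exclusions:
        query += " " + " ".join(f"!{item}" for item in exclusions)
    return query


r_query = build_query("R", ANALYTIC_TERMS, R_EXCLUSIONS)
python_query = build_query("python", ANALYTIC_TERMS)

# Two trends are compared by separating their queries with a comma.
trend_query = f"{r_query} , {python_query}"
print(trend_query)
print(len(trend_query), "characters")
```

Printing the length makes it easy to see how much room a pair of queries takes up when trimming the term list to fit the length limit.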
One last trend I considered was for Megaputer’s PolyAnalyst. Using the string “Megaputer PolyAnalyst” (“or” is implied) the trend line was completely flat at zero. I only include it here because Gartner considered Megaputer worth including in their Magic Quadrant for Advanced Analytics Platforms report of February 19, 2014.
The detailed description regarding the construction of all the queries used in Figures 1a through 1e is located here.
While Internet search engines make it very easy to locate information about software, their inclusive nature makes it difficult to narrow the search enough to determine the prevalence of various packages. For example, searching for the term “SAS” quickly locates the main web site for the SAS Institute, but it also ends up including many hits regarding a shoe company, an airline and the British commando group. Even within the realm of scholarly journal articles, S.A.S. stands for over a dozen terms such as Synthetic Aperture Sonar.
The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. Google Scholar offers a convenient way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The final set of search terms is described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/. Figure 2a shows the number of articles for the most popular six statistics packages from 1995 through 2012. SPSS had a surprising advantage over most other packages for much of this time. Its advantage seems suspiciously large, but after fairly extensive study of the articles returned by the search, it does not seem to be spurious.
Each year I collect the entire data set again, adding the previous complete year’s data to it. This ensures that the data are collected using the same Google algorithm, which is adjusted over time. The 2011 version of Fig. 2a contained a notable difference: the rate of decline in SPSS usage was leveling off between 2010 and 2011. This year’s search did not show that slowing and, in fact, the 2011 to 2012 change continued SPSS’ steep decline.
Use of SPSS and SAS in scholarly articles peaked in 2005 and 2008, respectively. The decline they have seen since may be due to competition from the other packages. The total of the other packages in 2011 is similar to the amount of decline that SPSS and SAS have seen since their peak. If the trends were to continue on their current trajectories, scholarly use of R could surpass the use of SAS and SPSS in 2013. That’s a very big “IF” of course! See this blog post for more discussion of this forecast.
Since SAS and SPSS still dominate scholarly use by such a wide margin, I removed those two packages and added JMP and Statistica as shown in Fig. 2b. That figure shows the rapid rise of all software except Statistica. Note that the symbols and colors used in Fig. 2b do not match those in 2a. From 2008 on, R reaches the #3 spot (after SPSS and SAS) and extends its lead in consecutive years.
Stata is also pulling away from the pack, which is interesting given the large number of similarities between R and Stata (see R for Stata Users, Ch. 1 for details). Systat is making a notable comeback after stagnating from roughly 1998-2002. The SPSS Corporation bought Systat in 1995 and did little to improve it. They sold it in 2002 to Cranes Software, which has been marketing it to academia at an aggressive price.
The extremely low usage of Statistica is notable because of its relatively high ranking in some polls such as the KDnuggets one shown in the surveys section below. The ranking achieved in polls is a function of how effective a company is in getting its customers to participate. All companies do their best to get the word out, and some may be much better at it than others. Google Scholar hits, on the other hand, are immune to such manipulations. Not shown is the Number Cruncher Statistical System (NCSS). Its use is so low that I’m not tracking it, but a manual search of the 2012 data showed that at 800 hits, its use was almost double that of Statistica at 465 hits.
The number of books published on each software reflects their relative popularity. Amazon.com offers an advanced search method which works well for all the software except R. I configured it with the following parameters:
Title: SAS -excerpt -chapter -changes [using SAS as an example]
Subject: Computers & Internet
Format: All formats
Publication Date: After September, 2001 [i.e. 10 years before the search on 10/13/2011]
Since it’s difficult to determine how many books use a particular software in its examples, I searched for books that included the software in the title. SAS has many manuals for sale as individual chapters or excerpts. Luckily, they contain “chapter” or “excerpt” in their title so I excluded them using the minus sign, e.g. “-excerpt”. SAS also has short “changes and enhancements” booklets that the other packages release only in the form of flyers and/or web pages so I excluded “changes” as well.
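The effect of those exclusions can be sketched as a simple title filter. This is only an approximation of the idea, assuming whole-word matching; it does not reproduce how Amazon's search actually evaluates the minus sign. The function name `title_counts` and the titles are made up for illustration and are not actual search results.

```python
import re


def title_counts(title, software, excluded=("excerpt", "chapter", "changes")):
    """True if the title names the software and contains no excluded word."""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return software.lower() in words and not words.intersection(excluded)


# Made-up titles for illustration only.
titles = [
    "An Introduction to SAS Programming",
    "SAS Procedures Guide: Chapter 12 Excerpt",
    "SAS Changes and Enhancements",
    "Data Analysis with SPSS",
]

sas_titles = [t for t in titles if title_counts(t, "SAS")]
print(len(sas_titles), sas_titles)
```

Only the first title is counted for SAS: the second and third contain excluded words, and the fourth does not name SAS at all.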
SAS and SPSS both have many versions of the same book or manual still for sale. For example, Marija Norusis’ 3 books on SPSS appear 20 times for various versions of SPSS released in the last 10 years. The SAS and SPSS numbers are both somewhat inflated as a result. Limiting the search to books published in the last 10 years mitigated this problem somewhat, but the SAS and SPSS figures are probably both still somewhat exaggerated.
The count of R books came from http://www.r-project.org/doc/bib/R-books.html. This list does contain seven books on S that are older but still relevant. Version numbers do not appear in any book titles so R avoids the over-counting problem that plagued my count of SAS and SPSS manuals. The most surprising aspect of the result (Figure 3) was how extremely dominant the top few packages are and that three well known packages had no books at all written about them (BMDP, Statistica, Systat). Revolution R and R-PLUS have no books with their names in the titles, but of course the books on R apply to them as well.
Another measure of software popularity is the number of other web pages that contain links that point to the software’s main web site. Figure 4 provides those numbers, recorded using Google on January 5, 2012.
Now that SPSS is part of IBM, it dominates the results. This reflects the wide range of products that IBM sells, including computer hardware and services that have nothing to do with data analysis. However, the older SPSS.com website no longer shows up early in a web search and the IBM site that it redirects to has a tiny incoming link measure since it is not meant to be a direct link.
R is next in line with a little over half of IBM’s measure, followed by SAS with well under R’s value. The other software follows in an order that I suspect reflects their respective market shares. Revolution R Enterprise and R-PLUS are commercial versions of R that are relatively new to the market. WPS is an implementation of the SAS Language and Carolina is a SAS-to-Java compiler.
The number of incoming links is an important part of Google’s famous PageRank algorithm (http://en.wikipedia.org/wiki/PageRank). PageRank is made more useful for searching by (among other things) weighting the importance of each link. Links from major sites like Wikipedia carry far more weight than a link from a professor’s course syllabus. The practical range of PageRank is from 1 to 10. Figure 9 plots this data (collected on January 4, 2012). The software appears in tiers, with the two dominant players, SAS and SPSS (IBM), at the highest, and their well-known alternatives one level down. I find it odd that Stata is not in this level. At the very bottom are the World Programming System (WPS) and Carolina, two companies that use the SAS language. There have been quite a few changes in this ranking since last year, with SAS, SPSS and Revolution Analytics moving up one point and R, Stata and Carolina moving down one point. The R-PLUS site maintained its PageRank of 5 this year, which is a bit surprising given that many of its links are broken, and it is in its fourth year of saying, “Be the first to get R-PLUS 3.3”.
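To make the idea of link weighting concrete, here is a minimal sketch of the simplified textbook form of PageRank, computed by repeated iteration. It is not Google's production algorithm, the tiny link graph is invented, and the scores it produces are probabilities that sum to 1 rather than the 1-to-10 values plotted in the figure. The damping value of 0.85 is the commonly quoted textbook default, used here as an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated iteration.

    links maps each page to the list of pages it links to; every page
    that is linked to must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its own rank among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Invented graph: each vendor has exactly one incoming link, but
# vendor_a's comes from a page that is itself heavily linked to.
links = {
    "encyclopedia": ["vendor_a"],
    "syllabus": ["vendor_b"],
    "hub1": ["encyclopedia"],
    "hub2": ["encyclopedia"],
    "hub3": ["encyclopedia"],
    "vendor_a": [],
    "vendor_b": [],
}
ranks = pagerank(links)
for page in ("vendor_a", "vendor_b"):
    print(page, round(ranks[page], 3))
```

Both vendor sites have the same raw count of incoming links, yet vendor_a ends up with the higher score because its single link comes from a well-linked page. That is the distinction between the raw link counts in Figure 4 and the PageRank values discussed here.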
On Internet blogs, people write about software that interests them, showing how to solve problems and interpreting events in the field. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. Therefore, the number of bloggers writing about analytics software has potential as a measure of popularity or market share. Unfortunately, counting the number of relevant blogs is often a difficult task. General purpose software such as Java, Python, the C language variants and MATLAB have many more bloggers writing about general programming topics than just analytics. But separating them out isn’t easy. The name of a blog and the title of its latest post may not give you a clue that it routinely includes articles on analytics.
Another problem arises from the fact that what some companies would write up as a newsletter, others would do as a set of blogs, where several people in the company each contribute their own blog, but they’re also combined into a single company blog. Statsoft and Minitab offer examples of this. So what’s really interesting is not company employees who are assigned to write blogs, but rather those written by outside volunteers. In a few lucky cases, lists of such blogs are maintained, usually by blog consolidators, who combine many blogs into a large “metablog.” All I have to do is find such lists and count the blogs. I don’t attempt to extract the few vendor employees that I know are blended into such lists. I only skip those lists that are exclusively employee-based (or very close to it). The results are shown in Table 1.
| Software | Number of Blogs | Source |
|---|---|---|
| R | 550 | R-Bloggers.com |
| Python | 60 | SciPy.org |
| SAS | 40 | PROC-X.com, sasCommunity.org Planet |
| Stata | 11 | Stata-Bloggers.com |
Table 1. Number of blogs devoted to each software package on April 7, 2014, and the source of the data.
R’s 550 blogs is quite an impressive number. For Python, I could only find that list of 60 that were devoted to the SciPy subroutine library. Some of those likely cover topics besides analytics, but determining which never cover the topic would be quite time consuming. The 40 blogs about SAS is still an impressive figure given that Stata was the only other software that even garnered a list anywhere. That list is maintained by the vendor itself, StataCorp, but it consists of non-employees except for one.
While searching for lists of blogs on other software, I did find individual blogs that at least occasionally covered a particular topic. However, keeping this list up to date is far too time consuming given the relative ease with which other popularity measures are collected.
If you know of other lists of relevant blogs, please let me know and I’ll add them. If you’re a software vendor employee reading this, and your company does not build a metablog or at least maintain a list of your bloggers, I recommend taking advantage of this important source of free publicity.
Discussion Forum Activity
There are some stable and objective measures regarding analytic software. Schwartz (2009) suggested estimating relative popularity by plotting the amount of email discussion devoted to each. The most widely used packages all have discussion lists, or “listservs”, devoted to them. The less popular ones either do not have such discussions or, like the lists for Minitab or S-PLUS, may have only a dozen or so emails per year. Some software packages have multiple discussion lists. For example, there are 25 devoted to using R (http://www.r-project.org/mail.html). Topics range from general help to various focused areas such as graphics, mapping, ecology, epidemiology, etc. A broader list, including a version of R-Help in Spanish, lists 48 discussions (https://stat.ethz.ch/mailman/listinfo).
Figure 5a shows the yearly level of activity on only the main discussion listserv for each package (i.e. forums, news groups and Google groups are excluded). Each point represents the sum of the 12 monthly counts that occurred in that year. This plot contains data through the end of 2012. If you read this article in previous years, this plot used to display the mean number of emails per month rather than the sum. Therefore the scale of the y-axis is different but the relative locations of the points are virtually identical. I made this change to enable a better comparison to discussion forums (e.g. Fig. 5b).
We can see that discussion of R has grown the most rapidly and, for the past few years, R is the most discussed software by an almost two-to-one margin. In recent years, it is followed by Stata, SAS and SPSS, respectively.
Stata showed steady discussion growth until it passed SAS in 2010.
SAS saw rapid growth in its discussion until 2006 when it leveled off and then declined. That decline coincided with the strong growth of both R and Stata, offering competition to SAS.
SPSS held steady at a low rate across the time frame, which may be attributable to its great ease of use relative to the other packages. With both the interface and the documentation aimed at people who prefer GUIs over programming, there’s less need to ask how to do variations on an analysis. In fact, there’s less ability to do such variations. As a result, I doubt SPSS’ low showing in this graph is indicative of its popularity or market share.
It would be interesting to see what topics were most discussed on each list. The only such analysis of which I am aware was done by Arthur Tabachnek (2010) for the SAS list. The most popular topic in 2009 turned out to be…R! You can read his full analysis here under slides from the 2010 session.
From 2011 onward, R and Stata joined SAS in the decline in listserv discussion. Given the sharp increase in the popularity of business analytics, Big Data, and so on, it is unlikely that people are using or talking about these tools less. Instead, alternative forums of discussion have appeared. The site Stack Overflow (http://stackoverflow.com) covers a wide range of programming and statistical topics, while its sister site, Cross Validated (http://stats.stackexchange.com/), focuses only on statistical analysis. A third site, Talk Stats (http://www.talkstats.com), also focuses on statistical analysis. At all three sites, users tag their topics making it particularly easy to focus searches. Figure 5b shows the software people are discussing there.
We can see that the discussion of R is dramatically higher than the other packages, which don’t differ very much among themselves. Much of this difference is due to the influence of Stack Overflow, reflecting the vastly greater popularity of R as a programming language. However, even removing that effect, it is easy to see that R still dominates the discussions on the more statistically-oriented forums. This data is cumulative, but we can get a yearly view of just two of the tags: R and SAS (Figure 5c).
We see that discussion of SAS and R was roughly comparable until mid-2009 when the discussion of R began its very rapid climb. The page that provides this data does not display data for SPSS or Stata. The amount of data may be too low; no message provides the reason (see http://hewgill.com/~greg/stackoverflow/stack_overflow/tags).
Other popular discussion forum sites are LinkedIn.com and Quora.com. Neither of these sites makes it easy to count the number of posts, but they do display the number of people who have joined discussion groups (Figure 5d).
In Figure 5d we get a better view of corporate software use. I do not know the ratio of corporate to academic use of LinkedIn, but the academics I know (quite a few) use it very little. In this world, SAS is the leader with R close behind. It’s interesting to see SPSS with a 50% lead over Stata; it was also slightly higher in Fig. 5b. Remember these are people who have joined a group, not necessarily people who are talking, as in the previous two figures. Still, group membership should be a reasonable proxy for popularity or market share.
This section is planned for future expansion. Stay tuned.
The TIOBE Community Programming Index ranks the popularity of programming languages, but from a programming language perspective rather than as analytical software (http://www.tiobe.com). It extracts measurements from blogs, entries in Wikipedia, books on Amazon, and search engine results, and combines them into a single index. In January 2012, they ranked R in 24th place and SAS at 31st. However, by February 2014, the two had reversed positions with SAS in 21st place and R in 44th.
The only other languages that focus on data analysis and rank in the top 100 are S and S-PLUS (R is an implementation of the S language, as is S-PLUS). In previous years SPSS ranked in the 50-100 group but by February of 2013 it had dropped out (and was still out in February 2014).
The Transparent Language Popularity Index is very similar to the TIOBE Index except that its ranking software, algorithm and data are published for all to see. Their latest figures on 2/15/2014 were from July of 2013, at which time it ranked R in 14th place and SAS in 31st. This index also ranks R as a scripting language, where it is in 6th place after tools like PHP, Python and Perl. SAS is also ranked 5th in the “Other” category, when compared to languages such as COBOL or PL/SQL. While these two additional areas may seem irrelevant to data analysis, it’s good to know both these tools have more flexibility than most other domain-specific languages which focus on data analysis.
Langpop.com also ranks programming languages (http://langpop.com/) in a variety of interesting ways, but unfortunately their focus excludes statistical software.
Surveys of Use
One way to estimate the relative popularity of data analysis software is through a survey. Rexer Analytics does a survey every other year asking a wide range of questions regarding data mining. Figure 6a shows the results of the question about the tools that respondents reported using in 2013.
We see that R comes out on top by a wide margin, with 70% of data miners using it. SPSS, RapidMiner, SAS and Weka all follow with only around 30% of users. The entire report contained over 40 questions on topics such as algorithms used, fields, challenges, data, impact of the economy on the field, and more. It’s interesting to note that while this survey is aimed at data miners, SPSS and SAS are used more often than their more expensive products aimed specifically at data mining, IBM SPSS Modeler and SAS Enterprise Miner.
The results of a similar survey done by the data mining web site KDnuggets in 2013 are shown in Figure 6b. This one shows RapidMiner in first place with 39.2% of users reporting having used it for a real project. R follows closely behind with 37.4% of users. There’s quite a large gap in which Excel resides, with 28% of users. Weka/Pentaho and Python are tied at 14.3%, followed by the rest. Note that both RapidMiner and KNIME are listed twice, once in their free version and again in commercial. While it’s tempting to add these two, there may be overlap for organizations that use both.
It’s interesting to note that four of the top five packages used were open source.
The KDnuggets site conducted a similar poll, this time asking, “What programming languages you used for data mining / data analysis in the past 12 months?” R dominated this poll with over 60% of respondents, as shown in Figure 6c. Python and SQL followed with around 37% each.
O’Reilly Media conducted a survey in 2012 and 2013 of attendees of the Strata Conference: Making Data Work and Strata + Hadoop World. The top bar in Figure 6d shows that 57% of respondents listed some form of data analysis as their primary job. SQL is listed as the top tool with 71% of respondents using it. This indicates that most attendees stored their data in relational databases. R came out as the top tool for advanced analytics, with 43% of respondents using it. Given that some respondents were attending Hadoop World, it’s not surprising that both Hadoop and the related Mahout came out higher here than in most other measures of popularity discussed in this paper. The fact that SAS/SPSS came out at the bottom is another clear indication that this is not a random sample of analytics users.
Lavastorm, Inc. conducted a survey of analytic communities including LinkedIn’s Lavastorm Analytics Community Group, Data Science Central and KDnuggets. The results were published in March, 2013, and the bar chart of “self-service analytic tool” usage among their respondents is shown in Figure 6e. Excel comes out as the top tool, with 75.6% of respondents reporting its use. While other surveys show Excel use similarly high, some don’t include it at all, leaving us to wonder what survey researchers and respondents were thinking regarding this relatively low-powered tool.
R comes out as the top advanced analytics tool with 35.3% of respondents, followed closely by SAS. MS Access’ position in 4th place is a bit of an outlier as no other surveys include it at all. Lavastorm comes out with a much higher market share (3.4%) than the KDnuggets poll indicated (0.4%), but that’s hardly a surprise given that the survey was aimed at Lavastorm’s LinkedIn community group.
IT Research Firms
IT research firms study software products and corporate strategies and provide their opinions on each in reports they sell to their clients. Each company has its own criteria for rating companies, so they don’t always agree, but I find the reports extremely interesting reading.
Gartner, Inc. is one of the companies that provides such reports. The “Magic Quadrant” from their report, Advanced Analytics for Business Analysts is shown in Figure 7a. Since it rates companies, strictly open source software such as R, Python and Java is not shown. IBM (SPSS) and SAS are the leading companies, which comes as no surprise. I was, however, surprised by the inclusion of RapidMiner and KNIME in the Leaders’ quadrant. Both products are available in free, open source versions and in commercial versions. Both have also done very well in user surveys (see the Surveys of Use section above). However, user surveys and analyst reports often disagree.
Another surprise in this figure is the inclusion of Megaputer (PolyAnalyst). There were actually zero jobs for that product in Figure 1b. But they’re shown as being in the same neighborhood as Oracle and Microsoft!
Revolution Analytics, a company whose main business is providing a commercial version of R, is on the edge of the Leaders’ quadrant. The full report provides an insightful analysis of the strengths and weaknesses of each company’s offerings.
Thanks to Alteryx, Inc. the 2014 Gartner Group report on Advanced Analytics is available here. Note that the links that provide such reports for free tend to expire often but you can usually find each by searching the internet for the report names.
Forrester Research, Inc. is another company that provides similar reports. Its “Wave” plot from the report “The Forrester Wave: Big Data Predictive Analytics Solutions, Q1 2013” is shown in Figure 7b.
Again, IBM (SPSS) and SAS are the strongest companies, but that seems to be all that the two reports agree on! The reports emphasize different aspects of the companies being rated, which accounts for the radically different plots. You can read the full Forrester report, compliments of SAP, here.
Sales & Downloads
Sales figures reported by some commercial vendors include products that have little to do with analysis. Many vendors don’t release sales figures, or they release them in a form that combines many different products, making the examination of a particular product impossible. For open source software such as R (Ihaka and Gentleman 1996) you could count downloads, but one confused person can download many copies, inflating the total. Conversely, many people can use a single download on a server, deflating it.
Download counts for the R-based Bioconductor project are located at http://www.bioconductor.org/packages/stats/. Similar figures for downloads of Stata add-ons (not Stata itself) are available at http://fmwww.bc.edu/fmrc/reports/Report.SSC.html. A list of Stata repositories is available at http://stata.com/links/resources2.html. The many sources of downloads, both in repositories and on individuals’ web sites, make counting downloads a very difficult task.
Kaggle.com is a web site that sponsors data analysis contests. People post data analysis problems there along with the amount of money they are willing to pay the person or team who solves their problem the best. Figure 8 shows the software used by the data analysts working on the problems. R is in the lead by a wide margin. R’s dominance is even greater among the contest winners, over 50% of whom used R. A potential source of bias in these figures is that the licenses of most proprietary software prohibit its use for the benefit of outside organizations (universities can help federal grant-providing agencies such as NSF and NIH, but cannot even solve problems for government agencies in general or nonprofits). However, I manage the research software site licenses at the University of Tennessee, and I can attest to the fact that people are often unaware of this limitation. (Note that as of 4/11/2014 this graph of 2011 data is still Kaggle’s most current graph.)
Growth in Capability
The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.
Figure 9 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.998). The right-most point is for version 3.0.2, the last version released in 2013.
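For readers who want to try this kind of fit on their own counts, here is a minimal sketch of a quadratic least-squares fit and its R-squared using only the Python standard library. It is not the procedure used to produce Figure 9, and the data points in the example are synthetic placeholders generated from a formula, not the CRAN package counts. All function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(xs, ys):
    """Return (c0, c1, c2) for the least-squares fit y = c0 + c1*x + c2*x^2."""
    # Normal equations are built from sums of powers of x.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] for i in range(3)]
    return solve_linear_system(a, t)


def r_squared(xs, ys, coefs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    c0, c1, c2 = coefs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (c0 + c1 * x + c2 * x * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic placeholder data: a quadratic trend with a small alternating
# perturbation. These are NOT the package counts plotted in Figure 9.
xs = list(range(10))
ys = [5 + 2 * x + 3 * x * x + 4 * (-1) ** x for x in xs]
coefs = fit_quadratic(xs, ys)
print("coefficients:", coefs)
print("R-squared:", r_squared(xs, ys, coefs))
```

A high R-squared for a quadratic says only that the curve tracks the observed points closely; it does not by itself say the same curve will hold for future versions.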
To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contains around 1,200 commands that are roughly equivalent to R functions (procs, functions etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2013, R added 835 packages, counting only CRAN, or approximately 17,390 functions (at an average of roughly 20.8 functions per package). During 2013 alone, R added more functions/procs than SAS Institute has written in its entire history!
Of course SAS and R commands are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do. However, R functions can nest inside one another, creating nearly infinite combinations. Also, SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would be happy to list it here. While the comparison is not perfect, it does provide an interesting perspective on the size and growth rate of R.
As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Figure 9. A program run on 4/11/2014 counted 7,380 R packages at all major repositories, 5,339 of which were at CRAN. So the growth curve for the software at all repositories would be roughly 38% higher on the y-axis than the one shown in Figure 9. The total growth in R functions for 2013 was approximately 17,390 * 1.38 or 23,998.
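The scaling step above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation as described in the text. The variable names are illustrative, and it assumes, as the text does, that packages outside CRAN contain about as many functions on average as CRAN packages.

```python
# Counts quoted in the text (program run on 4/11/2014).
ALL_REPOSITORY_PACKAGES = 7380
CRAN_PACKAGES = 5339
CRAN_FUNCTIONS_ADDED_2013 = 17390

# Ratio of packages at all major repositories to packages at CRAN alone.
ratio = ALL_REPOSITORY_PACKAGES / CRAN_PACKAGES
percent_higher = (ratio - 1) * 100

# The text rounds the ratio to two decimals before scaling.
rounded_ratio = round(ratio, 2)
functions_added_all = round(CRAN_FUNCTIONS_ADDED_2013 * rounded_ratio)

print(f"All repositories hold {percent_higher:.1f}% more packages than CRAN")
print(f"Estimated functions added in 2013: {functions_added_all:,}")
```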
As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. Those are not easily counted.
What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN. They indicate that there is an average of 20.826 functions per package. Since a program on 4/7/2014 counted 7,364 R packages at all major repositories, on that date there were approximately 153,696 total functions in R, over an order of magnitude more than commands in SAS.
I previously included a section on Google Trends. That site tracks not what’s actually on the Internet via searches, but rather the keywords and phrases that people are entering into their Google searches. Those data ended up being so variable as to be essentially worthless. For an interesting discussion of this topic, see this article by Rick Wicklin.
[This section needs an overhaul due to the new software added 2/20/14.] I’m interested in other ways to measure software popularity. If you have any ideas on the subject, please contact me at email@example.com.
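As a rough check on the function estimates in this section, the sketch below multiplies package counts by the quoted average of 20.826 functions per package. The function name is illustrative. Note that the quoted total of 153,696 is what this product gives for the 7,380 packages counted on 4/11/2014; for 7,364 packages the same product is about 153,363.

```python
FUNCTIONS_PER_PACKAGE = 20.826  # average quoted from the Rdocumentation site


def estimate_functions(package_count, per_package=FUNCTIONS_PER_PACKAGE):
    """Estimate a function count as packages times average functions per package."""
    return round(package_count * per_package)


# Package counts quoted in the text.
print("Added on CRAN in 2013 (835 packages):", estimate_functions(835))
print("All repositories, 7,380 packages:", estimate_functions(7380))
print("All repositories, 7,364 packages:", estimate_functions(7364))
```

Because the estimate rests on a single average, it is best read as an order-of-magnitude figure rather than an exact count.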
If you are a SAS or SPSS user interested in learning more about R, you might consider my book, R for SAS and SPSS Users. Stata users might want to consider reading R for Stata Users, which I wrote with Stata guru Joe Hilbe. I also teach workshops quarterly on these topics with Revolution Analytics.
I am grateful to the following people for their suggestions that improved this article: John Fox (2009) provided the data on R package growth; Marc Schwartz (2009) suggested plotting the amount of activity on e-mail discussion lists; Duncan Murdoch clarified the pitfalls of counting downloads; Martin Weiss pointed out how to query Statalist for its number of subscribers; Christopher Baum provided information regarding counting Stata downloads; John (Jiangtang) HU suggested I add more detail from the TIOBE index; Andre Wielki suggested the addition of SAS Institute’s support forums; Kjetil Halvorsen provided the location of the expanded list of Internet R discussions; Dario Solari and Joris Meys suggested how to improve Google Insight searches; Keo Ormsby provided useful suggestions regarding Google Scholar; Karl Rexer provided his data mining survey data; Gregory Piatetsky-Shapiro provided his KDnuggets data mining poll; Tal Galili provided advice on blogs and consolidation, as well as Stack Exchange and Stack Overflow; Patrick Burns provided general advice; Nick Cox clarified the role of Stata’s software repositories and of popularity itself; Stas Kolenikov provided the link to the list of known Stata repositories; Rick Wicklin convinced me to stop trying to get anything useful out of Google Insights; Drew Schmidt automated the collection of the data in Figures 7a and 7b; Francois Briatte provided the link that creates Figure 1c; Rasmus Bååth provided the median number of functions in an R package.
J. Fox. Aspects of the Social Organization and Trajectory of the R Project. R Journal, 2009. http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.
R. Muenchen. R for SAS and SPSS Users. Springer, 2009.
R. Muenchen and J. Hilbe. R for Stata Users. Springer, 2010.
M. Schwartz, 1/7/2009, http://tolstoy.newcastle.edu.au/R/e6/help/09/01/0517.html
BMDP, Carolina, JMP, Minitab, R-PLUS, Revolution R, SAS, SAS Enterprise Miner, IBM SPSS Modeler, IBM SPSS Statistics, Stata, Statistica, Systat and WPS are registered trademarks of their respective companies.
Copyright 2010-2014 Robert A. Muenchen, all rights reserved.