by Robert A. Muenchen
This article, formerly known as The Popularity of Data Analysis Software, presents various ways of measuring the popularity or market share of software for advanced analytics. Such software is also referred to as tools for data science, statistical analysis, machine learning, artificial intelligence, predictive analytics, and business analytics and is also a subset of business intelligence.
Updates: The most recent update was to the Job Advertisement section on 10/10/2022. Other sections date to 2019. I announce the updates to this article on Twitter: http://twitter.com/BobMuenchen
When choosing a tool for data analysis, now more commonly referred to as analytics or data science, there are many factors to consider:
- Does it run natively on your computer?
- Does the software provide all the methods you need? If not, how extensible is it?
- Does its extensibility use its own unique language or an external one (e.g. Python, R) that is commonly accessible from many packages?
- Does it fully support the style (programming, menus and dialog boxes, or workflow diagrams) that you like?
- Are its visualization options (e.g., static vs. interactive) adequate for your problems?
- Does it provide output in the form you prefer (e.g., cut & paste into a word processor vs. LaTeX integration)?
- Does it handle large enough data sets?
- Do your colleagues use it so you can easily share data and programs?
- Can you afford it?
There are many ways to measure popularity or market share, and each has its advantages and disadvantages. In rough order of the quality of the data, these include:
- Job Advertisements
- Scholarly Articles
- IT Research Firm Reports
- Surveys of Use
- Discussion Forum Activity
- Programming Popularity Measures
- Sales & Downloads
- Competition Use
- Growth in Capability
Let’s examine each of them in turn.
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what will become more popular in the future.
Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As their co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g., scikit-learn, Apache Spark), while others are used in data science jobs and, more broadly, in report-writing jobs (e.g., SAS, Tableau). General-purpose languages (e.g., Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that require them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the results in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on October 5, 2022. To measure percent change, I compare that to data collected on May 27, 2019. One might think that counts sampled on a single day would not be very stable, but they are. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002. I occasionally double-check some counts a month or so later and always get similar figures.
The number of jobs covers a very wide range from zero to 164,996, with a mean of 11,653.9 and a median of 845.0. The distribution is so skewed that placing them all on the same graph makes reading values difficult. Therefore, I split the graph into three, each with a different scale. A single plot with a logarithmic scale would be an alternative, but when I asked some mathematically astute people how various packages compared on such a plot, they were so far off that I dropped that approach.
Figure 1a shows the most popular tools, those with at least 10,000 jobs. SQL is in the lead with 164,996 jobs, followed by Python with 150,992 and Java with 113,944. Next comes a set from C++/C# at 48,555, slowly declining to Microsoft’s Power BI at 38,125. Tableau, one of Power BI’s major competitors, is in that set. Next come R and SAS, both around 24K jobs, with R slightly in the lead. Finally, we see a set slowly declining from MATLAB at 17,736 to Scala at 11,473.
Figure 1b covers tools for which there are between 250 and 10,000 jobs. Alteryx and Apache Hive are at the top, both with around 8,400 jobs. There is quite a jump down to Databricks at 6,117 then much smaller drops from there to Minitab at 3,874. Then we see another big drop down to JMP at 2,693 after which things slowly decline until MLlib at 274.
The least popular set of software, those with fewer than 250 jobs, is displayed in Figure 1c. It begins with DataRobot and SAS’ Enterprise Miner, both near 182. That’s followed by Apache Mahout with 160, WEKA with 131, and Theano at 110. From RapidMiner on down, there is a slow decline until we finally hit zero at WPS Analytics. The latter is a version of the SAS language, so advertisements are likely to always list SAS as the required skill.
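The three-way split described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to draw the figures: the dictionary holds just the counts quoted in the text (a subset of all the tools), and the function name `assign_panel` is a hypothetical helper. The cutoffs of 10,000 and 250 jobs are the ones stated for Figures 1a through 1c.

```python
# Job counts quoted in the text (collected October 5, 2022).
# This is only a subset of the tools plotted in Figures 1a-1c.
job_counts = {
    "SQL": 164_996,
    "Python": 150_992,
    "Java": 113_944,
    "C++/C#": 48_555,
    "Power BI": 38_125,
    "MATLAB": 17_736,
    "Scala": 11_473,
    "Databricks": 6_117,
    "Minitab": 3_874,
    "JMP": 2_693,
    "MLlib": 274,
    "Apache Mahout": 160,
    "WEKA": 131,
    "Theano": 110,
    "WPS Analytics": 0,
}


def assign_panel(jobs):
    """Return the figure panel for a job count, using the article's cutoffs."""
    if jobs >= 10_000:   # "at least 10,000 jobs"
        return "1a"
    if jobs >= 250:      # "between 250 and 10,000 jobs"
        return "1b"
    return "1c"          # "fewer than 250 jobs"


# Group tools by panel, most popular first within each panel.
panels = {"1a": [], "1b": [], "1c": []}
for tool, jobs in sorted(job_counts.items(), key=lambda kv: kv[1], reverse=True):
    panels[assign_panel(jobs)].append(tool)

for name, tools in panels.items():
    print(name, tools)
```

Because each panel gets its own axis, values within a panel stay readable on a linear scale even though the full range spans five orders of magnitude.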
Several tools use the powerful yet easy workflow interface: Alteryx, KNIME, Enterprise Miner, RapidMiner, and SPSS Modeler. The scale of their counts is too broad to make a decent graph, so I have compiled those values in Table 1. There we see Alteryx is extremely dominant, with 30 times as many jobs as its closest competitor, KNIME. The latter is around 50% greater than Enterprise Miner, while RapidMiner and SPSS Modeler are tiny by comparison.
Let’s take a similar look at packages whose traditional focus was on statistical analysis. They have all added machine learning and artificial intelligence methods, but their reputation still lies mainly in statistics. We saw previously that when we consider the entire range of data science jobs, R was slightly ahead of SAS. Table 2 shows jobs with only the term “statistician” in their description. There we see that SAS comes out on top, though with such a tiny margin over R that you might see the reverse depending on the day you gather new data. Both are over five times as popular as Stata or SPSS and ten times as popular as JMP. Minitab seems to be the only remaining contender in this arena.
|Software||Jobs only for “Statistician”|
Next, let’s look at the change in jobs from the 2019 data to now (October 2022), focusing on software that had at least 50 job listings back in 2019. Without such a limitation, software that increased from 1 job in 2019 to 5 jobs in 2022 would have a 400% increase but still would be of little interest. Percent change ranged from -64.0% to 2,479.9%, with a mean of 306.3 and a median of 213.6. There were two extreme outliers: IBM Watson, with job growth of 2,480%, and Databricks, at 1,323%. Those two were so much greater than the rest that I left them off of Figure 1d to keep them from compressing the remaining values beyond legibility. The rapid growth of Databricks has been noted elsewhere. However, I would take IBM Watson’s figure with a grain of salt as its growth in revenue seems nowhere near what Indeed.com’s job figure seems to indicate.
The remaining software is shown in Figure 1d, where those whose job market is “heating up” or growing are shown in red, while those that are cooling down are shown in blue. The main takeaway from this figure is that nearly the entire data science software market has grown over the last 3.5 years. At the top, we see Alteryx, with a growth of 850.7%. Splunk (702.6%) and Julia (686.2%) follow. To my surprise, FORTRAN follows, having gone from 195 jobs to 1,318, yielding growth of 575.9%! My supercomputing colleagues assure me that FORTRAN is still important in their area, but HPC is certainly not growing at that rate. If any readers have ideas on why this could occur, please leave your thoughts in the comments section below.
SQL and Java are both growing at around 537%. From Dataiku on down, the rate of growth slows steadily until we reach MLlib, which saw almost no change. Only two packages declined in job advertisements, with WEKA at -29.9% and Theano at -64.1%.
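The growth figures in this section follow the usual percent-change formula, (new - old) / old, applied only to tools that had at least 50 listings in 2019. Here is a minimal sketch in Python, assuming that formula. The function names and `ExampleTool` are hypothetical, and FORTRAN is the only tool for which the text quotes both raw counts.

```python
MIN_BASELINE_JOBS = 50  # the article's cutoff for 2019 job listings


def percent_change(old, new):
    """Percent change from old to new, relative to old."""
    if old <= 0:
        raise ValueError("baseline count must be positive")
    return (new - old) / old * 100.0


def growth_table(counts_2019, counts_2022, min_baseline=MIN_BASELINE_JOBS):
    """Percent change per tool, skipping tools below the baseline cutoff."""
    rows = []
    for tool, old in counts_2019.items():
        if old < min_baseline or tool not in counts_2022:
            continue  # tiny baselines give large but uninformative percentages
        rows.append((tool, round(percent_change(old, counts_2022[tool]), 1)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# FORTRAN's counts come from the text; ExampleTool is a made-up entry that
# stands in for the "1 job to 5 jobs" case and is dropped by the cutoff.
counts_2019 = {"FORTRAN": 195, "ExampleTool": 1}
counts_2022 = {"FORTRAN": 1_318, "ExampleTool": 5}

print(growth_table(counts_2019, counts_2022))  # [('FORTRAN', 575.9)]
```

The cutoff is applied to the baseline year rather than the current one, since a small denominator is what inflates the percentage.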
Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant amounts of effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even as an object of study.
Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year, I collect data again for the previous years (with one exception noted below).
Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 1,700 articles) in the most recent complete year, 2018. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 3/28/2019.
SPSS is by far the most dominant package, as it has been for over 20 years. This may be due to its balance between power and ease of use. R is in second place with around half as many articles. It offers extreme power, though with less ease of use. SAS is in third place, with a slight lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. Note that the general-purpose languages (C, C++, C#, FORTRAN, Java, MATLAB, and Python) are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.
The next group of packages goes from Python through C, with usage declining slowly. The next set starts at Caffe, dropping nearly 50%, and continuing to IBM Watson with a slow decline.
The last two packages in Figure 2a are Weka and Theano, which are quite a drop from IBM Watson, though it’s getting harder to see as the lines shrink.
To continue on this scale would make the remaining packages all appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 1,700 rather than the 80,000 used on Figure 2a.
I chose to begin Figure 2b with software that has fewer than 1,700 articles because it allows us to see RapidMiner and KNIME on the same scale. They are both workflow-driven tools with very similar capabilities. This plot shows RapidMiner with 49% greater usage than KNIME. RapidMiner uses more marketing, while KNIME depends more on word-of-mouth recommendations and a more open-source model. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, IBM’s SPSS and SAS. Given that SPSS has roughly 50 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newer packages is growing, while the use of the older ones is shrinking quite rapidly.
Figure 2b also lets us see IBM’s SPSS Modeler, SAS Enterprise Miner, and Alteryx on the same plot. These three are also workflow-driven tools that are quite expensive. None are doing as well here as RapidMiner or KNIME, tools that are much less expensive – or free – depending on how you use them (KNIME desktop is free, but server is not; RapidMiner is free for analyzing fewer than 10,000 cases).
Another interesting comparison in Figure 2b is JASP and jamovi. Both are open-source tools that focus on statistics rather than machine learning or artificial intelligence. They both use graphical user interfaces (GUIs) in a style that is similar to SPSS. Both also use R behind the scenes to do their calculations. JASP emphasizes Bayesian Analysis and hides its R code; jamovi has a more frequentist orientation, it lets you see its R code, and it lets you execute your own R code directly from within it. JASP currently has nine times as many citations here, though jamovi’s use is growing much more rapidly.
Even newer on the GUI for R scene is BlueSky Statistics, which doesn’t appear on the plot at all since it has zero scholarly articles so far. It was created by a new company and only adopted an open-source model a few months ago.
While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time-consuming. What I’ve done instead is collect data only for the past two complete years, 2017 and 2018. This provides the data needed to study year-over-year changes.
Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side); the declining or “cooling” packages are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 1,000 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth but is still of little interest.
The recent changes in data science software can be summarized succinctly: AI/ML up; statistics down. The software that is growing contains none of the packages that are associated more with statistical analysis. The software in decline is dominated by the classic packages of statistics: SPSS Statistics, SAS, GraphPad Prism, Stata, Statgraphics, R, Statistica, Systat, and Minitab. JMP is the only traditional statistics package whose scholarly usage is growing. Of the machine learning software that’s declining in usage, there are rough equivalents that are growing (e.g., Mahout down, Spark up).
Of course, another summary is: cheap (or free) up; expensive down. Of the growing packages, 13 out of 17 are available in open source. Of those in decline, only 5 out of 13 are open source.
Statistics software has been around much longer than AI/ML software, starting back in the days before open source. Stat vendors have been adding AI/ML methods to their software, making them more comprehensive solutions. The AI/ML vendors or projects are missing an opportunity to add more comprehensive statistics capabilities. Some, such as RapidMiner and KNIME, are expanding in this direction, but very slowly.
At the top of Figure 2c, we see that the deep learning packages Keras and TensorFlow are the fastest-growing at nearly 150%. PyTorch is not shown here because it did not have enough usage in the previous year. However, its citation rate went from 616 to 4,670, a substantial 658% growth rate! There are other packages that are not shown here, including JASP, with 223% growth, and jamovi, with 720% growth. Despite such high growth, the latter still only has 108 citations in 2018. The rapid growth of JASP and jamovi lends credence to the perspective that the overall pattern of change shown in Figure 2c may be more of a result of free vs. expensive software. Neither of them offers any AI/ML features.
Scikit Learn, the Python machine learning library, was a fast grower with a 60% increase.
In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot, we see that KNIME is growing slightly (5.7%) while RapidMiner is declining slightly (1.8%).
The biggest losers in Figure 2c are SPSS, down 39%, and SAS, Prism, and Mahout, all down 24%. Even R is down 13%. Recall that Figure 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use, and R and SAS are still the #2 and #3 most widely used packages in this arena.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d, I have plotted the same scholarly-use data for 1995 through 2016.
SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009, and its use is in sharp decline. SAS never came close to SPSS’s level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.
In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.
Figure 2e makes it easy to see that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 58 out of over 100 data science tools.
While Figures 2d and 2e show the historical trend that ended in 2016, Figure 2f shows a fresh set of data collected in March 2019. Since Google’s algorithm changes, preventing the new data from matching exactly with the old, this new data starts in 2015 so the two sets overlap. SPSS is not shown on this graph because its dominance would compress the y-axis, making trends in the others harder to see. However, keep in mind that despite SPSS’s 39% drop from 2017 to 2018, its use is still 66% higher than R in 2018! Apparently, people are willing to pay for ease of use.
In Figure 2f, we can see that the downward trends of SAS, Prism, and Statistica are continuing. We also see that the long and rapid growth of R and Stata has come to an end. Growth that rapid can’t go on forever. It will be interesting to see next year whether this is merely a flattening of usage or the beginning of a declining trend. As I pointed out in my book, R for Stata Users, there are many commonalities between R and Stata. As a result of this, and the fact that R is open source, I expect R use to stabilize at this level while the use of Stata continues to slowly decline.
SPSS’s long-term rapid decline has to level out at some point. Its lead has been chipped away at by many competitors. However, until recently, these competitors have either been free and code-based, such as R, or menu-based and proprietary, such as Prism. With the fairly recent arrival of JASP, jamovi, and BlueSky Statistics, SPSS now faces software that is both free and menu-based. Previous projects to add menus to R, such as the R Commander and Deducer, were also free and open source, but they required installing R separately and then using R code to activate the menus.
These results apply to scholarly articles in general. The results in specific fields or journals are very likely to be different.
IT Research Firms
IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services and provide their analysis in reports that they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. The reports exclude open-source software that has no specific company backing, such as R, Python, or jamovi. Even open-source projects that do have company backing, such as BlueSky Statistics, are excluded if they have yet to achieve sufficient market adoption. However, they do cover how company products integrate open-source software into their proprietary ones.
While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.
Gartner, Inc. is one of the research firms that write such reports. Out of the roughly 100 companies selling data science software, Gartner selected 17 which offered “cohesive software.” Such software performs a wide range of tasks, including data importation, preparation, exploration, visualization, modeling, and deployment.
Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3a shows the resulting “Magic Quadrant” plot for 2019, and 3b shows the plot for the previous year. Here I provide some commentary on their choices, briefly summarize their take, and compare this year’s report to last year’s. The main reports from both years contain far more detail than I cover here.
The Leaders quadrant is the place for companies whose vision is aligned with their customers’ needs and who have the resources to execute that vision. The further toward the upper-right corner of the plot, the better the combined score.
- RapidMiner and KNIME reside in the best part of the Leaders quadrant this year and last. This year RapidMiner has the edge in the ability to execute, while KNIME offers more vision. Both offer free and open-source versions, but the companies differ quite a lot on how committed they are to the open-source concept. KNIME’s desktop version is free and open source, and the company says it will always be so. On the other hand, RapidMiner is limited by a cap on the amount of data that it can analyze (10,000 cases), and as they add new features, they usually come only via a commercial license with “difficult-to-navigate pricing conditions.” These two offer very similar workflow-style user interfaces and have the ability to integrate many open-source tools into their workflows, including R, Python, Spark, and H2O.
- Tibco moved from the Challengers quadrant last year to the Leaders this year. This is due to a number of factors, including the successful integration of all the tools they’ve purchased over the years, including Jaspersoft, Spotfire, Alpine Data, Streambase Systems, and Statistica.
- SAS declined from being solidly in the Leaders quadrant last year to barely being in it this year. This is due to a substantial decline in its ability to execute. Given SAS Institute’s billions in revenue, that certainly can’t be a financial limitation. It may be due to SAS’ more limited ability to integrate as wide a range of tools as other vendors have. The SAS language itself continues to be an important research tool among those doing complex mixed-effects linear models. Those models are among the very few that R often fails to solve.
The companies in the Visionaries quadrant are those that have good future plans but which may not have the resources to execute that vision.
- Mathworks moved forward substantially in this quadrant due to MATLAB’s ability to handle unconventional data sources, such as images, video, and the Internet of Things (IoT). It has also opened up more to open-source deep learning projects.
- H2O.ai is also in the Visionaries quadrant. This is the company behind the open-source H2O software, which is callable from many other packages or languages, including R, Python, KNIME, and RapidMiner. While its own menu-based interface is primitive, its integration into KNIME and RapidMiner makes it easy to use for non-coders. H2O’s strength is in modeling, but it is lacking in data access and preparation, as well as model management.
- IBM dropped from the top of the Visionaries quadrant last year to the middle. The company has yet to fully integrate SPSS Statistics and SPSS Modeler into its Watson Studio. IBM has also had trouble getting Watson to deliver on its promises.
- Databricks improved both its vision and its ability to execute, but not enough to move out of the Visionaries quadrant. It has done well with its integration of open-source tools into its Apache Spark-based system. However, it scored poorly in the predictability of costs.
- DataRobot is new to the Gartner report this year. As its name indicates, its strength is in the automation of machine learning, which broadens its potential user base. The company’s policy of assigning a data scientist to each new client gets them up and running quickly.
- Google’s position could be clarified by adding more dimensions to the plot. Its complex collection of a dozen products that work together is clearly aimed at software developers rather than data scientists or casual users. Simply figuring out what they all do and how they work together is a non-trivial task. In addition, the complete set runs only on Google’s cloud platform. Performance on big data is its forte, especially problems involving image or speech analysis/translation.
- Microsoft offers several products, but only its cloud-only Azure Machine Learning was comprehensive enough to meet Gartner’s inclusion criteria. Gartner gives it high marks for ease of use, scalability, and strong partnerships. However, it is weak in automated modeling, and AML’s relation to various other Microsoft components is overwhelming (the same problem as Google’s toolset).
Those in the Challengers quadrant have ample resources but less customer confidence in their future plans or vision.
- Alteryx dropped slightly in vision from last year, just enough to drop it out of the Leaders quadrant. Its workflow-based user interface is very similar to that of KNIME and RapidMiner, and it, too, gets top marks in ease of use. It also offers very strong data management capabilities, especially those that involve geographic data, spatial modeling, and mapping. It comes with geo-coded datasets, saving its customers from having to buy them elsewhere and figure out how to import them. However, it has fallen behind in cutting-edge modeling methods such as deep learning, auto-modeling, and the Internet of Things.
- Dataiku strengthened its ability to execute significantly from last year. It added better scalability to its ease of use and teamwork collaboration. However, it is also perceived as expensive with a “cumbersome pricing structure.”
Members of the Niche Players quadrant offer tools that are not as broadly applicable. These include Anaconda, Datawatch (including the former Angoss), Domino, and SAP.
- Anaconda provides a useful distribution of Python and various data science libraries. They provide support and model management tools. The vast army of Python developers is its strength, but lack of stability in such a rapidly improving world can be frustrating to production-oriented organizations. This is a tool exclusively for experts in both programming and data science.
- Datawatch offers the tools it acquired recently by purchasing Angoss, and its set of “Knowledge” tools continues to get high marks on ease of use and customer support. However, it’s weak in advanced methods and has yet to integrate the data management tools that Datawatch had before buying Angoss.
- Domino Data Labs offers tools aimed only at expert programmers and data scientists. It gets high marks for openness and ability to integrate open source and proprietary tools but low marks for data access and prep, integrating models into day-to-day operations, and customer support.
- SAP’s machine learning tools integrate into its main SAP Enterprise Resource Planning system, but its fragmented toolset is weak, and its customer satisfaction ratings are low.
Forrester Research, Inc. is a leading global research and advisory firm that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.
Historically, Forrester has conducted their analyses similarly to Gartner’s. That approach compares software that uses point-and-click style software, like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports.
The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018, covers software that is controllable by various means such as menus, workflows, wizards, or code (as of 3/22/2019, available free here). Forrester plans to cover tools for automated modeling in a separate report, due out in 2019. Given that automation is now a widely adopted feature of several of the companies covered in this report, that seems like an odd approach.
Forrester divides the vendors into four categories: Leaders, Strong Performers, Contenders, and Challengers.
In the Leaders category, they include IBM, while Gartner viewed them as a middle-of-the-pack Visionary. Forrester and Gartner both view SAS and RapidMiner as leaders.
The Strong Performers category includes KNIME, which Gartner considered a Leader. Datawatch and Tibco are tied in this segment while Gartner had them far apart, with Datawatch put in very last place by Gartner. Forrester has KNIME and SAP next to each other in this category, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here, too, with a similar rating to Gartner.
The Contenders segment contains Microsoft and Mathworks in positions similar to Gartner’s. FICO is here, too; Gartner did not evaluate them.
Forrester’s Challengers segment includes World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither was considered by Gartner.
The Forrester Wave: Notebook-Based Solutions, Q3, 2018 reviews software controlled by notebooks, which blend programming code and output in the same window (as of 3/22/2019, available here).
Forrester rates some of the notebook-based vendors very differently than Gartner. Here Domino Data Labs is a Leader, while Gartner had them at the extreme other end of their plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength in this market is minimal.
In the Strong Performers category are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this category; neither was reviewed by Gartner. Cloudera is here as well; it, too, was left out by Gartner.
Forrester’s Contenders category contains Google in a similar position compared to Gartner’s analysis. Anaconda is here, too, in a position quite a bit higher than in Gartner’s plot.
The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.
As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open-source software) before making any purchasing decisions.
Surveys of Use
Surveys add information regarding software popularity, but they are commonly done using “snowball sampling,” in which the survey provider tries to distribute the link widely, and then vendors vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the information is often limited because the questions are short and precise (e.g., “tools for data mining” or “program languages for data mining”) and responding requires just a few mouse clicks rather than the commitment required to place a job advertisement or publish a scholarly article, book, or blog post. As a result, it’s not unusual to see market share jump 100% or drop 50% in a single year, which is very unlikely to reflect changes in actual use.
Rexer Analytics conducts a survey of data scientists every other year, asking a wide range of questions regarding data science (previously referred to as data mining by the survey itself). Figure 4a shows the tools that the 1,220 respondents reported using in 2015.
We see that R has a more than 2-to-1 lead over the next most popular packages, SPSS Statistics and SAS. Microsoft’s Excel Data Mining software is slightly less popular, but note that it is rarely used as the primary tool. Tableau comes next, also rarely used as the primary tool. That’s to be expected as Tableau is principally a visualization tool with minimal capabilities for advanced analytics.
The next batch of software appears at first to be all in the 15% to 20% range, but KNIME and RapidMiner are listed both in their free versions and, much further down, in their commercial versions. These data come from a “check all that apply” type of question, so if we add the two amounts, we may be overcounting. However, the survey also asked, “What one (my emphasis) data mining / analytic software package did you use most frequently in the past year?” Using these data, I combined the free and commercial versions and plotted the top 10 packages again in figure 4b. Since other software combinations are likely (e.g., SAS and Enterprise Miner, or SPSS Statistics and SPSS Modeler), I combined a few others as well.
In this view, we see R even more dominant, with a 3-to-1 advantage compared to the software from IBM SPSS and SAS Institute. The overall ranking of the top three didn’t change. KNIME, however, rises from 9th place to 4th. RapidMiner rises as well, from 10th place to 6th. KNIME has roughly a 2-to-1 lead over RapidMiner, even though these two packages have similar capabilities and both use a workflow user interface. This may be due to RapidMiner’s move to a more commercially oriented licensing approach. For free, you can still get an older version of RapidMiner or a version of the latest release that is quite limited in the types of data files it can read. Even the academic license for RapidMiner is constrained by the fact that the company views “funded activity” (e.g., research done on government grants) the same as commercial work. The KNIME license is much more generous as the company makes its money from add-ons that increase productivity, collaboration, and performance rather than limiting analytic features or access to popular data formats.
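The difference between the two question types can be made concrete with a small sketch. The respondents and answers below are invented purely for illustration (they are not Rexer data), and the `PRODUCT_FAMILY` mapping is a hypothetical grouping. The sketch shows why adding free and commercial counts from a “check all that apply” question can overcount, while answers to the “one package used most” question can be combined safely, since each respondent contributes exactly one answer.

```python
from collections import Counter

# Hypothetical mapping from a listed product to its product "family".
PRODUCT_FAMILY = {
    "KNIME (free)": "KNIME",
    "KNIME (commercial)": "KNIME",
    "RapidMiner (free)": "RapidMiner",
    "RapidMiner (commercial)": "RapidMiner",
    "R": "R",
}

# Invented responses: each person checks every tool used and names one primary tool.
responses = [
    {"used": ["R", "KNIME (free)", "KNIME (commercial)"], "primary": "KNIME (commercial)"},
    {"used": ["R", "KNIME (free)"], "primary": "R"},
    {"used": ["RapidMiner (free)", "R"], "primary": "RapidMiner (free)"},
    {"used": ["KNIME (free)"], "primary": "KNIME (free)"},
]


def naive_family_counts(rows):
    """Add up every checked box per family; one person may be counted twice."""
    counts = Counter()
    for row in rows:
        for tool in row["used"]:
            counts[PRODUCT_FAMILY[tool]] += 1
    return counts


def distinct_family_counts(rows):
    """Count each respondent at most once per family."""
    counts = Counter()
    for row in rows:
        for family in {PRODUCT_FAMILY[tool] for tool in row["used"]}:
            counts[family] += 1
    return counts


def primary_family_counts(rows):
    """Each respondent names exactly one primary tool, so families add cleanly."""
    return Counter(PRODUCT_FAMILY[row["primary"]] for row in rows)


print("checked boxes summed:", dict(naive_family_counts(responses)))
print("distinct respondents:", dict(distinct_family_counts(responses)))
print("primary tool only:   ", dict(primary_family_counts(responses)))
```

In this toy data the first respondent checked both KNIME versions, so the summed check boxes credit KNIME with one more user than actually exists. The primary-tool counts always total the number of respondents.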
The results of a similar poll done by the KDnuggets.com website in May 2015 are shown in Figure 4c. This one shows R in first place, with 46.9% of users reporting having used it for a “real project.” RapidMiner, SQL, and Python follow quite a bit lower, with around 30% of users. Then at around 20% are Excel, KNIME, and Hadoop. It’s interesting to see that these survey results reverse the order in the previous one, showing RapidMiner as being more popular than KNIME. Both are still the top two “point-and-click” type packages generally used by non-programmers.
O’Reilly Media conducts an annual Data Science Salary Survey, which also asks questions about analytics tools. As their report notes, “O’Reilly content—in books, online, and at conferences—is focused on technology, in particular new technology, so it makes sense that our audience would tend to be early adopters of some of the newer tools.” The results from their “over 600” respondents are shown in figures 6d and 6e.
The O’Reilly results have SQL in first place, with 70% of users reporting it, followed closely by Excel. Python and R follow, seemingly tied for third place with 55%. However, Python also appears in 6th place with its subroutine libraries NumPy, etc., and R’s popular ggplot package appears in 7th place, with around 38% market share. The first commercial package with deep analytic capabilities is SAS, in 23rd place! This emphasizes that the O’Reilly sample is heavily weighted toward their usual open-source audience. Hopefully, in the future, they will advertise the survey to a wider audience and present it as more than just a salary survey. Tool surveys gain additional respondents since they are advertised by advocates of the various tools (vendors, fans, etc.).
Lavastorm, Inc. conducted a survey of analytic communities, including LinkedIn’s Lavastorm Analytics Community Group, Data Science Central, and KDnuggets. The results were published in March 2013, and the bar chart of “self-service analytic tool” usage among their respondents is shown in Figure 6f. Excel comes out as the top tool, with 75.6% of respondents reporting its use.
R comes out as the top advanced analytics tool, with 35.3% of respondents, followed closely by SAS. MS Access’ position in 4th place is a bit of an outlier as no other surveys include it at all. Lavastorm comes out with 3.4%, while other surveys don’t show them at all. That’s hardly a surprise, given that the survey was aimed at Lavastorm’s LinkedIn community group.
The number of books that include a software’s name in its title is particularly useful information since it requires a significant effort to write one, and publishers do their own study of market share before taking the risk of publishing. However, it can be difficult to do searches to find books that use general-purpose languages which also focus only on analytics. Amazon.com offers an advanced search method that works well for all the software except R and the general-purpose languages such as Java, C, and MATLAB. I did not find a way to easily search for books on analytics that used such general-purpose languages, so I’ve excluded them in this section.
The Amazon.com advanced search configuration that I used was (using SAS as an example):
Title: SAS -excerpt -chapter -changes -articles
Subject: Computers & Technology
Condition: New
Format: All formats
Publication Date: After January, 2000
The “title” parameter allowed me to focus the search on books that included the software names in their titles. Other books may use a particular software in their examples, but they’re impossible to search for easily. SAS has many manuals for sale as individual chapters or excerpts. They contain “chapter” or “excerpt” in their title, so I excluded them using the minus sign, e.g., “-excerpt”. SAS also has short “changes and enhancements” booklets that the developers of other packages release only in the form of flyers and/or web pages, so I excluded “changes” as well. Some software listed brief “articles” which I also excluded. I did the search on June 1, 2015, and I excluded excerpts, chapters, changes, and articles from all searches.
“R” is a difficult term to search for since it’s used in book titles to indicate a Registered Trademark, as in “SAS(R)”. Therefore I verified all the R books manually.
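The search setup above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the code builds the title string but does not query Amazon.com, and the regular expression is an assumed rough pre-filter for the “(R)” trademark problem rather than the manual verification actually used. The “SAS(R)” title in the usage example is made up.

```python
import re

# Terms excluded from every search, as described in the text.
EXCLUDED_TERMS = ["excerpt", "chapter", "changes", "articles"]


def build_title_query(software, excluded=EXCLUDED_TERMS):
    """Return a title query such as 'SAS -excerpt -chapter -changes -articles'."""
    return " ".join([software] + [f"-{term}" for term in excluded])


# A standalone capital R that is not the trademark marker "(R)".
STANDALONE_R = re.compile(r"(?<![\w(])R(?![\w)])")


def may_be_about_r(title):
    """Rough pre-filter only; titles that pass would still need a manual check."""
    return bool(STANDALONE_R.search(title))


print(build_title_query("SAS"))
for title in ["R for SAS and SPSS Users", "SAS(R) Example Guide"]:
    print(title, "->", may_be_about_r(title))
```

A filter like this would still let through titles such as ones containing “R&D,” which is one reason a manual check remains necessary.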
The results are shown in the table immediately below, where it’s clear that a very small number of analytics software packages dominate the world of book publishing. SAS has a huge lead with 576 titles, followed by SPSS with 339 and R with 240. SAS and SPSS both have many versions of the same book or manual still for sale, so their numbers are both inflated as a result. JMP and Hadoop both had fewer than half of R’s count, and then Minitab and Enterprise Miner had fewer than half again as many. Although I obtained counts on all 27 of the domain-specific (i.e., not general-purpose) analytics software packages or languages shown in Figure 2a, I cut the table off at software that had 8 or fewer books to save space.
Software            Number of Books
SAS                 576
SPSS Statistics     339
R                   240 [Corrected from blog post: 172]
JMP                 97
Hadoop              89
Stata               62
Minitab             33
Enterprise Miner    32
Table 1. The number of books whose titles contain the name of each software package.
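As a quick check on the comparisons drawn from Table 1, the following sketch recomputes each package’s count relative to R and its share of the titles listed. The counts are copied from the table. The shares are relative to the eight packages shown, not to the whole book market, and the variable names are arbitrary.

```python
# Book counts copied from Table 1.
book_counts = {
    "SAS": 576,
    "SPSS Statistics": 339,
    "R": 240,
    "JMP": 97,
    "Hadoop": 89,
    "Stata": 62,
    "Minitab": 33,
    "Enterprise Miner": 32,
}

total = sum(book_counts.values())
ranked = sorted(book_counts.items(), key=lambda item: item[1], reverse=True)

for name, count in ranked:
    share = 100 * count / total          # share of the titles listed in Table 1
    relative_to_r = count / book_counts["R"]
    print(f"{name:<17}{count:>5}{share:>7.1f}%{relative_to_r:>7.2f} x R")

# Combined share of the three leading packages among the listed titles.
top_three_share = 100 * sum(count for _, count in ranked[:3]) / total
print(f"Top three: {top_three_share:.1f}% of listed titles")
```

Among the titles in the table, SAS, SPSS, and R together account for roughly 79%, which is the sense in which a very small number of packages dominate.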
On Internet blogs, people write about software that interests them, showing how to solve problems and interpreting events in the field. Blog posts contain a great deal of information about their topic, and although it’s not as time-consuming as a book to write, maintaining a blog certainly requires effort. Therefore, the number of bloggers writing about analytics software has potential as a measure of popularity or market share. Unfortunately, counting the number of relevant blogs is often a difficult task. General-purpose software such as Java, Python, the C language variants, and MATLAB have many more bloggers writing about general programming topics than just analytics. But separating them out isn’t easy. The name of a blog and the title of its latest post may not give you a clue that it routinely includes articles on analytics.
Another problem arises from the fact that what some companies would write up as a newsletter, others would do as a set of blogs, where several people in the company each contribute their own blog. Those individual blogs may also be combined into a single company blog, inflating the count further still. Statsoft and Minitab offer examples of this. So what’s really interesting is not the blogs that company employees are assigned to write, but rather those written by outside volunteers. In a few lucky cases, lists of such blogs are maintained, usually by blog consolidators, who combine many blogs into a large “metablog.” All I have to do is find such lists and count the blogs. I don’t attempt to extract the few vendor employees that I know are blended into such lists. I only skip those lists that are exclusively employee-based (or very close to it). The results are shown here:
Software    Number of Blogs    Source
R           550                R-Bloggers.com
Python      60                 SciPy.org
SAS         40                 PROC-X.com, sasCommunity.org Planet
Stata       11                 Stata-Bloggers.com
Table 2. Number of blogs devoted to each software package on April 7, 2014, and the source of the data.
R’s 550 blogs is quite an impressive number. For Python, I could only find that list of 60 that were devoted to the SciPy subroutine library. Some of those are likely to cover topics besides analytics, but determining which of them never cover analytics would be quite time-consuming. The 40 blogs about SAS is still an impressive figure, given that Stata was the only other package that even garnered a list anywhere. That list is maintained by the vendor itself, StataCorp, but it consists of non-employees except for one.
While searching for lists of blogs on other software, I did find individual blogs that at least occasionally covered a particular topic. However, keeping this list up to date is far too time-consuming, given the relative ease with which other popularity measures are collected.
If you know of other lists of relevant blogs, please let me know, and I’ll add them. If you’re a software vendor employee reading this, and your company does not build a metablog or at least maintain a list of your bloggers, I recommend taking advantage of this important source of free publicity.
Discussion Forum Activity
Another way to measure software popularity is to see how many people are helping one another use each package or language. While such data is readily available, it, too, has its problems. Menu-driven software like SPSS or workflow-driven software such as KNIME are quite easy to use and tend to generate fewer questions. Software controlled by programming requires the memorization of many commands and requires more support. Even within languages, some are harder to use than others, generating more questions (see Why R is Hard to Learn).
Another problem with this type of data is that there are many places to ask questions, and each has its own focus. Some are interested in a classical statistics perspective, while others have a broad view of software as general-purpose programming languages. In recent years, companies have set up support sites within their main corporate web site, further splintering the places you can go to get help. Usage data for such sites are not readily available.
Another problem is that it’s not as easy to use logic to focus on specific types of questions as it was with the data from job advertisements and scholarly articles discussed earlier. It’s also not easy to get the data over time to allow us to study trends. Finally, the things such sites measure include software group members (a.k.a. followers), individual topics (a.k.a. questions or threads), and total comments across all topics (a.k.a. total posts). This makes combining counts across sites problematic.
Two of the biggest sites used to discuss software are LinkedIn and Quora. They both display the number of people who follow each software topic, so combining their figures makes sense. However, since the sites lack any focus on analytics, I have not collected their data on general-purpose languages like Java, MATLAB, Python, or variants of C. The results of data collected on 10/17/2015 are shown here:
We see that R is the dominant software and that moving down through SAS, SPSS, and Stata results in a loss of roughly half the number of people in each step. Lavastorm follows Stata, but I find it odd that there was absolutely zero discussion of Lavastorm on Quora. The last bar that you can even see in this plot is the 62 people who follow Minitab. All the ones below that have tiny audiences of fewer than 10.
Next, let’s examine two sites that focus only on statistical questions: Talk Stats and Cross Validated. They both report the number of questions (a.k.a. threads) for a given piece of software, allowing me to total their counts:
We see that R has a 4-to-1 lead over the next most popular package, SPSS. Stata comes in at 3rd place, followed by SAS. The fact that SAS is in fourth place here may be due to the fact that it is strong in data management and report writing, which are not the types of questions that these two sites focus on. Although MATLAB and Python are general-purpose languages, I include them here because the questions on these sites are within the realm of analytics. Note that I collected data on as many packages as were shown in the previous graph, but those not shown have a count of zero. Julia appears to have a count of zero due to the scale of the graph, but it actually had 5 questions on Cross Validated.
Programming Popularity Measures
Several websites rank the popularity of programming languages. Unfortunately, they don’t differentiate between general-purpose languages and application-specific ones used for analytics. However, it’s easy to pick the few analytics languages out of their results.
The most comprehensive of these sites is the IEEE Spectrum Ranking. This site combines 12 metrics from 10 different sites. These include some of the measures discussed above, such as popularity on job sites and search engines. They also include fascinating and useful measures, such as how much new programming code was added to the popular GitHub repository in the last year. This figure shows their top 10 languages for 2015:
We see that R is in 6th place and that it has increased from 9th place in 2014. Not shown in the figure is SAS, in 26th place. Python is ranked in 4th place, but that’s for all purposes, while the use of R is more focused on analytics. No other analytics-specific language makes it into their rankings at all. This ranking is based on a weighted composite score, and the site is interactive, allowing you to generate a ranking more suited to your needs.
The next most comprehensive analysis is provided by RedMonk. Their analysis is simple and objective. They plot the number of lines of code written using each language on the popular GitHub repository against the number of tagged comments on the discussion forum StackOverflow.com. Here is the result:
We can see that RedMonk’s approach shows R as a very popular language, around 12th place. Although a substantial share of the activity measured for Python, MATLAB, and Julia may be due to analytics use, we have no way of knowing how much.
The TIOBE Community Programming Index also ranks the popularity of programming languages. It extracts measurements from the 25 most popular search engines, including Google, YouTube, Wikipedia, and Amazon.com, and combines them into a single index. In their October 2015 rankings, they place R in 20th place and SAS in 23rd. Stata is in a bundle they call “the next 50” languages, whose popularity among general-purpose languages is so sparse that their relative rankings are too unstable to bother giving individual ranks. SPSS is a language they monitor, but it doesn’t make it into their top 100. This brings us to an important limitation of the TIOBE index: it searches for one single string: “X programming.” So if it doesn’t find “SPSS programming,” then the page doesn’t count. The complex searches that I used for jobs and scholarly articles were far more useful in estimating each package’s popularity. Another limitation of the TIOBE index is that it measures what is on the Internet now, so it’s a lagging indicator. There’s no way to plot trends without purchasing their data, which is quite expensive.
A very similar popularity index is PYPL Popularity of Programming Language. It only tracks the top 15 languages, and in October 2015, it placed R in 11th place. It searches on the single string “X tutorial,” making it a leading indicator of what’s likely to be more popular in the future.
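The limitation of matching a single phrase can be seen in a toy example. The sketch below is not how TIOBE or PYPL are implemented; it only mimics the “X programming” and “X tutorial” phrase matching described above, applied to four invented page titles, and compares the result with a looser count of any mention of the name. The function names and the titles are hypothetical.

```python
def phrase_hits(language, suffix, documents):
    """Count documents containing the exact phrase '<language> <suffix>'."""
    phrase = f"{language} {suffix}".lower()
    return sum(phrase in doc.lower() for doc in documents)


def mention_hits(language, documents):
    """Count documents that mention the language name at all (a looser match)."""
    return sum(language.lower() in doc.lower() for doc in documents)


# Invented page titles, for illustration only.
pages = [
    "An SPSS programming primer for survey analysts",
    "Running a t-test with SPSS syntax",
    "SPSS tutorial: recoding variables",
    "Comparing SAS and SPSS for mixed models",
]

print("'SPSS programming' hits:", phrase_hits("SPSS", "programming", pages))
print("'SPSS tutorial' hits:   ", phrase_hits("SPSS", "tutorial", pages))
print("any mention of SPSS:    ", mention_hits("SPSS", pages))
```

All four invented pages discuss SPSS, yet each single-phrase search finds only one of them. A package whose users rarely describe their work as “programming” would be undercounted in the same way.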
The Transparent Language Popularity Index is very similar to the TIOBE Index, except that its ranking software, algorithm, and data are published for all to see. Work on this index ceased as of July 2013.
Sales & Downloads
Sales figures reported by some commercial vendors include products that have little to do with analysis. Many vendors don’t release sales figures, or they release them in a form that combines many different products, making the examination of a particular product impossible. For open-source software such as R, you could count downloads, but one confused person can download many copies, inflating the total. Conversely, many people can use a single download on a server, deflating it.
Download counts for the R-based Bioconductor project are located here. Similar figures for downloads of Stata add-ons (not Stata itself) are available here. A list of Stata repositories is available here. The many sources of downloads, both in repositories and on individuals’ websites, make counting downloads a very difficult task.
Kaggle.com is a website that sponsors data science contests. People post problems there along with the amount of money they are willing to pay the person or team who solves their problem the best. Both money and the competitors’ reputations are on the line, so there’s strong motivation to use the best possible tools. Figure 7 compares the usage of the top two tools chosen by the data scientists working on the problems. From April 2015 through July 2016, we see the usage of both R and Python growing at a similar rate. At the most recent time point, Python has pulled ahead slightly. Much more detail is available here.
Growth in Capability
The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2016) acquired them for R’s main distribution site http://cran.r-project.org/ for each version of R. To simplify ongoing data collection, I kept only the values for the last version of R released each year (usually in November or December), and collected data through the most recent complete year.
These data are displayed in Figure 8. The right-most point is for version 3.2.3, released on 12/10/2015. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995).
To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc., in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, and QC). In 2015, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2015 alone, R added more functions/procs than SAS Institute has written in its entire history.
Of course, while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4, and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.
As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Figure 8. A program run on 4/19/2016 counted 11,531 R packages at all major repositories, 8,239 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates to CRAN that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 40% higher on the y-axis than the one shown in Figure 8.
As with any analysis software, individuals also maintain their own separate collections available on their websites. However, those are not easily counted.
What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor, and GitHub. They indicate that there is an average of 19.78 functions per package. Given the package count of 11,531, as of 4/19/2016, there were approximately 228,103 total functions in R. In total, R has approximately 190 times as many commands as its main commercial competitor, SAS.
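The arithmetic behind these estimates is easy to reproduce. The sketch below uses only figures quoted in the text; the variable names are arbitrary. Because it uses the rounded average of 19.78 functions per package, the total comes out slightly below the 228,103 quoted above, presumably because the original calculation used an unrounded average.

```python
# Figures quoted in the text.
cran_packages = 8_239
all_repo_packages = 11_531       # counted on 4/19/2016
functions_per_package = 19.78    # average reported by Rdocumentation (rounded)
sas_commands = 1_200             # approximate count for SAS 9.3

# How much larger the all-repository count is than the CRAN-only count.
extra_over_cran = all_repo_packages / cran_packages - 1

# Estimated number of R functions, and its ratio to the SAS command count.
estimated_functions = all_repo_packages * functions_per_package
ratio_to_sas = estimated_functions / sas_commands

print(f"All repositories vs. CRAN: {extra_over_cran:.0%} more packages")
print(f"Estimated R functions: {estimated_functions:,.0f}")
print(f"R functions per SAS command: {ratio_to_sas:.0f}")
```

With these inputs the script reports about 40% more packages across all repositories, roughly 228,000 functions, and a ratio of about 190 to 1.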
I previously included graphs from Google Trends. That site tracks not what’s actually on the Internet via searches but rather the keywords and phrases that people are entering into their Google searches. That ended up being so variable as to be essentially worthless. For an interesting discussion of this topic, see this article by Rick Wicklin.
Website Popularity – in previous editions, I have included measures of this. However, as the corporate landscape has consolidated, we end up comparing huge companies with interests far outside the field of analytics (e.g., IBM) with relatively small focused ones, which no longer makes sense.
Although the ranking of each package varies depending on the criteria used, we can still see major trends. Among the software that tends to be used as a collection of pre-written methods, R, SAS, SPSS, and Stata tend to always be toward the top, with R and SAS occasionally swapping places depending on the criteria used. I don’t include Python in this group as I rarely see someone using it exclusively to call pre-written routines.
Among the software that tends to be used as a language for analytics, C/C#/C++, Java, MATLAB, Python, R, and SAS are always towards the top. I list those in alphabetical order since many of the measures cover not only use for analytics but for other uses as well. Among my colleagues, those who are more towards the computer science side of the data science field tend to prefer Python, while those who are more towards the statistics side tend to prefer R. A language worth mentioning is Julia, whose goal is to have syntax as clean as Python’s while maintaining the top speed reached by the C/C#/C++ group.
A trend that I find very interesting is the rise of software that uses the workflow (or flowchart) style of control. While menu-driven software is easy to learn, it’s not easy to re-use the work. Workflow-driven software is almost as easy — the dialog boxes that control each node are almost identical to menu-driven software — but you also get to save and re-use the work. Software that uses this approach includes Alteryx, KNIME, RapidMiner, SPSS Modeler (the first to popularize this approach), and SAS Enterprise Miner. The wide use of this interface is allowing non-programmers to make use of advanced analytics.
I’m interested in other ways to measure software popularity. If you have any ideas on the subject, please contact me at email@example.com.
If you are a SAS or SPSS user interested in learning more about R, you might consider my book, R for SAS and SPSS Users. Stata users might want to consider reading R for Stata Users, which I wrote with Stata guru Joe Hilbe.
I am grateful to the following people for their suggestions that improved this article: John Fox (2009) provided the data on R package growth; Marc Schwartz (2009) suggested plotting the amount of activity on e-mail discussion lists; Duncan Murdoch clarified the pitfalls of counting downloads; Martin Weiss pointed out how to query Statalist for its number of subscribers; Christopher Baum provided information regarding counting Stata downloads; John (Jiangtang) HU suggested I add more detail from the TIOBE index; Andre Wielki suggested the addition of SAS Institute’s support forums; Kjetil Halvorsen provided the location of the expanded list of Internet R discussions; Dario Solari and Joris Meys suggested how to improve Google Insight searches; Keo Ormsby provided useful suggestions regarding Google Scholar; Karl Rexer provided his data mining survey data; Gregory Piatetsky-Shapiro provided his KDnuggets data mining poll; Tal Galili provided advice on blogs and consolidation, as well as Stack Exchange and Stack Overflow; Patrick Burns provided general advice; Nick Cox clarified the role of Stata’s software repositories and of popularity itself; Stas Kolenikov provided the link of known Stata repositories; Rick Wicklin convinced me to stop trying to get anything useful out of Google Insights; Drew Schmidt automated some of the data collection; Peter Hedström greatly improved my search string for Stata; Rudy Richardson pointed out that GraphPad Prism is widely used for statistical analysis; Josh Price and Janet Miles provided expert editorial advice.
J. Fox. Aspects of the Social Organization and Trajectory of the R Project. R Journal, http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.
R. Muenchen, R for SAS and SPSS Users, Springer, 2009
R. Muenchen, J. Hilbe, R for Stata Users, Springer, 2010
M. Schwartz, 1/7/2009, http://tolstoy.newcastle.edu.au/R/e6/help/09/01/0517.html
Alpine, Alteryx, Angoss, Microsoft C#, BMDP, IBM SPSS Statistics, IBM SPSS Modeler, InfoCentricity Xeno, Oracle’s Java, SAS Institute’s JMP, KNIME, Lavastorm, Mathworks’ MATLAB, Megaputer’s PolyAnalyst, Minitab, NCSS, Python, R, RapidMiner, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., SAP’s KXEN, Stata, Statistica, Systat, WEKA / Pentaho are registered trademarks of their respective companies.
Copyright 2010-2022 Robert A. Muenchen, all rights reserved.