Every other year Rexer Analytics surveys Data Analysts, Predictive Modelers, Data Scientists, Data Miners, and all other types of analytic professionals, students, and academics regarding the software they use. I then update the main results in The Popularity of Data Analysis Software. Here’s where you can take the survey:
Survey results will be unveiled at the Fall-2015 Boston Predictive Analytics World event.
Rexer Analytics has been conducting the Data Miner Survey since 2007. Each survey explores the analytic behaviors, views and preferences of data miners and analytic professionals. Over 1,200 people from around the globe participated in the 2013 survey. Summary reports (40 page PDFs) from previous surveys are available FREE to everyone who requests them by emailing DataMinerSurvey@RexerAnalytics.com. Also, highlights of earlier Data Miner Surveys are available at www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html, including best practices shared by respondents on analytic success measurement, overcoming data mining challenges, and other topics. The FREE Summary Report for this 2015 Data Miner Survey will be available to everyone Fall-2015.
Please help spread the word.
Rexer Analytics is a consulting firm focused on providing data mining and analytic CRM solutions. Recent solutions include customer loyalty analyses, customer segmentation, predictive modeling to predict customer attrition and to target direct marketing, fraud detection, sales forecasting, market basket analyses, and complex survey research. More information is available at www.RexerAnalytics.com or by calling +1 617-233-8185.
As the R programming environment has grown in capability and popularity, so has the number of organizations planning to migrate to it from proprietary tools. I’ve helped members of various organizations transition from SAS, SPSS and/or Stata to R (see Workshop Participants), and the process typically involves the following steps:
1) Begin with the most important question: who should you migrate to R? Learning R is not a trivial task (see Why R is Hard to Learn). However, once mastered by people who use it regularly, I think it’s easier to use than other software. But if you have some people who use something like SAS only occasionally and view it as hard to use, you might consider getting them something other than R. Menu-based solutions such as SPSS or R Commander may be a better fit for them. If they want to continue using SAS while lowering your licensing costs, you might consider the SAS implementation used in WPS (see World Programming).
2) Motivate people to migrate. Discussing your current software budget may help. Showing your staff the growth of R’s capabilities and popularity may also help (see The Popularity of Data Analysis Software). Keep in mind that attempting to motivate people to change by criticizing their current choice is likely to backfire. People’s choice of software is very personal and criticizing it is like telling them they have the wrong religion.
3) Use training & documentation that leverages what they already know, that speaks their language. A trainer who knows both your existing environment and R can convert what the analysts currently know rather than simply starting from scratch (note that this is self-serving advice!). There are two parts to this process: learning the new R code and learning to interpret the new R output. Choosing to use R packages that provide output similar to your current software choice will help smooth the transition. Good sources for training include my own on-site training, RStudio’s Training and Consulting Partners, and DataCamp’s R for SAS, SPSS and STATA Users. Books that help with the conversion process include:
4) Provide in-house tech support. Before training a whole team, get one in-house expert trained to act as a consultant to others. Make sure this person is well known by everyone and has time freed up to provide help.
5) Match your staff’s current work style, work flow and output. This is a particularly complex topic. Some examples: if your people are running SAS from the Excel plug-in, get the R plug-in; if they’re using Enterprise Miner, consider a similar interface that controls R such as the KNIME Analytics Platform. If Microsoft Word is their main word processor, don’t complicate the conversion by switching them to the LaTeX text processor at the same time (LaTeX is wonderful and very popular among R users, but it’s a mess to learn that and R at the same time). Instead, use an approach that generates Word output.
6) Migrate one step at a time if possible. For example, if you use SAS/ETS for forecasting, consider replacing just that one piece. When finished and successful, choose the next product, saving SAS/Base for last.
7) Convert your programs or use conversion services. If your programs are all in production, this could be a huge job. However, if you mostly use SAS for new research tasks, you may not need to convert old code from which you just needed a solution. Be careful to avoid line-by-line conversion; think like R (e.g. avoid for- and while-loops in R). When using external conversion services, make sure to involve your own staff in the process so you don’t end up with code that’s almost impossible to maintain.
I have found that following these steps helps during conversions to R. It’s a big job, though, so allocate plenty of time to it. Good luck!
I’ve been tracking The Popularity of Data Analysis Software for many years now, and a clear trend is the decline of the market share of the bigger analytics firms, notably SAS and SPSS. Many people have interpreted my comments as implying the decline in the revenue of those companies. But the fields involved in analytics (statistics, data mining, analytics, data science, etc.) have been exploding in popularity, so having a smaller slice of a much bigger pie still leaves billions in revenue for the big players.
Each year, the Gartner Group, “the world’s leading information technology research and advisory company”, collects data in a survey of the customers of 42 business intelligence firms. They recently released the data on the customers’ plans to discontinue use of their current software in one to three years. The results are shown in the figure below. Over 16% of the SAS Institute customers surveyed reported considering discontinuing their use of the software, the highest of any of the vendors shown. It will be interesting to see if this will actually lead to an eventual decline in revenue. Although I have helped quite a few organizations migrate from SAS to R, I would be surprised to see SAS Institute’s revenue decline. They offer excellent software and service which I still use, though not anywhere near as much as R.
I’ve updated one of my most widely read blog posts, Why R is Hard to Learn. It focuses on the aspects of R which tend to trip up beginners. The new version is over twice as long as the original and it is located under the Articles menu, making it easier to find. Of course my new interactive workshop on DataCamp.com and my upcoming webinars with Revolution Analytics cover these trouble spots thoroughly.
[Since this was originally published in 2014, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]
Here is my latest update to The Popularity of Data Analysis Software. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are:
As I covered in my talk at the UseR 2014 meeting, it is very likely that during the summer of 2014, R became the most widely used analytics software for scholarly articles, ending a spectacular 16-year run by SPSS.
Stata has probably passed Statistica in scholarly use, and its rapid rate of growth parallels that of R.
If you’d like to be alerted to future updates on this topic, you can follow me on Twitter, @BobMuenchen.
Scholarly Articles
The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a good leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, I re-collect the data for all years following the protocol described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/.
Figure 2a shows the number of articles found for each software package for all the years that Google Scholar can search. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. Note that the general purpose software MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest. Neither C nor C++ is included here because it’s very difficult to focus the search compared to the search for jobs above, whose job descriptions commonly include a clear target of skills in “C/C++” and “C or C++”.
From RapidMiner on down, the counts appear to be zero. That’s not the case, but relative to the others, it might as well be.
Figure 2b shows the number of articles for the most popular six classic statistics packages from 1995 through 2013 (the last complete year of data when this graph was made). As in Figure 2a, SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.
Since SAS and SPSS dominate the vertical space in Figure 2a by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, in Figure 2c. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. In fact, extending the downward trend of SPSS and the upward trend of R makes it likely that sometime during the summer of 2014 R became the most dominant package for analytics used in scholarly publications. Due to the lag caused by the publication process, getting articles online, indexing them, etc., we won’t be able to verify that this has happened until well into 2015 (correction: this said 2014 when originally posted).
After R, Statistica is in fourth place and growing, but at a much lower rate. Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”
Extrapolating from the trend lines, it is likely that the use of Stata among academics passed that of Statistica fairly early in 2014. The remaining three packages, Minitab, Systat and JMP are all growing but at a much lower rate than either R or Stata.
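The kind of extrapolation described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the method used to produce the figures: it fits a straight line to each series by least squares and solves for the point where the two lines meet. The function names (`fit_line`, `crossing_point`) and the numbers are hypothetical toy values, not real article counts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_point(line_a, line_b):
    """x at which two (slope, intercept) lines meet, or None if parallel."""
    (slope_a, intercept_a), (slope_b, intercept_b) = line_a, line_b
    if slope_a == slope_b:
        return None
    return (intercept_b - intercept_a) / (slope_a - slope_b)


# Toy values for illustration only (NOT real Google Scholar counts).
periods = [0, 1, 2, 3]
declining_package = [100, 90, 80, 70]  # falls by 10 per period
rising_package = [10, 25, 40, 55]      # rises by 15 per period

declining_fit = fit_line(periods, declining_package)
rising_fit = fit_line(periods, rising_package)
crossover = crossing_point(declining_fit, rising_fit)
print(f"Lines cross at period {crossover:.1f}")
```

With these toy numbers the fitted lines meet between periods 3 and 4, that is, beyond the last observed period. A straight-line fit is the simplest possible choice; real counts rarely follow a line exactly, which is one reason such a crossover can only be confirmed once the later data arrive.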
The Knoxville R Users Group (KRUG) is hosting a brown bag viewing of RStudio’s webinar “Interactive Reporting” at 11am, Weds 3-Sept-2014 in 427 Hesler on the UTK campus “Hill”. Per RStudio.net, data scientist Garrett Grolemund and software engineer Joe Cheng will speak on how to make your R Markdown documents interactive, and then unleash the full flexibility of analytic app development with shiny. Come join us!
This free, global webinar will provide an introduction to jpmml, the world’s leading open-source PMML scoring engine currently being utilized by companies such as Airbnb to rapidly deploy predictive models into production.
Webinar Format:
– What is PMML?
– Building a predictive model in R and exporting it to PMML format
– Deploying a PMML model into a cloud-based Openscoring service
– Scoring Google Spreadsheet data
– Scoring PostgreSQL data
– Scoring Hadoop data
– Q&A
Speaker:
– Villu Ruusmann, Creator of jpmml and Founder of Openscoring.io
This event is brought to you by The Orange County R User Group.
The Orange County R User Group (OC-RUG) will soon host a free webinar on the new “RSelenium” R Package which provides a set of R bindings for the Selenium 2.0 webdriver using the JsonWireProtocol. Using RSelenium, you can automate web browsers locally or remotely in order to test various web apps, such as Shiny applications.
Webinar Format:
– Introduction to the RSelenium R package
– Live Demonstration
– Question and Answer period
Date: May 21, 2014 at 10 am Pacific (California) time
Speaker:
John Harrison, RSelenium package author/maintainer
For more information on the RSelenium package, please visit this site:
Please note that in addition to attending from your laptop or desktop computer, you can also attend from a Wi-Fi connected iPhone, iPad, Android phone or Android tablet by installing the GoToMeeting App.
In my never-ending quest to study the Popularity of Data Analysis Software, I recently read the 2013 Edition of the Wisdom of Crowds Business Intelligence Market Study by Dresner Advisory Services, LLC. In it, I found the table below which displays the “Wisdom of Crowds”, or what most people would call survey results.
As you can see, among high growth business intelligence vendors, Tableau comes out on top with a mean score of 4.40. Not long afterwards, I saw this press release that claimed, “TIBCO Spotfire Named the Leader in ‘High Growth Business Intelligence’ Market Segment. Spotfire Achieves ‘Best in Class’ in Wisdom of Crowds SME Business Intelligence Study.” That can’t be right, can it? I downloaded the report, expecting it to be for 2014. However, it was from the same year as the previous one I had read, 2013, and it contained the following familiar-looking table.
To my surprise, Spotfire was now in first place, with Tableau in 3rd. Then I belatedly noticed the “SME” in Tibco’s headline. It turns out that the second report was based on a subset of the first, selecting responses only from Small and Mid-sized Enterprises (SMEs). While the reports looked identical to me at first, the subset report does include “Small and Mid-sized Enterprises” right in its title. So both company claims are correct once you figure out who is being surveyed.
But what does this mean for the value of a #1 ranking from Dresner? The reports break the 23 vendors down into 5 groups: Titans, Large Established Pure-Play, High Growth, Specialized, and Emerging (they mention Early Stage ones too, but don’t rank them). This approach results in four or five vendors per table. By doing the report twice, once for all respondents and again for SMEs, there are 10 opportunities for companies to be #1 in any given year. In the two Dresner reports from 2013, 7 of the 23 companies are #1. Companies are more likely to purchase distribution rights to reports when they come out looking good. Since Dresner makes a living selling reports, that gives a whole new meaning to the term Business Intelligence!
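The arithmetic behind that observation is easy to check. The short Python sketch below uses only the counts quoted above (5 ranked groups, 2 reports, 23 vendors, 7 first-place finishers); the variable names are mine and purely illustrative.

```python
ranked_groups = 5         # Titans, Large Established Pure-Play, High Growth, Specialized, Emerging
reports_per_year = 2      # all respondents, plus the SME subset
vendors = 23              # vendors covered by the reports
first_place_vendors = 7   # vendors ranked #1 in at least one 2013 table

# Each group gets its own table in each report, so each table has one #1 slot.
first_place_slots = ranked_groups * reports_per_year

# Share of all vendors that can claim a #1 ranking somewhere.
share_ranked_first = first_place_vendors / vendors

print(f"{first_place_slots} first-place slots per year")
print(f"{share_ranked_first:.0%} of vendors were #1 in some table")
```

In other words, roughly three vendors in ten could truthfully advertise a first-place finish from the 2013 reports.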
A free webinar will provide an introduction to the “JMBayes” R package which provides methods for Joint Modeling of Longitudinal and Time-to-Event Data under a Bayesian Approach.
Webinar Format:
– Introduction to Joint Models and the JMBayes R package
– Live demonstration
– Question and Answer period
Speaker:
– Dimitris Rizopoulos, JMBayes Package Maintainer
For more information on the JMBayes package, please visit this site:
Please note that in addition to attending from your laptop or desktop computer, you can also attend from a Wi-Fi connected iPhone, iPad, Android phone or Android tablet by installing the GoToMeeting App.