May, 2015 | r4stats.com

R #1 by Wide Margin in Latest KDnuggets Poll

The results of the latest KDnuggets Poll on software for Analytics, Big Data and Data Mining are out, and R has moved into the #1 position by a wide margin. I’ve updated the Surveys of Use section of The Popularity of Data Analysis Software to include a subset of those results, which I include here:

…The results of a similar poll done by the KDnuggets.com web site in May of 2015 are shown in Figure 6b. This one shows R in first place with 46.9% of users reporting having used it for a real project. RapidMiner, SQL, and Python follow quite a bit lower with around 30% of users. Then at around 20% are Excel, KNIME and Hadoop. It’s interesting to see what has happened to two very similar tools, RapidMiner and KNIME. Both used to be free and open source. RapidMiner then adopted a commercial model, with an older version still free. KNIME kept its desktop version free and, likely as a result, its use has more than tripled over the last three years. SAS Enterprise Miner uses a very similar workflow interface, and its reported use, while low, has almost doubled over the last three years. Figure 6b only shows those packages that have at least 5% market share. KDnuggets’ original graph and detailed analysis are here.

Figure 6b. Percent of respondents that used each software package in KDnuggets’ 2015 poll. Only software with at least 5% market share is shown. The “% alone” figure is the percent of each tool’s voters who used only that tool. For example, only 3.6% of R users have used only R, while 13.7% of RapidMiner users indicated they used that tool alone. Years are color coded, with 2015, 2014, 2013 from top to bottom.
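To make the two percentages in the caption concrete, here is a minimal Python sketch of how a usage share and a “% alone” figure can be computed from multi-select poll answers. The `responses` list is a made-up toy sample, not KDnuggets data, and the function name `tool_shares` is purely illustrative.

```python
from collections import Counter

def tool_shares(responses):
    """Return {tool: (percent of all voters who used it,
                      percent of that tool's voters who used it alone)}."""
    used = Counter()   # voters who selected the tool at all
    alone = Counter()  # voters who selected only that tool
    for tools in responses:
        used.update(tools)
        if len(tools) == 1:
            alone.update(tools)
    n = len(responses)
    return {
        tool: (100 * used[tool] / n, 100 * alone[tool] / used[tool])
        for tool in used
    }

# Hypothetical toy sample: each set is one respondent's selected tools.
responses = [
    {"R", "Python"},
    {"R"},
    {"R", "SQL", "Excel"},
    {"RapidMiner"},
    {"Python", "SQL"},
]

shares = tool_shares(responses)
for tool, (pct_used, pct_alone) in sorted(shares.items()):
    print(f"{tool}: used by {pct_used:.1f}%, alone for {pct_alone:.1f}% of its users")
```

Note that the two percentages have different denominators: the usage share divides by all respondents, while “% alone” divides by the respondents who chose that tool.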

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do R training on-site.

R Now Contains 150 Times as Many Commands as SAS

by Bob Muenchen

In my ongoing quest to analyze the world of analytics, I’ve updated the Growth in Capability section of The Popularity of Data Analysis Software. To save you the trouble of foraging through that tome, I’ve pasted it below.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/, and I collected the data for later versions following his method.

Figure 9 shows the number of R packages on CRAN for the last version released in each year. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995). The right-most point is for version 3.1.2, the last version released in late 2014.

Figure 9. Number of R packages available on its main distribution site for the last version released in each year.
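For readers curious how a quadratic fit and its R-squared can be computed, here is a minimal pure-Python sketch. It is not the code used to produce Figure 9, and the sample points are synthetic (generated from an exact quadratic), not the CRAN package counts. The function names `fit_quadratic` and `r_squared` are illustrative.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Power sums needed by the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]                   # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # t[k] = sum of y*x^k
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def r_squared(xs, ys, coeffs):
    """Share of the variance in ys explained by the fitted curve."""
    a, b, c = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic example: x is a year offset, y follows an exact quadratic.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x * x + 3 * x + 1 for x in xs]

coeffs = fit_quadratic(xs, ys)
fit_quality = r_squared(xs, ys, coeffs)
print(coeffs, fit_quality)
```

With real data, using year offsets (0, 1, 2, ...) rather than raw calendar years keeps the power sums small and the fit numerically stable.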

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2014, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2014 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course, while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Figure 9. A program run on 5/22/2015 counted 8,954 R packages at all major repositories, 6,663 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates of CRAN packages that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 34.4% higher on the y-axis than the one shown in Figure 9. Therefore, the estimated total growth in R functions for 2014 was 27,642 * 1.344, or about 37,151.

As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 20.37 functions per package. Since a program run on 5/22/2015 counted 8,954 R packages at all major repositories except GitHub, on that date there were approximately 182,393 total functions in R. In total, R has over 150 times as many commands as SAS.
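The arithmetic behind these estimates is easy to check. Here is a minimal Python sketch that uses only the figures quoted above. The variable and function names are illustrative, and rounding each estimate to a whole number of functions is an assumption.

```python
# Figures quoted in the text above.
SAS_COMMANDS = 1_200               # approximate count for SAS 9.3
CRAN_PACKAGES_ADDED_2014 = 1_357   # packages added to CRAN during 2014
FUNCTIONS_PER_PACKAGE = 20.37      # average reported by Rdocumentation
ALL_REPO_PACKAGES = 8_954          # all major repositories except GitHub, 5/22/2015
CRAN_PACKAGES = 6_663              # CRAN only, same date

def estimate_functions(packages, per_package=FUNCTIONS_PER_PACKAGE):
    """Estimate a function count from a package count."""
    return round(packages * per_package)

# Functions added on CRAN in 2014.
cran_functions_2014 = estimate_functions(CRAN_PACKAGES_ADDED_2014)

# How much larger the all-repository package count is than CRAN alone.
repo_ratio = ALL_REPO_PACKAGES / CRAN_PACKAGES

# Total functions across all counted repositories, and the ratio to SAS.
total_functions = estimate_functions(ALL_REPO_PACKAGES)
r_to_sas = total_functions / SAS_COMMANDS

print(f"CRAN functions added in 2014: {cran_functions_2014:,}")
print(f"All repositories vs. CRAN: {100 * (repo_ratio - 1):.1f}% more packages")
print(f"Total R functions: {total_functions:,}")
print(f"R-to-SAS ratio: {r_to_sas:.0f}")
```

Because every estimate is a package count multiplied by the same average, the results are only as good as the assumption that the average number of functions per package holds across repositories.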

I’ve Been Replaced by an Analytics Robot

It was only a few years ago when the N.Y. Times declared my job “sexy”. My old job title of statistician had sounded dull and stodgy, but then it became filled with exciting jargon: I’m a data scientist doing predictive analytics with (occasionally) big data. Three hot buzzwords in a single job description! However, in recent years, the powerful technology that has made my job so buzzworthy has me contemplating the future of the field. Computer programs that automatically generate complex models are becoming commonplace. Rob Hyndman’s forecast package for R, SAS Institute’s Forecast Studio, and IBM’s SPSS Forecasting offer the ability to generate forecasts that used to require years of training to develop. Similar tools are now available for other types of models as well.

Countless other careers have been eliminated due to new technology. The United States once had over 70% of its population employed in farming; fewer than 2% are farmers today. Things change, and people move on to other careers. The KDnuggets web site recently asked its readers, “When will most expert-level Predictive Analytics/Data Science tasks – currently done by human Data Scientists – be automated?” Fifty-one percent of the respondents, most of them data scientists themselves, estimated that this would happen within 10 years. Not all the respondents had such a dismal view though; 19% said that this would never happen.

My brain being analyzed by the machine that replaced my brain! (Photograph by Mike O’Neil)

If you had asked me in 1980 what would be the very last part of my job to be eliminated through automation, I probably would have said: brain wave analysis. It had far more steps involved than any other type of work I did. We were measuring the electrical activity of many parts of the brain, at many frequencies, thousands of times per second. An analysis that simply compared two groups would take many weeks of full-time work. Surprisingly, this was the first part of my job to be eliminated. However, our statistical consulting team supports many different departments, so I didn’t really notice when work stopped arriving from the EEG Lab. Years later I got a call from the new lab director offering to introduce me to my replacement: a “robot” named LORETA.

When I visited the lab, I was outfitted with the usual “bathing cap” full of electrodes. EEG paste (essentially K-Y jelly) was squirted into a hole in each electrode to ensure a good contact and the machine began recording my brain waves. I used bio-feedback to generate alpha waves which made a car go around a track in a simple video game. Your brain creates alpha waves when you get into a very relaxed, meditative state. Moments after I finished, LORETA had already analyzed my brain waves. “She” had done several weeks of analysis in just a few moments.

So that part of my career ended years ago, but I didn’t really notice it at the time. I was too busy using the time LORETA freed up to learn image analysis using ImageJ, text mining using WordStat and SAS Text Miner, and an endless variety of tasks using the amazing R language. I’ve never had a moment when there wasn’t plenty of interesting new work to do.

There’s another aspect to my field that’s easy to overlook. When I began my career, 90% of the time was spent “battling” computers. They were incredibly difficult to operate. Today someone may send you a data file and you’ll be able to see the data moments after receiving it. In 1980 data arrived on tapes, and every computer manufacturer used a different tape format, each in numerous incompatible variations. Unless you had a copy of the program that created a tape, it might take days of tedious programming just to get the data off of it. Even asking the computer to run a program required error-prone Job Control Language. So from that perspective, easier-to-use computing technology has already eliminated 90% of what my job used to be. It wasn’t the interesting part of the job, so it was a change for the better.

Will the burgeoning field of data science eventually put itself out of business by developing a LORETA for every problem that needs to be solved? Will we just be letting our Star-Trek-class computers and robots do our work for us while we lounge around self-actualizing? Perhaps some day, but I doubt it will happen any time soon!

Stata’s Academic Growth Nearly as Fast as R’s

by Bob Muenchen

Analytics tools take significant effort to master, so once people learn one, they tend to stick with it for much of their careers. This makes the tools used in academia of particular interest in the study of future trends of market share. I’ve been tracking The Popularity of Data Analysis Software regularly since 2010, and thanks to an astute reader, I now have a greatly improved estimate of Stata’s academic growth. Peter Hedström, Director of the Institute for Analytical Sociology at Linköping University, wrote to me convinced that I was underestimating Stata’s role by a wide margin, and he was right.

Fig_2e_ScholarlyImpactBig6

Two things make Stata’s popularity difficult to gauge: 1) Stata means “been” in Italian, and 2) it’s a common name for the authors of scholarly papers and those they cite. Peter came up with the simple, but very effective, idea of adding StataCorp’s headquarters, College Station, Texas, to the search. That helped us find far more Stata articles while blocking the irrelevant ones. Here’s the search string we came up with:

("Stata" "College Station") OR "StataCorp" OR "Stata Corp" OR 
"Stata Journal" OR "Stata Press" OR "Stata command" OR 
"Stata module"

The blank between Stata and College Station is an implied logical “and”. This string found 20% more articles than my previous one. This success motivated me to try to improve some of my other search strings. R and SAS are both difficult to search for due to how often those letters stand for other things. I was able to find 15% more R articles than before using this:

"r-project.org" OR "R development core team" OR "lme4" OR 
"bioconductor" OR "RColorBrewer" OR "the R software" OR 
"the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR 
"knitr" OR "RODBC" OR "stringr" OR "mass package"

Despite hours of effort, I was unable to improve on the simple SAS search string of “SAS Institute.” Google Scholar’s logic seems to fall apart since “SAS Institute” OR “SAS procedure” finds fewer articles! If anyone can figure that out, please let me know in the comments section below. As usual, the steps I use to document all searches are detailed here.
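The Stata and R search strings above share one pattern: quoted phrases separated by a blank are ANDed together, and the resulting groups are joined with OR. Here is a minimal Python sketch that assembles strings in that form. The helper names `quote` and `build_query` are illustrative, and the sketch only builds the text; it does not query Google Scholar.

```python
def quote(phrase):
    """Wrap a phrase in double quotes so it is searched as an exact phrase."""
    return f'"{phrase}"'

def build_query(groups):
    """Join groups with OR; phrases inside a group are ANDed by a blank."""
    parts = []
    for group in groups:
        anded = " ".join(quote(p) for p in group)
        # Parenthesize only multi-phrase groups, as in the Stata string.
        parts.append(f"({anded})" if len(group) > 1 else anded)
    return " OR ".join(parts)

stata_query = build_query([
    ["Stata", "College Station"],   # both phrases must appear
    ["StataCorp"],
    ["Stata Corp"],
    ["Stata Journal"],
    ["Stata Press"],
    ["Stata command"],
    ["Stata module"],
])

r_terms = [
    "r-project.org", "R development core team", "lme4", "bioconductor",
    "RColorBrewer", "the R software", "the R project", "ggplot2", "Hmisc",
    "rcpp", "plyr", "knitr", "RODBC", "stringr", "mass package",
]
r_query = build_query([[term] for term in r_terms])

print(stata_query)
print(r_query)
```

Keeping the terms in a list makes it easy to add or drop one phrase at a time and see how the article count changes.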

The improved search strings have affected all the graphs in the Scholarly Articles section of The Popularity of Data Analysis Software. At the request of numerous readers, I’ve also added a log-scale plot there which shows the six most popular classic statistics packages:

If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do training on-site but I’m often booked about 8 weeks out.

I invite you to follow me on this blog and on Twitter.