While R has more methods than any other analytics software, it has been missing a crucial feature found in most other packages. SPSS Modeler had it first, way back when they still called it Clementine. Then SAS Institute realized how crucial it was to productivity and added it to Enterprise Miner. As its reputation spread, it was added to RapidMiner, KNIME, Statistica, Weka, and others. An early, valiant attempt was made to add it to R. What is it? It's the flowchart-style graphical user interface. And it will soon be available in R at last.
While menu-driven interfaces such as R Commander, Deducer or SPSS are somewhat easier to learn, the flowchart interface has two important advantages. First, you can often grasp the big picture as you see steps such as separate files merging into one, or several analyses coming out of a particular data set (see figure). Second, and more important, you have a precise record of every step in your analysis. This allows you to repeat an analysis simply by changing the data inputs. In contrast, menu-driven interfaces require that you switch to the programs they create in the background if you need to automatically re-run many previous steps. That's fine if you're a programmer, but if you were a good programmer, you probably would not have been using that type of interface in the first place!
Alteryx’ flowchart user interface, soon to be added to Revolution R Enterprise.
This week Revolution Analytics and Alteryx announced that future versions of Revolution R Enterprise will include Alteryx' flowchart-style graphical user interface. Alteryx has traditionally focused on the analysis of spatial data, only adding predictive analytics in 2012 (skip to 37 minutes into this presentation). This partnership will also allow them to add Revolution's big data features to various Alteryx products. Both companies are likely to get a significant boost in sales as a result.
While I expect both companies will benefit from this partnership, they could do much better. How? By making the Alteryx interface available for the community (free) version of R. If most R users were familiar with this interface, they would be much more likely to choose Alteryx’ tools when they needed them, instead of a competitor’s. When people needed big data tools for R, they’d be more likely to turn to Revolution Analytics. I am convinced that as great as R’s success has been, it could be greater still with a top-quality flowchart user interface that was freely available to all R users. Given the great advantages that this type of interface offers, it’s just a matter of time until a free version appears. The only question is: who will offer it?
[Update: It turns out that Alteryx is already offering a free version that works with the community version of R! See the comment from Dan Putler, product manager and one of the primary developers of Alteryx’s R-based predictive analytics and business data mining tools. I’ll be trying this out and will report my experiences in a future blog post.]
If you want to learn R, or improve your current R skills, join me for two workshops that I’m offering through Revolution Analytics in October.
If you already know another analytics package, the workshop Intro to R for SAS, SPSS and Stata Users may be for you. I'll introduce each R concept using terminology that you already know, then translate it into R's very different view of the world. You'll be following along, with hands-on practice, so that by the end of the workshop R's fundamentals should be crystal clear. The examples we'll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way, if you need more explanation later or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index list topics by both SAS/SPSS/Stata terminology and R terminology, so you can use either to find what you need. You can see a complete outline and register for the workshop here.
If you already know R but want to learn more about data management, the workshop Managing Data with R will demonstrate how to perform the 15 most widely used data management tasks. That course outline and registration link are here.
If you have questions about any of these courses, drop me a line at muenchen.bob@gmail.com. I'm always available to answer questions regarding any of my books or workshops.
Tracking the job market for statistics, analytics, data mining and the like used to be a major undertaking. However, on November 10, 2011 the world’s largest web site for job postings, Indeed.com, released a tool that allows you to examine trends of your own choosing. David Smith, of Revolution Analytics, recently used this tool to compare the job markets for SAS, R, SPSS and even COBOL.
As easy as this tool is to use, some things are inherently difficult to search for. The name of the fastest growing analytics package, R, is not easy to separate from all sorts of other uses of that letter. Adding logical conditions to the search helps produce a more relevant answer, but there is no perfect search for this software. For example, adding "statistics," as David did, helps a lot, but the results still include jobs that use statistics (but not R), where the letter R instead comes from these extremely popular job categories:
R&D = Research and Development
H.R. = Human Resources
A/R = Accounts Receivable
After studying the results of many types of searches, I settled on a very long query that depended on R appearing in sentences like, "the successful job applicant will have expertise in SAS, SPSS or R." Commas are ignored in Indeed.com searches, so I used the strings "SAS R", "R SAS", "R or SAS", and "SAS or R". I paired R in this way with each of the following: Java, Minitab, Perl, Python, Ruby, SAS, SPSS, SQL and Stata. Unfortunately, Indeed's trend tool cannot handle multiple long queries, so my final query is as follows:
“r sas” or “sas r” or “r or sas” or “sas or r” or “r spss” or “spss r” or “r or spss” or “spss or r” or “r stata” or “stata r” or “r or stata” or “stata or r” or “r minitab” or “minitab r” or “r or minitab” or “minitab or r”
Note the confusing use of the word "or". Outside of quotes, it is a logical operator, as in: X or Y. Within quotes, however, it becomes part of the search string itself, matching job descriptions that include the word "or". From this point onward, when I describe software as "used for statistical purposes," I am referring to this precise definition (substituting the package at hand into the query, of course).
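If you would like to build or extend this query yourself, here is a minimal R sketch (my illustration only, not the script I actually used for the analysis) that generates the four phrase patterns for each package and pastes them into a single query:

# A minimal sketch (illustration only) that builds the query shown above:
# four phrase patterns per package, joined by Indeed's logical "or".
packages <- c("sas", "spss", "stata", "minitab")
patterns <- unlist(lapply(packages, function(pkg)
  c(paste("r", pkg), paste(pkg, "r"),
    paste("r or", pkg), paste(pkg, "or r"))))
query <- paste(paste0('"', patterns, '"'), collapse = " or ")
cat(query)
# "r sas" or "sas r" or "r or sas" or "sas or r" or "r spss" or ...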
Even with this shortened search string, only two at a time would fit into Indeed’s search. Figure 1 shows the plot comparing R and SAS.
Figure 1. The percentage of job postings across time for SAS and R. Both are focused on statistical uses via complex query.
We see that there is an overall pattern of growth for SAS. However, the growth seems to have stagnated from January 2010 onward. At the most recent time point, the percentage of jobs for SAS is twice as high as for R. That 2-to-1 ratio is far smaller than I reported as recently as two months ago. Why the change? I had previously used complex logic to find R and simpler logic to find SAS. SAS is much easier to find, but by using simpler logic I was essentially comparing R use for statistics to SAS use for all purposes. While that may sound like an irrelevant comparison, it is one that helps to show that R competes with SAS not just for statistical use, but also for its use in general data processing, report writing and related non-analytic tasks. Once a company is using SAS for report writing, it is more likely to use it for at least the fundamental statistics that come with Base SAS at no additional cost. Below is a graph (Fig. 2) comparing the search string "SAS !SATA !storage !firmware" to the complex R string from above. The exclamation point excludes terms, and the letters SAS also stand for Serial Attached SCSI, a data storage technology related to computer hardware and firmware, not statistics. As a result, those jobs are excluded from the search.
Figure 2. Percentage of job postings across time for all uses of SAS compared to only statistical uses of R.
We see that SAS is now far more dominant than before. It's difficult to assess from the graph, but a direct job search shows 9 times as many jobs for SAS in this type of comparison (11,320 vs. 1,246).
How much broader is the general market for SAS compared to that focused on statistics? A direct job search for SAS for all uses yields 4.5 times as many jobs as a search that focuses on SAS for only statistical purposes (11,162 vs. 2,456). Interestingly, a similar comparison for SPSS results in only a 1.8-fold difference (3,231 vs. 1,808) while one for Stata is only 1.4 times higher (897 vs. 620). The ratios may reflect the breadth of use each package has in business reporting rather than statistical analysis.
Comparing job openings of R to those for SPSS, both for statistical purposes, yields the plot in Figure 3.
Figure 3. Percentage of job postings across time for SPSS and R, both for statistical purposes.
We see that both SPSS and R show an overall upward trend, with R much steeper in the more recent years. The data for the most recent time period show that SPSS is still ahead, but not by a very wide margin.
Next, let us examine the trend in jobs for R and Stata (Fig. 4).
Figure 4. Percentage of job postings across time for Stata and R, both used for statistical purposes.
We see that jobs for Stata grew until mid-2010 and have held steady since. Jobs for R have grown at a much higher and steadier rate since around January 2009. In the most recent time period, there are roughly three times as many jobs for R as for Stata.
Given the power and ease of use of Indeed.com’s trend analyzer, I plan to switch the discussion over to it in future versions of The Popularity of Data Analysis Software. I’m very interested in hearing from people who can think of better ways to search for R using Indeed.com’s job trend tool.
Before you can analyze data, it must be in the right form. Join Revolution Analytics and me this June 21st for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr and lubridate.
Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example that you have already seen.
At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!
Employment is important to us all, so what software skills are employers seeking? A thorough answer to this question would require a time-consuming content analysis of job descriptions. However, we can get a rough idea by searching on job advertising sites. Indeed.com is the most popular job search site in the world. As their CEO and co-founder Paul Forster stated, it includes "all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites." I used a program that went there weekly and searched job descriptions for keywords such as "SPSS" or "Minitab." This was repeated during the 2nd, 3rd and 4th weeks of March in 2012 and 2013. (The data were meant to cover the complete two years, but the automated process went awry.)
The abbreviation “SAS” is common in computer storage, so I avoided those by searching for “SAS !SATA !storage !firmware” (the exclamation point represents a logical “not”). I focused on R while avoiding related topics like “R&D” by using “R SAS” or “SAS R”, including each package in the graph. The data for 2013 are presented in Figure 11.
Figure 11. Mean number of jobs per week available on Indeed.com for each software (March 2013) [last label should read “BMDP”].
SAS has a very substantial lead in job openings, with SPSS coming in second with just over a quarter of the jobs. R comes in third place with slightly more than half the jobs available for SPSS. Compared to R or Minitab, SAS has over seven times as many jobs available!
Since 2012, job descriptions that included SAS declined by 961 (7.3%) and those containing Minitab declined by 154 (8.7%). Jobs for R increased by 497 (42%), pushing it past Minitab into third place by a slim margin. In fact, all packages except SPSS and Systat showed significant, though much smaller, absolute changes (via Holm-corrected paired t-tests; see Table 2). Since these comparisons are based on only three data points in each year, I would not put much stock in most of them, but the 42% increase for R is notable.
Given the extreme dominance of SAS, a data analyst would do well to know it unless he or she was seeking a job in a field in which one of the other packages is dominant.
Table 2. Number of jobs on Indeed.com that list each software in March of 2012 and 2013. Changes are significant for all software except SPSS and Systat.
[Since this was originally published in 2013, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]
I recently updated my plots of the data analysis tools used in academia in my ongoing article, The Popularity of Data Analysis Software. I repeat those here and update my previous forecast of data analysis software usage.
Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. As you can see in Fig. 1, the use of most analytic software is growing rapidly in academia. The only one growing slowly, very slowly, is Statistica.
Figure 1. The growth of data analysis packages with SAS and SPSS removed.
While they remain dominant, the use of SAS and SPSS has been declining rapidly in recent years. Figure 2 plots the same data, adding SAS and SPSS and dropping JMP and Statistica (and changing all colors and symbols!).
Figure 2. Scholarly use of data analysis software with SAS and SPSS added, JMP and Statistica removed.
Since Google changes its search algorithm, I re-collect all the data every year. Last year's plot (below, Fig. 3) ended with the data from 2011 and contained some notable differences. For SPSS, the 2003 data value is quite a bit lower than the value collected in the current year. If the data were not collected by a computer program, I would suspect a data entry error. In addition, the old 2011 data value for SPSS in Fig. 3 showed a marked slowing in the rate of usage decline. In the 2012 plot (above, Fig. 2), not only does the decline not slow in 2011, but both the 2011 and 2012 points continue the sharp decline of the previous few years.
Figure 3. Scholarly use of data analysis software, collected in 2011. Note how different the SPSS value for 2011 is compared to that in Fig. 2.
Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.
I forecast the use of R, SAS, SPSS and Stata five years into the future using Rob Hyndman’s forecast package and the default settings of its auto.arima function. The dip in SPSS use in 2002-2003 drove the function a bit crazy as it tried to see a repetitive up-down cycle, so I modeled the SPSS data only from its 2005 peak onward. Figure 4 shows the resulting predictions.
Figure 4. Forecast of scholarly use of the top four data analysis software packages, 2013 through 2017.
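For readers who would like to try this kind of forecast themselves, here is a minimal sketch of the approach described above; the yearly counts below are placeholders I made up, not the actual scholarly-use data.

library(forecast)
# Hypothetical yearly article counts (placeholders, not the real data)
r_use <- ts(c(1500, 2300, 3400, 5100, 7200, 9900, 13000, 17000),
            start = 2005, frequency = 1)
fit <- auto.arima(r_use)        # default settings, as in the text
plot(forecast(fit, h = 5))      # forecast five years ahead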
The forecast shows R and Stata surpassing SPSS and SAS this year (2013), with Stata coming out on top. It also shows all scholarly use of SPSS and SAS stopping in 2014 and 2015, respectively. Any forecasting book will warn you of the dangers of extrapolating too far beyond the data, and the above forecast does just that.
Guestimate Forecasting
So what will happen? Each reader probably has his or her own opinion; here's mine. The growth in R's use in scholarly work will continue for three more years, leveling off at around 25,000 articles in 2015. This growth will be driven by:
The continued rapid growth in add-on packages
The attraction of R’s powerful language
The near monopoly R has on the latest analytic methods
Its free price
The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (IBM is loosening up on this a bit)
What will slow R’s growth is its lack of a graphical user interface that:
Is powerful
Is easy to use
Provides direct cut/paste access to journal style output in word processor format
Is standard, i.e. widely accepted as The One to Use
Is open source
While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its full range of capabilities and its speed of use. Regardless of which is best, GUI users far outnumber programmers and, until this is resolved, it will limit R's long-term growth. There are GUIs for R, but there are so many to choose from that none has become the clear leader (Deducer, R Commander, Rattle, at least two from commercial companies, and still more here). If from this "GUI chaos" a clear leader were to emerge, R could continue its rapid growth and end up as the most used software.
The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata), but also by SAS Institute's self-inflicted GUI chaos. For years they have offered too many GUIs, such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use are not sure which GUI to teach, so they continue teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a respectable GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.
The use of SPSS for scholarly work will decline less sharply in 2013 and will level off in 2015 at around 27,000 articles because:
Many of the people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
Many of the people who needed more interactive visualization have already switched to JMP
The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At the University of Tennessee, where I work, such GUI users make up the great majority of SPSS users.
Although Stata is currently the fastest growing package, its growth will slow in 2013 and level off by 2015 at around 23,000 articles, leaving it in fourth place. The main cause of this will be the inertia of users of the established leaders, SPSS and SAS, as well as competition from all the other packages, most notably R. R and Stata share many strengths, and with R being free, I doubt Stata will be able to beat it in the long run.
The other packages shown in Fig. 1 will also level off around 2015, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata in fourth place.
The futures of SAS Enterprise Miner and IBM SPSS Modeler are tied to the success of each company's more mainstream product, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes. Both companies could significantly shift their future by combining their two main GUIs. Imagine a menu-and-dialog-box system that draws a simple flowchart as you work. It would be easy to learn, and users would quickly get the idea that they could manipulate the flowchart directly, increasing its window size to make more room. The flowchart GUI lets you see the big picture at a glance and lets you re-use the analysis without switching from GUI to programming, as all other GUI approaches require. Such a merger could give SAS and SPSS a game-changing edge in this competitive marketplace.
So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to do your own forecasts and add links to them in the comment section below. You can use my data or follow the detailed blog at Librestats to collect your own. One thing is certain: the coming decade in the field of analytics will be interesting indeed!
Has learning R been driving you a bit crazy? If so, it may be that you're "lost in translation." On June 17 and 19, I'll be teaching a webinar, R for SAS, SPSS and Stata Users. I'll introduce each R concept using terminology that you already know, then translate it into R's very different view of the world. You'll be following along, with hands-on practice, so that by the end of the workshop R's fundamentals should be crystal clear. The examples we'll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way, if you need more explanation later or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index list topics by both SAS/SPSS/Stata terminology and R terminology, so you can use either to find what you need.
A complete outline of the workshop plus a registration link is here. I have no artistic skills, but I’ve always been amazed at what artists can do. I taught this workshop in Knoxville on April 29, and pro photographer Steve Chastain made it look way more exciting than I recall! His view of it is here; turn your speakers up and get ready to boogie!
To start the group off I’ll teach a hands-on introductory workshop on R on April 29th from 8 a.m. to 5:00 p.m. The topics covered are described at http://r4stats.com/workshops/r4sas-spss-stata/. Note that you do not need to know SAS, SPSS or Stata, but the workshop will include numerous warnings where R works very differently from those other packages. The workshop is free and open to the Knoxville area public. UT faculty, staff and students can register at: http://oit.utk.edu/training and non-UT people can register at the user group web site: http://www.meetup.com/Knoxville-R-Users-Group. Course location, materials including slides, programs, practice data sets and exercises will be available on http://www.meetup.com/Knoxville-R-Users-Group on Saturday, April 27 (if not before).
April 1, 2013 – Although the capabilities of the R system for data analytics have been expanding with impressive speed, it has heretofore been missing important fundamental methods. A new function works with the popular plyr package to provide these missing algorithms. Function names in plyr begin with two letters which indicate their input and output. For example, with the ddply function, the first “d” in its name indicates that a data frame will be read in, and the second “d” indicates that a data frame of results will be written out. Those two letters could also be “a” for array and “l” for list, in any combination.
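For readers who have not used plyr, here is a small example of that naming convention, using R's built-in mtcars data:

library(plyr)
# ddply: read a data frame in, write a data frame out (hence the two d's)
ddply(mtcars, "cyl", summarise, mean.mpg = mean(mpg))
#   cyl mean.mpg
# 1   4 26.66364
# 2   6 19.74286
# 3   8 15.10000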
While the vast array of functions in R cover most data analysis situations, they have been completely unable to handle data that bears no actual relationship to the research questions at hand. Robert A. Muenchen, author of R for SAS and SPSS Users, has written a new ggply function, which can adroitly handle the all too popular “garbage in, garbage out” research situation. The function has only one argument, the garbage to analyze. It automatically performs the analysis strongly preferred by “gg” researchers by splitting numeric variables at the median and performing all possible cross tabulations and chi-square tests, repeated for the levels of all factors. The integration of functions from the new pbdR package allows ggply to handle even Big Garbage using 12,000 cores.
While the median split approach offers the benefit of decreasing power by 33%, further precautions are taken by applying Muenchen's new Triple Bonferroni with Backpropagation correction. This algorithm controls the garbage-wise error rate by multiplying the p-values by 3k, where k is the number of tests performed. While most experiment-wise adjustment calculations set the worst-case p-value to the theoretical upper limit of 1.0, simulations run by Muenchen indicate that this is far too liberal for this type of analysis. "By removing this artificial constraint, I have already found cases where the final p-value was as high as 3,287, indicating a very, very, very non-significant result," reported Muenchen. The "backpropagation" part of the method re-scales any p-values that might have survived the initial correction by setting them automatically to 0.06. As Muenchen states, "this level was chosen to protect the researcher from believing an actual useful result was found, while offering hope that achieving tenure might still be possible."
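In the spirit of the day, here is a toy sketch of what such a function might look like; it follows the description above (median splits, all pairwise chi-square tests, the 3k multiplier and the 0.06 reset) but is, of course, as fictional as ggply itself.

# A toy, tongue-in-cheek sketch of the (fictional) ggply described above.
ggply <- function(garbage) {
  # Split every numeric variable at its median
  garbage[] <- lapply(garbage, function(x)
    if (is.numeric(x)) ifelse(x > median(x, na.rm = TRUE), "high", "low") else x)
  # All possible pairwise cross tabulations with chi-square tests
  pairs <- combn(names(garbage), 2, simplify = FALSE)
  p <- sapply(pairs, function(v)
    suppressWarnings(chisq.test(table(garbage[[v[1]]], garbage[[v[2]]]))$p.value))
  # "Triple Bonferroni": multiply each p-value by 3k, with no upper limit of 1.0
  p.adj <- p * 3 * length(p)
  # "Backpropagation": any p-value that survives is reset to 0.06
  p.adj[p.adj < 0.05] <- 0.06
  data.frame(comparison = sapply(pairs, paste, collapse = " x "),
             p.adjusted = p.adj)
}
ggply(mtcars)   # garbage in, garbage out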
Reaction from the R community was swift and enthusiastic. Bill Venables, co-author of the popular book Modern Applied Statistics with S, said, "Muenchen's new approach for calculating Type III Sums of Squares from chi-squared tests finally puts my mind at ease about using R for statistical analysis." R programmer extraordinaire Patrick Burns said, "The ggply function is good, but what really excites me is the VBA plugin Bob wrote for Excel. Now I can fully integrate ggply into my workflow." Graphics guru Hadley Wickham, author of ggplot2: Elegant Graphics for Data Analysis, grumbled, "After writing ggplot and ddply, I'm stunned that I didn't think of ggply myself. That Muenchen fellow is constantly bugging me to add irritating new features to my packages. I have to admit though that this is a breakthrough of epic proportions. As they say in Muenchen's neck of the woods, even a blind squirrel finds a nut now and then."
The SAS Institute, already concerned with competition from R, reacted swiftly. SAS CEO Jim Goodnight said, "SAS is the leader in Big Data, and we'll soon catch up to R and become the leader in Big Garbage as well. PROC GGPLY is already in development. It will be included in SAS/GG, which is, of course, an additional-cost product."
The capability of all the software in this article has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.
Figure 10 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared = .995). Early version numbers of R increased by 0.10 while more recent ones increased by 0.01. To make the x-axis consistent, the graph simply displays the numerical order in which the versions were released. The right-most point is for version 2.15.2, the last version released in 2012.
Figure 10. Number of R packages plotted for each major release of R. The last value on the x-axis represents version 2.15.2, the final release in 2012.
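Here is a minimal sketch of the kind of quadratic fit behind Figure 10; the package counts are simulated placeholders, not the values actually plotted.

# A minimal sketch of fitting a quadratic (parabolic) growth curve;
# the counts below are simulated, not the real CRAN data.
release_order <- 1:30
pkg_count <- 50 + 10 * release_order + 4 * release_order^2 + rnorm(30, sd = 40)
fit <- lm(pkg_count ~ poly(release_order, 2, raw = TRUE))
summary(fit)$r.squared              # close to 1 for a strongly parabolic trend
plot(release_order, pkg_count)
lines(release_order, fitted(fit))   # overlay the fitted curve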
As rapid as this growth has been, the data in Figure 10 represent only the main CRAN repository. R has eight other software repositories, such as the one at http://www.bioconductor.org/, that are not included in this graph. A program run on 3/19/2013 counted 6,275 R packages at all major repositories, 4,315 of which were at CRAN. So the growth curve for the software at all repositories would be noticeably higher on the y-axis than the one shown in Figure 10, since CRAN accounts for only about 70% of those packages. As with any analysis software, individuals also maintain their own separate collections, typically available on their web sites.
To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In its most recent version, 9.3, SAS offers 100 programming statements, 258 procedures (Base, STAT, ETS, Graph, HP Forecasting, Macro, OR, QC) and 520 SAS functions and call routines, and 314 IML statements, functions and subroutines for a total of 1,192 items that are roughly equivalent to R functions. R packages contain a median of 5 functions (Rasmus Bååth, 12/2012 personal communication). Therefore R has approximately 31,375 functions compared to SAS’ 1,192. In fact, during 2012 alone, R added more functions/procs than SAS Institute has provided in its entire history! That’s 701 packages, counting only CRAN, or around 3,505 new functions in 2012.
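The back-of-the-envelope arithmetic behind those estimates is easy to reproduce in R itself:

# Rough function-count estimates used above
packages_all <- 6275    # packages across all major repositories (3/19/2013)
median_funs  <- 5       # median functions per package (Rasmus Bååth, pers. comm.)
packages_all * median_funs   # approximately 31,375 R functions
701 * median_funs            # approximately 3,505 functions added via CRAN in 2012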
Of course these R functions and SAS procedures / functions are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, giving them potentially more output per command. However, R functions can nest inside one another, creating nearly infinite combinations of output. While the comparison is not perfect, it is certainly an eye opener.
Stay tuned for future updates which will include what employers are now advertising for and recent trends in academic use of analytic software.