Blog

r4stats.com 2013 in review

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 150,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 6 days for that many people to see it.

Click here to see the complete report.

Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!

I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…

Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution R Enterprise, or SAS vs. the World Programming System (WPS) or Carolina, except when it is particularly easy such as comparing the company Pagerank figures.

These packages are all included in the first section on jobs, but later sections are older (each contains a date) and do not cover as extensive a set of software. I’ll add those as I can and announce the changes on Twitter where you can follow me as @BobMuenchen.

Introduction

When choosing a tool for data analysis, now more broadly referred to as analytics, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste into a word processor vs. LaTeX integration)? Does it handle large enough data sets?  Do your colleagues use it so you can easily share data and programs? Can you afford it?

There are many ways to measure popularity or market share and each has its advantages and disadvantages. Here they are, in approximate order of usefulness:

  • Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
  • Published Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
  • Books – the number of books that include a software’s name in the title is particularly useful information since writing one requires significant effort and publishers do their own studies of market share before taking the risk of publishing. However, it can be difficult to construct searches that find books focused only on analytics when they use general-purpose languages.
  • Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although a blog is not as time consuming to write as a book, maintaining one certainly requires effort. What makes this measure particularly easy to gather is that people like Tal Galili have created aggregation sites such as R-Bloggers.com, which make it easy to count the blogs. Previously that had been a difficult task.
  • Web Site Popularity – how does Google put the most popular search results at the top of its response to your queries? A major component of that answer comes from the total number of web pages that point to any given web site, which is known as the site’s PageRank. This is objective data, and for sites that clearly focus on analytics, it’s unbiased. However, for general-purpose software like Java, many sites that discuss programming in general point to http://www.java.com, while relatively few of them focus on analytics, and it may be impractical to tell which is which.
  • Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling,” in which the survey taker tries to distribute the link widely and vendors then vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the amount of information gathered is often low, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires but a few mouse clicks, rather than the commitment required to place an advertisement or publish an article.
  • Programming Activity – some software development is focused into repositories such as GitHub. That allows people to count the number of lines of programming code written for each project in a given time period. This is an excellent measure of popularity since writing programs or changing them requires substantial commitment.
  • Discussion Forums – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others.
  • Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
  • IT Research Firms – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software is now and where it’s headed.
  • Sales or Download Measures – the commercial analytics field has undergone a major merger and acquisition phase, so it is now hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role, and even the download figures offered by the few packages that provide them are dicey at best.
  • Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t track this measure and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.

Job Advertisements

One of the best ways to measure the popularity or market share of software for analytics is to count the number of job advertisements for each. Indeed.com is the biggest job site in the U.S. making its sample the most representative of the current job market. As their  CEO and co-founder Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” To demonstrate just how dominant its lead is, a search for SPSS (on 2/19/14) showed more than ten times as many jobs on Indeed.com as on its well-known competitor, Monster.com. Indeed.com also has superb search capabilities and it even includes a tool for tracking long-term trends.

Searching for analytics jobs using Indeed.com can be easy, but it can also be very tricky. For most analytics software, a simple search on the product’s name was enough. However, for software that’s hard to locate (e.g. R) or that is general purpose (e.g. Java), it required complex searches and/or some rather tricky calculations, which are described here. All of the graphs in this section use those procedures to make the required queries.

Figure 1a shows that Java and SAS are in a league of their own, with around 50% more analytics jobs than Python or C/C++/C#, and twice as many as R. (The three C variants are combined in a single search since job advertisements usually seek any of them.) Python and C/C++/C# come next, at an almost identical level of popularity. That’s not too surprising, as many advertisements for analytics jobs that involve programming mention both together.

Figure 1a. The number of analytics jobs for the more popular software (2/2014).

R resides in an interestingly large gap between the other domain-specific languages, SAS and SPSS. This is the first estimate I’ve done that shows that the job market for R has not only caught up with SPSS, but surpassed it by close to double the number of job postings. I knew my previous estimates for R jobs were low, but I had not yet thought of a better way to estimate the total. From SPSS on down, there’s a smooth decline. Enterprise Miner is the only data-mining-specific software to make the cutoff of at least 100 jobs. If I plotted all the software below that point, they would all pile up on the y-axis, appearing to have almost no jobs. Relatively speaking, they don’t!

Software packages that did not make that cut and are not displayed on the graph are: Alteryx (68), Statistica (67), RapidMiner (38), SPSS Modeler (36), KXEN (28), KNIME (26), Julia (15), Statgraphics (11), Systat (10), BMDP (8), Angoss (6), Lavastorm (5), NCSS (4), Salford SPM etc. (3), Teradata Miner (2) and Oracle Data Mining (2).

It’s important to note that the values shown in Figure 1a are single points in time. The number of jobs for the more popular software does not change much from day to day, but each software has an overall trend that shows how the demand for jobs changes across the years. You can plot such trends using Indeed.com’s Job Trends tool. However, as before, focusing just on analytics jobs requires carefully constructed queries, and comparing two trends at a time means they both have to fit within the query length limit that Indeed.com allows. Those details are described here.

I’m particularly interested in trends involving R, so let’s look at a couple of comparisons. Figure 1b compares the number of analytics jobs available for R and SPSS across time. Analytics jobs for SPSS have not changed much over the years, while those for R have been steadily increasing. The jobs for R finally crossed over and exceeded those for SPSS toward the middle of 2012.

Figure 1b. Analytics job trends for R and SPSS. Note that the legend labels are truncated due to the length of the query.

We know from Figure 1a that SAS is still far ahead of R in analytics job postings. How far does R have to go to catch up with SAS? Figure 1c provides one perspective. It would be nice to have the data to forecast when R’s growth curve will catch up with SAS’s, but Indeed.com does not provide the raw data. However, we can use the approximate slope of each line to get a rough estimate. If jobs for SAS stay level and those for R continue to grow linearly as they have since January 2010, then R will catch up in 3.35 years. If instead the decline in SAS jobs that began in January 2012 continues, then R will catch up in 1.87 years.
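For readers who want to reproduce that arithmetic, here is a minimal sketch in R. The levels and slopes below are placeholders of my own, not values exported from Indeed.com, since the site does not provide the raw data:

r_level   <- 0.010   # hypothetical current share of postings mentioning R
r_slope   <- 0.004   # hypothetical yearly growth in that share
sas_level <- 0.025   # hypothetical current share for SAS

# Scenario 1: SAS stays level while R grows linearly
(sas_level - r_level) / r_slope

# Scenario 2: SAS declines at a hypothetical yearly rate
sas_slope <- -0.004
(sas_level - r_level) / (r_slope - sas_slope)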

Figure 1c. Analytics job trends for R and SAS. Legend labels are truncated due to long query length.

A debate has been taking place on the Internet regarding the relative place of Python and R. Ironically, this debate about software for data analytics has involved very little actual data. However, it is now possible to at least study the job trends. Figure 1a showed us that Python is well out in front of R, at least on the single day the searches were run. What has the data looked like over time? The answer is in Figure 1d.

Figure 1d. Jobs trends for R and Python. Legend labels are truncated due to long query length.

Note that in this graph, Python appears to have a relatively slight advantage, while in Figure 1a it had a huge one. The final point on the trend graph was collected only two days after the queries used in Figure 1a, and the data changed very little in the meantime. The difference is due to the fact that Indeed.com has a limit on query length. Here is the query used for Figure 1d; it contains fewer analytics terms than the one used for Figure 1a.

R 
and ("big data"
or "statistical analysis"
or "data mining"
or "data analytics"
or "machine learning"
or "quantitative analysis"
or "business analytics"
or "statistical software"
or "predictive modeling")
!"R D" !"A R" !"H R" !"R N" 
!toys !kids !" R Walgreen" !walmart
!"HVAC R" !"R Bard" 
,
python
and ("big data"
or "statistical analysis"
or "data mining"
or "data analytics"
or "machine learning"
or "quantitative analysis"
or "business analytics"
or "statistical software"
or "predictive modeling")

The detailed description regarding the construction of all the queries used in Figures 1a through 1d is located here.

==============================================================

At this point, the rest of The Popularity of Data Analysis Software will continue, offering many additional perspectives on measuring analytics market share. However, until I update those sections in the coming months, they will not cover as broad a range of software. Stay tuned on Twitter, by following @BobMuenchen.

If you know SAS, SPSS or Stata and have not yet learned R, you can join me for this web-based workshop aimed at translating your knowledge into R. The next workshop begins on April 21. If you do know R and would like to learn more, you might enjoy taking Managing Data with R. The next time I’m offering that is on April 25.

Learn R and/or Data Management from Home January or April


If you want to learn R, or improve your current R skills, join me for two workshops that I’m offering through Revolution Analytics in January and April.

If you already know another analytics package, the workshop Intro to R for SAS, SPSS and Stata Users may be for you. With each R concept, I’ll introduce it using terminology that you already know, then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way, if you need more explanation later, or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology, so you can use either to find what you need. You can see a complete outline and register for the workshop starting January 13 (click here) or April 21 (click here).

If you already know R, but want to learn more about how you can use R to get your data ready to analyze, the workshop Managing Data with R will demonstrate how to perform the 15 most widely used data management tasks. The course outline and registration are available here for January and here for April.

If you have questions about any of these courses, drop me a line at muenchen.bob@gmail.com. I’m always available to answer questions regarding any of my books or workshops.

Knoxville R User’s Group Meeting November 1

The next meeting of the Knoxville R User’s Group will consist of four 20-minute talks followed by an open planning session. It will take place on Friday, November 1, from 2:00 p.m. to 4:00 p.m. at The University of Tennessee, Haslam Business Administration Building, room 403 (1000 Volunteer Blvd., Knoxville, TN). RSVP at http://www.meetup.com/Knoxville-R-Users-Group. The topics and biographical information regarding the speakers are listed below.

Automated Forecasting using R: A Stock Market Example (2:00-2:20)

R’s forecast package can be used to generate automated ARIMA model forecasts in a manner comparable to SAS Forecast Server. This talk will demonstrate how to use the R ‘quantmod’ package to query financial data from Yahoo Finance and then use that data with the forecast package to automatically produce point forecasts and prediction intervals. Examples of how to use each package, including diagnostic plots and results, will be included.
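As a rough idea of the workflow, here is a minimal sketch using the two packages; the ticker symbol and forecast horizon are my own choices for illustration, not taken from the talk:

library(quantmod)
library(forecast)

getSymbols("AAPL", src = "yahoo")   # downloads prices into an xts object named AAPL
prices <- Cl(AAPL)                  # extract the closing prices

fit <- auto.arima(prices)           # automated ARIMA model selection
fc  <- forecast(fit, h = 30)        # point forecasts and prediction intervals
plot(fc)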

Josh Price earned a BS and MS in statistics, both from the University of Tennessee.  While working on his Master’s, he worked as a graduate assistant for Research Computing Support.  After graduating, Josh worked for 7 years in industry as a consultant in both business and engineering. In January 2013, he returned to UT to work as a statistical consultant where he assists students, faculty, and staff with statistical aspects of their theses, dissertations and various research projects.  Josh’s current interests include programming, forecasting methods, and quantitative finance.

BioGeoBEARS: An R package for inference and model testing in historical biogeography (2:20-2:40)

Phylogenetic biogeography is traditionally concerned with the inference of ancestral geographic ranges on a phylogeny, and with inferring the history of events that led to present-day distributions. The field has been dominated for decades by debates about whether vicariance or dispersal is the dominant process. This talk will demonstrate, using BioGeoBEARS, that assumptions about the processes can be subject to statistical inference from the data, and show that founder-event speciation is a crucial process that has been left out of the current biogeography programs DIVA, LAGRANGE, and BayArea.

Nicholas J. Matzke is a Postdoctoral Fellow in Mathematical Biology at the National Institute for Mathematical and Biological Synthesis (NIMBioS, www.nimbios.org) at UT Knoxville, and a member of Brian O’Meara’s lab in the Department of Ecology and Evolutionary Biology. He is also the author of the BioGeoBEARS package.

Elevating R to Supercomputers (2:40-3:00)

The biggest supercomputing platforms in the world are distributed memory machines, but the overwhelming majority of the development of parallel R infrastructure has been devoted to small shared memory machines. Additionally, most of this development focuses on task parallelism rather than data parallelism. But as big data analytics becomes ever more attractive to both users and developers, it becomes increasingly necessary for R to add distributed computing infrastructure to support the kind of big data analytics that utilizes large distributed resources. The Programming with Big Data in R (pbdR) project aims to provide such infrastructure, elevating the R language to these massive-scale computing platforms. This talk will cover some of the early successes of the pbdR project, benchmarks, challenges, and future plans.
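For the curious, here is a toy data-parallel sketch using the pbdMPI package. It is my own illustration, not an example from the talk, and it assumes an MPI installation so that it can be launched with something like mpiexec -np 4 Rscript script.R:

library(pbdMPI)
init()

# Each MPI rank generates its own chunk of data...
local_x <- rnorm(1e6, mean = comm.rank())

# ...and a global mean is computed with reductions across all ranks
global_mean <- allreduce(sum(local_x), op = "sum") /
               allreduce(length(local_x), op = "sum")

comm.print(global_mean)   # printed once, by rank 0
finalize()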

Drew Schmidt is a researcher at the University of Tennessee’s National Institute for Computational Sciences, and is primarily interested in the intersection of mathematics, statistics, and high-performance computing.  He is co-lead developer of the Programming with Big Data in R (pbdR) project, which elevates the statistics programming language R to large distributed computing platforms.

BREAK (3:00-3:10)

Analyzing Data by Group Using R’s plyr Package (3:10-3:30)

A common data analysis task is repeating an analysis for groups within your data set. In most analytics software, this is made trivial by the addition of a single statement, such as SAS’s BY statement. However, in R you must write a function and apply it by group. That function can be simple if you’re only looking to print the results. However, if you wish to analyze those results further, you may need a series of functions to apply. We’ll go over an example of each case, showing why it goes so quickly from simple to complex. This talk will use various tools from the popular plyr package to apply the functions.
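As a preview, here is a minimal sketch of the two cases using plyr; the data frame and variable names are mine, invented for illustration:

library(plyr)

mydata <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                     score = rnorm(12))

# Simple case: just print a summary for each group
d_ply(mydata, .(group), function(d) print(summary(d$score)))

# Further analysis: return the per-group results as a data frame
ddply(mydata, .(group), summarise,
      n    = length(score),
      mean = mean(score),
      sd   = sd(score))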

Bob Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob is an Accredited Professional Statistician™ with 32 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations.  He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

Quo Vadis KRUG?  (3:30-4:00)

The Knoxville R User’s Group, or KRUG, started off with a series of workshops, but it’s well past time to discuss where KRUGgers would like to take it. How often should we meet? How long should the talks be? Is the Friday afternoon timeslot good? Is meeting at UT sufficient, or should we move the meeting around (anyone have space?)? Everything is up for discussion, so we’ll devote this final session to mulling it over.

What R Has Been Missing

While R has more methods than any other analytics software, it has been missing a crucial feature found in most other packages. SPSS Modeler had it first, way back when they still called it Clementine. Then SAS Institute realized how crucial it was to productivity and added it to Enterprise Miner. As its reputation spread, it was added to RapidMiner, Knime, Statistica, Weka, and others. An early valiant attempt was made to add it to R. What is it?  It’s the flowchart-style graphical user interface. And it will soon be available in R at last.

While menu-driven interfaces such as R Commander, Deducer or SPSS are somewhat easier to learn, the flowchart interface has two important advantages. First, you can often get a grasp of the big picture as you see steps such as separate files merging into one, or several analyses coming out of a particular data set (see figure). Second, and more important, you have a precise record of every step in your analysis. This allows you to repeat an analysis simply by changing the data inputs. In contrast, menu-driven interfaces require that you switch to the programs they create in the background if you need to automatically re-run many previous steps. That’s fine if you’re a programmer, but if you were a good programmer, you probably would not have been using that type of interface in the first place!

Alteryx’ flowchart user interface, soon to be added to Revolution R Enterprise.

This week Revolution Analytics and Alteryx announced that future versions of Revolution R Enterprise will include Alteryx’ flowchart-style graphical user interface. Alteryx has traditionally focused on the analysis of spatial data, only adding predictive analytics in 2012 (skip 37 minutes into this presentation.)  This partnership will also allow them to add Revolution’s big data features to various Alteryx products. Both companies are likely to get a significant boost in sales as a result.

While I expect both companies will benefit from this partnership, they could do much better. How? By making the Alteryx interface available for the community (free) version of R. If most R users were familiar with this interface, they would be much more likely to choose Alteryx’ tools when they needed them, instead of a competitor’s. When people needed big data tools for R, they’d be more likely to turn to Revolution Analytics. I am convinced that as great as R’s success has been, it could be greater still with a top-quality flowchart user interface that was freely available to all R users. Given the great advantages that this type of interface offers, it’s just a matter of time until a free version appears. The only question is: who will offer it?

[Update: It turns out that Alteryx is already offering a free version that works with the community version of R! See the comment from Dan Putler, product manager and one of the primary developers of Alteryx’s R-based predictive analytics and business data mining tools. I’ll be trying this out and will report my experiences in a future blog post.]

Learn R and/or Data Management from Home October 7-11


If you want to learn R, or improve your current R skills, join me for two workshops that I’m offering through Revolution Analytics in October.

If you already know another analytics package, the workshop Intro to R for SAS, SPSS and Stata Users may be for you. With each R concept, I’ll introduce it using terminology that you already know, then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way, if you need more explanation later, or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology, so you can use either to find what you need. You can see a complete outline and register for the workshop here.

If you already know R, but want to learn more about data management, the workshop Managing Data with R will demonstrate how to perform the 15 most widely used data management tasks. That course outline and registration are here.

If you have questions about any of these courses, drop me a line at muenchen.bob@gmail.com. I’m always available to answer questions regarding any of my books or workshops.


Trends in the Analytics Job Market

Tracking the job market for statistics, analytics, data mining and the like used to be a major undertaking. However, on November 10, 2011 the world’s largest web site for job postings, Indeed.com, released a tool that allows you to examine trends of your own choosing. David Smith, of Revolution Analytics, recently used this tool to compare the job markets for SAS, R, SPSS and even COBOL.

As easy as this tool is to use, some things are inherently difficult to search for. The name of the fastest growing analytics package, R, is not easy to separate from all sorts of other uses of that letter. Adding logical conditions to the search helps get a more relevant answer, but there is no perfect search for this software. For example, adding “statistics,” as David did, helps a lot, but it still includes jobs that use statistics (but not R) because the letter R matches these extremely popular job categories:

R&D = Research and Development
H.R. = Human Resources
A/R = Accounts Receivable

In studying the results of many types of searches previously, I settled on a very long query that depended on R appearing in sentences like, “the successful job applicant will have expertise in SAS, SPSS or R.” Commas are ignored in Indeed.com searches, so I used the strings “SAS R”, “R SAS”, “R or SAS”, or “SAS or R”. In addition to SAS, I used the languages Java, Minitab, Perl, Python, Ruby, SPSS, SQL and Stata. Unfortunately, Indeed’s trend tool does not allow multiple queries of that length. As a result, my final query is as follows:

“r sas” or “sas r” or “r or sas” or “sas or r” or “r spss” or “spss r” or “r or spss” or “spss or r” or “r stata” or “stata r” or “r or stata” or “stata or r” or “r minitab” or “minitab r” or “r or minitab” or “minitab or r”

Note the confusing use of the word “or”. Outside of quotes, it’s a logical specification as in: X or Y. Within quotes, however, it becomes part of the search string itself, where the job description includes the word “or”. From this point onward, when I talk about software “used for statistical purposes,” I am referring to this precise definition (substituting the package at hand into the query, of course).
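For the record, here is a small R sketch showing how such a query can be assembled programmatically; the code is my own convenience, not anything required by Indeed.com:

partners <- c("sas", "spss", "stata", "minitab")

# For each partner package, generate the four phrasings that pair it with R
phrases <- unlist(lapply(partners, function(p)
  c(paste("r", p), paste(p, "r"),
    paste("r or", p), paste(p, "or r"))))

# Quote each phrase and join them with the logical "or" that sits outside the quotes
query <- paste(shQuote(phrases, type = "cmd"), collapse = " or ")
cat(query, "\n")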

Even with this shortened search string, only two at a time would fit into Indeed’s search. Figure 1 shows the plot comparing R and SAS.

Figure 1. The percentage of job postings across time for SAS and R. Both are focused on statistical uses via complex query.

We see that there is an overall pattern of growth for SAS. However, the growth seems to have stagnated from January 2010 onward. At the most current time-point, the percentage of jobs for SAS is twice as high as for R. That 2-to-1 ratio is far smaller than I reported as recently as two months ago. Why the change? I had previously used complex logic to find R and simpler logic to find SAS. SAS is much easier to find, but by using simpler logic, I was essentially comparing R use for statistics to SAS use for all purposes. While that may sound like an irrelevant comparison, it is one that helps to show that R competes with SAS not just for statistical use, but also for its use in general data processing, report writing and related non-analytic tasks. Once a company is using SAS for report writing, they are more likely to use it for at least the fundamental statistics that come with Base SAS at no additional cost. Below is a graph (Fig. 2) comparing the search string “SAS !SATA !storage !firmware” to the complex R string from above. The exclamation point excludes terms, and the letters S.A.S. also stand for Serial Attached SCSI, which is related to computer storage hardware and firmware, not statistics. As a result, those jobs are excluded from the search.

Figure 2. Percentage of job postings across time for all uses of SAS compared to only statistical uses of R.

We see that the number of jobs for SAS is now far more dominant than before. It’s difficult to assess from the graph, but a direct job search shows that there are 9 times as many jobs for SAS in this type of comparison (11,320 vs. 1,246).

How much broader is the general market for SAS compared to that focused on statistics? A direct job search for SAS for all uses yields 4.5 times as many jobs as a search that focuses on SAS for only statistical purposes (11,162 vs. 2,456). Interestingly, a similar comparison for SPSS results in only a 1.8-fold difference (3,231 vs. 1,808) while one for Stata is only 1.4 times higher (897 vs. 620). The ratios may reflect the breadth of use each package has in business reporting rather than statistical analysis.
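Those ratios are simple divisions of the job counts quoted above; in R they amount to:

c(SAS = 11162 / 2456, SPSS = 3231 / 1808, Stata = 897 / 620)
# roughly 4.5, 1.8 and 1.4, respectively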

Comparing job openings of R to those for SPSS, both for statistical purposes, yields the plot in Figure 3.

Figure 3. Percentage of job postings across time for SPSS and R, both for statistical purposes.

We see that both SPSS and R show an overall upward trend, with R much steeper in the more recent years. The data for the most recent time period show that SPSS is still ahead, but not by a very wide margin.

Next, let us examine the trend in jobs for R and Stata (Fig. 4).

Figure 4. Percentage of job postings across time for Stata and R, both used for statistical purposes.

We see that jobs for Stata grew until mid-2010, after which they have held steady. Jobs for R have grown at a much higher and steadier rate since around January 2009. In the most recent time period, there are roughly three times as many jobs for R as for Stata.

Given the power and ease of use of Indeed.com’s trend analyzer, I plan to switch the discussion over to it in future versions of The Popularity of Data Analysis Software. I’m very interested in hearing from people who can think of better ways to search for R using Indeed.com’s job trend tool.

If you would like to learn R, or learn more about Managing Data with R, you might consider registering for the upcoming webinar that I am presenting with the help of Revolution Analytics.

(Note: All graphs and data were collected on August 5, 6, and 7, 2013)

Webinar: Managing Data with R

Before you can analyze data, it must be in the right form. Join Revolution Analytics and me this June 21st for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr and lubridate.


Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example which you have already seen.

At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!

SAS Dominates Analytics Job Market; R up 42%


I’m continuing to gather and analyze data to update The Popularity of Data Analysis Software. In this installment I cover the latest employment figures.

Employment is important to us all, so what software skills are employers seeking? A thorough answer to this question would require a time-consuming content analysis of job descriptions. However, we can get a rough idea by searching on job advertising sites. Indeed.com is the most popular job search site in the world. As their CEO and co-founder Paul Forster stated, it includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” I used a program that went there weekly and searched job descriptions for keywords such as “SPSS” or “Minitab.” This was repeated during the 2nd, 3rd and 4th weeks of March in 2012 and 2013. (The data were meant to be for the complete two years, but the automated process went awry.)

The abbreviation “SAS” is common in computer storage, so I avoided those jobs by searching for “SAS !SATA !storage !firmware” (the exclamation point represents a logical “not”). I focused on R while avoiding related topics like “R&D” by using phrases such as “R SAS” or “SAS R”, repeated for each package in the graph. The data for 2013 are presented in Figure 11.

Figure 11. Mean number of jobs per week available on Indeed.com for each software (March 2013) [last label should read “BMDP”].

SAS has a very substantial lead in job openings, with SPSS coming in second with just over a quarter of the jobs. R comes in third place with slightly more than half the jobs available for SPSS. Compared to R or Minitab, SAS has over seven times as many jobs available!

Since 2012, job descriptions that included SAS declined by 961 (7.3%) and those containing Minitab declined by 154 (8.7%). Jobs for R increased by 497 (42%), pushing it past Minitab into third place by a slim margin. In fact, all packages except SPSS and Systat showed significant, though much smaller, absolute changes (via Holm-corrected paired t-tests; see Table 2 and the sketch that follows it). Since these comparisons are based on only three data points in each year, I would not put much stock in most of them, but the 42% increase for R is notable.

Given the extreme dominance of SAS, a data analyst would do well to know it unless he or she was seeking a job in a field in which one of the other packages is dominant.

                  2012      2013   Difference  Ratio
1        SAS     13234     12272      -961      0.93
2       SPSS      3299      3289       -10      1.00
3          R      1196      1693       497      1.42
4    Minitab      1769      1615      -154      0.91
5      Stata       842       898        56      1.07
6        JMP       644       619       -25      0.96
7 Statistica        61        71        10      1.17
8     Systat        14        15         1      1.07
9       BMDP         6        10         3      1.53

Table 2. Number of jobs on Indeed.com that list each software in March of 2012 and 2013. Changes are significant for all software except SPSS and Systat.
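For readers who want to see the mechanics of those tests, here is a minimal sketch in R; the weekly counts below are placeholders of my own, not the actual values I collected:

weeks_2012 <- c(13100, 13250, 13350)   # hypothetical weekly SAS counts, 2012
weeks_2013 <- c(12200, 12300, 12320)   # hypothetical weekly SAS counts, 2013

# Paired t-test comparing the same weeks across years for one package
p_sas <- t.test(weeks_2012, weeks_2013, paired = TRUE)$p.value

# Collect one p-value per package, then adjust for multiple comparisons
p_values <- c(SAS = p_sas)             # ...and so on for the other packages
p.adjust(p_values, method = "holm")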

Forecast Update: Will 2014 be the Beginning of the End for SAS and SPSS?

[Since this was originally published in 2013, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]

I recently updated my plots of the data analysis tools used in academia in my ongoing article, The Popularity of Data Analysis Software. I repeat those here and update my previous forecast of data analysis software usage.

Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. As you can see in Fig. 1, the use of most analytic software is growing rapidly in academia. The only one growing slowly, very slowly, is Statistica.

Figure 1. The growth of data analysis packages with SAS and SPSS removed.

While they remain dominant, the use of SAS and SPSS has been declining rapidly in recent years. Figure 2 plots the same data, adding SAS and SPSS and dropping JMP and Statistica (and changing all colors and symbols!)

Figure 2. Scholarly use of data analysis software with SAS and SPSS added, JMP and Statistica removed.

Since Google changes its search algorithm, I re-collect all the data every year. Last year’s plot (below, Fig. 3) ended with the data from 2011 and contained some notable differences. For SPSS, the 2003 data value is quite a bit lower than the value collected in the current year. If the data were not collected by a computer program, I would suspect a data entry error. In addition, the old 2011 data value for SPSS in Fig. 3 showed a marked slowing in the rate of usage decline. In the 2012 plot (above, Fig. 2), not only does the decline not slow in 2011, but both the 2011 and 2012 points continue the sharp decline of the previous few years.

Figure 3. Scholarly use of data analysis software, collected in 2011. Note how different the SPSS value for 2011 is compared to that in Fig. 2.

Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.

Here is the data from Google Scholar:

           R    SAS    SPSS  Stata
1995       7   9120    7310     24
1996       4   9130    8560     92
1997       9  10600   11400    214
1998      16  11400   17900    333
1999      25  13100   29000    512
2000      51  17300   50500    785
2001     155  20900   78300    969
2002     286  26400   66200   1260
2003     639  36300   43500   1720
2004    1220  45700  156000   2350
2005    2210  55100  171000   2980
2006    3420  60400  169000   3940
2007    5070  61900  167000   4900
2008    7000  63100  155000   6150
2009    9320  60400  136000   7530
2010   11500  52000  109000   8890
2011   13600  44800   74900  10900
2012   17000  33500   49400  14700

ARIMA Forecasting

I forecast the use of R, SAS, SPSS and Stata five years into the future using Rob Hyndman’s forecast package and the default settings of its auto.arima function. The dip in SPSS use in 2002-2003 drove the function a bit crazy as it tried to see a repetitive up-down cycle, so I modeled the SPSS data only from its 2005 peak onward.  Figure 4 shows the resulting predictions.

Figure 4. Forecast of scholarly use of the top four data analysis software packages, 2013 through 2017.
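Here is a minimal sketch of that forecasting step, assuming the Google Scholar counts above are in a data frame named scholar with columns year, R, SAS, SPSS and Stata (the object name is mine, not from the original script):

library(forecast)

r_ts <- ts(scholar$R, start = 1995)        # yearly Google Scholar hits for R
fit  <- auto.arima(r_ts)                   # default settings
plot(forecast(fit, h = 5))                 # forecast 2013 through 2017

# SPSS modeled only from its 2005 peak onward
spss_ts <- ts(scholar$SPSS[scholar$year >= 2005], start = 2005)
plot(forecast(auto.arima(spss_ts), h = 5))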

The forecast shows R and Stata surpassing SPSS and SAS this year (2013), with Stata coming out on top. It also shows all scholarly use of SPSS and SAS stopping in 2014 and 2015, respectively. Any forecasting book will warn you of the dangers of looking too far beyond the data, and the above forecast does just that.

Guestimate Forecasting

So what will happen? Each reader probably has his or her own opinion; here’s mine. The growth in R’s use in scholarly work will continue for three more years, at which point it will level off at around 25,000 articles in 2015. This growth will be driven by:

  • The continued rapid growth in add-on packages
  • The attraction of R’s powerful language
  • The near monopoly R has on the latest analytic methods
  • Its free price
  • The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (IBM is loosening up on this a bit)

What will slow R’s growth is its lack of a graphical user interface that:

  • Is powerful
  • Is easy to use
  • Provides direct cut/paste access to journal style output in word processor format
  • Is standard, i.e. widely accepted as The One to Use
  • Is open source

While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its full range of capabilities and its speed of use. Regardless of which is best, GUI users far outnumber programmers and, until this is resolved, it will limit R’s long-term growth. There are GUIs for R, but with so many to choose from, none has become the clear leader (Deducer, R Commander, Rattle, at least two from commercial companies, and still more here). If from this “GUI chaos” a clear leader were to emerge, then R could continue its rapid growth and end up as the most used software.

The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata), but also by SAS Institute’s self-inflicted GUI chaos. For years they have offered too many GUIs, such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use are not sure which GUI to teach, so they continue teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a respectable GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.

The use of SPSS for scholarly work will decline less sharply in 2013 and will level off in 2015 at around 27,000 articles because:

  • Many of the people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
  • Many of the people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
  • Many of the people who needed more interactive visualization have already switched to JMP

The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.

Although Stata is currently the fastest growing package, its growth will slow in 2013 and level off by 2015 at around 23,000 articles, leaving it in fourth place. The main cause of this will be the inertia of users of the established leaders, SPSS and SAS, as well as competition from all the other packages, most notably R. R and Stata share many strengths, and with one of them being free, I doubt Stata will be able to beat R in the long run.

The other packages shown in Fig. 1 will also level off around 2015, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata for fourth place.

The futures of SAS Enterprise Miner and IBM SPSS Modeler are tied to the success of each company’s more mainstream products, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes. Both companies could significantly shift their future by combining their two main GUIs. Imagine a menu and dialog-box system that draws a simple flowchart as you do things. It would be easy to learn, and users would quickly get the idea that they could manipulate the flowchart directly, increasing its window size to make more room. The flowchart GUI lets you see the big picture at a glance and lets you re-use the analysis without switching from GUI to programming, as all other GUI methods require. Such a merger could give SAS and SPSS a game-changing edge in this competitive marketplace.

So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to do your own forecasts and add links to them in the comment section below. You can use my data or follow the detailed blog at Librestats to collect your own. One thing is certain: the coming decade in the field of analytics will be interesting indeed!