Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!

I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…

Abstract: This article presents various ways of measuring the popularity or market share of software for analytics, including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution R Enterprise, or SAS vs. the World Programming System (WPS) or Carolina, except when it is particularly easy, such as comparing the companies’ PageRank figures.

These packages are all included in the first section on jobs, but later sections are older (each contains a date) and do not cover as extensive a set of software. I’ll update those as I can and announce the changes on Twitter, where you can follow me as @BobMuenchen.

Introduction

When choosing a tool for data analysis, now more broadly referred to as analytics, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste into a word processor vs. LaTeX integration)? Does it handle large enough data sets?  Do your colleagues use it so you can easily share data and programs? Can you afford it?

There are many ways to measure popularity or market share and each has its advantages and disadvantages. Here they are, in approximate order of usefulness:

  • Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
  • Published Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
  • Books – the number of books that include a software’s name in the title is particularly useful information, since writing one requires significant effort and publishers do their own study of market share before taking the risk of publishing. However, it can be difficult to construct searches that find books which use general-purpose languages but focus only on analytics.
  • Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although a post is not as time consuming to write as a book, maintaining a blog certainly requires effort. What makes this measure particularly easy to gather is that consolidation sites such as Tal Galili’s R-Bloggers.com make it easy to count the blogs, a task that had previously been difficult.
  • Web Site Popularity – how does Google provide the most popular search results at the top of its response to your queries? A major component of that answer comes from the total number of web pages that point to any given web site, known as the site’s PageRank. This is objective data, and for sites that clearly focus on analytics, it’s unbiased. However, for general-purpose software like Java, many of the sites that point to http://www.java.com discuss programming rather than analytics, and it is impractical to tell which is which.
  • Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling,” in which the survey taker tries to distribute the link widely and vendors then vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the amount of information collected is often low, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires but a few mouse clicks, rather than the commitment required to place an advertisement or publish an article.
  • Programming Activity – some software development is focused into repositories such as GitHub, which makes it possible to count the number of lines of programming code written for each project in a given time period. This is an excellent measure of popularity since writing programs or changing them requires substantial commitment.
  • Discussion Forums – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others.
  • Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
  • IT Research Firms – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software is now and where it’s headed.
  • Sales or Download Measures – the commercial analytics field has undergone a major merger-and-acquisition phase, so it is now hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role here, and even for the few packages that offer download figures, those numbers are dicey at best.
  • Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t track this measure and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.

Job Advertisements

One of the best ways to measure the popularity or market share of software for analytics is to count the number of job advertisements for each. Indeed.com is the biggest job site in the U.S. making its sample the most representative of the current job market. As their  CEO and co-founder Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” To demonstrate just how dominant its lead is, a search for SPSS (on 2/19/14) showed more than ten times as many jobs on Indeed.com as on its well-known competitor, Monster.com. Indeed.com also has superb search capabilities and it even includes a tool for tracking long-term trends.

Searching for analytics jobs using Indeed.com can be easy, but it can also be very tricky. For most analytics software, a simple search on the product name suffices. However, for software whose name is hard to isolate (e.g. R) or that is general purpose (e.g. Java), complex searches and/or some rather tricky calculations are required; those are described here. All of the graphs in this section use those procedures to make the required queries.

Figure 1a shows that Java and SAS are in a league of their own, with around 50% more analytics jobs than Python or C/C++/C#, and twice as many as R. (The three C variants are combined in a single search since job advertisements usually seek any of them.) Python and C/C++/C# come next at almost identical levels of popularity. That’s not too surprising, as many advertisements for analytics jobs that involve programming mention both together.

Figure 1a. The number of analytics jobs for the more popular software (2/2014).

R resides in an interestingly large gap between the other domain-specific languages, SAS and SPSS. This is the first estimate I’ve done that shows that the job market for R has not only caught up with SPSS, but surpassed it by close to double the number of job postings. I knew my previous estimates for R jobs were low, but I had not yet thought of a better way to estimate the total. From SPSS on down, there’s a smooth decline. Enterprise Miner is the only data-mining-specific software to make the cutoff of at least 100 jobs. If I plotted all the software below that point, they would all pile up on the y-axis, appearing to have almost no jobs. Relatively speaking, they don’t!

Software that did not make that cut and are not displayed on the graph are: Alteryx (68), Statistica (67), RapidMiner (38), SPSS Modeler (36), KXEN (28), KNIME (26), Julia (15), Statgraphics (11), Systat (10), BMDP (8), Angoss (6), Lavastorm (5), NCSS (4), Salford SPM etc. (3), Teradata Miner (2) and Oracle Data Mining (2).

It’s important to note that the values shown in Figure 1a are single points in time. The number of jobs for the more popular software does not change much from day to day, but each software has an overall trend that shows how the demand for jobs changes across the years. You can plot such trends using Indeed.com’s Job Trends tool. However, as before, focusing just on analytics jobs requires carefully constructed queries, and comparing two trends at a time means both queries have to fit within Indeed.com’s query length limit. Those details are described here.

I’m particularly interested in trends involving R, so let’s look at a couple of comparisons. Figure 1b compares the number of analytics jobs available for R and SPSS across time. Analytics jobs for SPSS have not changed much over the years, while those for R have been steadily increasing. The jobs for R finally crossed over and exceeded those for SPSS toward the middle of 2012.

Figure 1b. Analytics job trends for R and SPSS. Note that the legend labels are truncated due to the length of the query.

We know from Figure 1a that SAS is still far ahead of R in analytics job postings. How far does R have to go to catch up with SAS? Figure 1c provides one perspective. It would be nice to have the data to forecast when R’s growth curve will catch up with SAS’s, but Indeed.com does not provide the raw data. However, we can use the approximate slope of each line to get a rough estimate (a sketch of the arithmetic appears after Figure 1c). If jobs for SAS stay level and those for R continue to grow linearly as they have since January 2010, then R will catch up in 3.35 years. If instead the decline in demand for SAS jobs that began in January 2012 continues, then R will catch up in 1.87 years.

Figure 1c. Analytics job trends for R and SAS. Legend labels are truncated due to the length of the query.
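To make that catch-up arithmetic concrete, here is a minimal sketch of the calculation. The level and slope values are hypothetical placeholders, since Indeed.com does not release the raw data:

# Hypothetical current levels (% of postings) and yearly slopes:
r.now     <- 0.010
r.slope   <- 0.004
sas.now   <- 0.024

# Scenario 1: SAS stays level.
(sas.now - r.now) / r.slope                  # years until R catches up

# Scenario 2: SAS declines while R keeps growing.
sas.slope <- -0.002
(sas.now - r.now) / (r.slope - sas.slope)    # fewer years to catch up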

A debate has been taking place on the Internet regarding the relative places of Python and R. Ironically, this debate about software for data analytics has involved very little actual data. However, it is now possible to at least study the job trends. Figure 1a showed us that Python is well out in front of R, at least on the single day those searches were run. What has the data looked like over time? The answer is in Figure 1d.

Figure 1d. Job trends for R and Python (2/22/14). Legend labels are truncated due to the length of the query.

Note that in this graph, Python appears to have a relatively slight advantage, while in Figure 1a it had a huge one. The final point on the trend graph was collected only two days after the queries used in Figure 1a, and the data changed very little in the meantime. The difference is due to the fact that Indeed.com limits the length of queries. Here is the query used for Figure 1d; it contains fewer analytics terms than the queries used for Figure 1a.

R 
and ("big data"
or "statistical analysis"
or "data mining"
or "data analytics"
or "machine learning"
or "quantitative analysis"
or "business analytics"
or "statistical software"
or "predictive modeling")
!"R D" !"A R" !"H R" !"R N" 
!toys !kids !" R Walgreen" !walmart
!"HVAC R" !"R Bard" 
,
python
and ("big data"
or "statistical analysis"
or "data mining"
or "data analytics"
or "machine learning"
or "quantitative analysis"
or "business analytics"
or "statistical software"
or "predictive modeling")

The detailed description regarding the construction of all the queries used in Figures 1a through 1d is located here.

==============================================================

At this point, the rest of The Popularity of Data Analysis Software will continue, offering many additional perspectives on measuring analytics market share. However, until I update those sections in the coming months, they will not cover as broad a range of software. Stay tuned on Twitter by following @BobMuenchen.

If you know SAS, SPSS or Stata and have not yet learned R, you can join me for this web-based workshop aimed at translating your knowledge into R. The next workshop begins on April 21. If you do know R and would like to learn more, you might enjoy taking Managing Data with R. The next time I’m offering that is on April 25.

Knoxville R User’s Group Meeting November 1

The next meeting of the Knoxville R User’s Group will consist of four 20-minute talks followed by an open planning session. It will take place on Friday, November 1, from 2:00 p.m. to 4:00 p.m. at The University of Tennessee, Haslam Business Administration Building, room 403 (1000 Volunteer Blvd., Knoxville, TN). RSVP at http://www.meetup.com/Knoxville-R-Users-Group. The topics and biographical information regarding the speakers are listed below.

Automated Forecasting using R: A Stock Market Example (2:00-2:20)

R’s forecast package can be used to generate automated ARIMA forecasts in a manner comparable to SAS Forecast Server. This talk will demonstrate how to use the R ‘quantmod’ package to query financial data from Yahoo Finance and then use that data with the forecast package to automatically produce point forecasts and prediction intervals. Examples of how to use each package, including diagnostic plots and results, will be included.
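For a flavor of what that workflow looks like, here is a minimal sketch using those two packages; the ticker symbol and forecast horizon are arbitrary choices, not taken from the talk itself:

library(quantmod)
library(forecast)

getSymbols("AAPL", src = "yahoo")   # fetch prices from Yahoo Finance into "AAPL"
prices <- Cl(AAPL)                  # keep just the closing prices
fit <- auto.arima(as.ts(prices))    # automated ARIMA model selection
fc  <- forecast(fit, h = 30)        # 30 point forecasts plus prediction intervals
plot(fc)                            # plot the series, forecasts and intervals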

Josh Price earned a BS and MS in statistics, both from the University of Tennessee.  While working on his Master’s, he worked as a graduate assistant for Research Computing Support.  After graduating, Josh worked for 7 years in industry as a consultant in both business and engineering. In January 2013, he returned to UT to work as a statistical consultant where he assists students, faculty, and staff with statistical aspects of their theses, dissertations and various research projects.  Josh’s current interests include programming, forecasting methods, and quantitative finance.

BioGeoBEARS: An R package for inference and model testing in historical biogeography (2:20-2:40)

Phylogenetic biogeography is traditionally concerned with the inference of ancestral geographic ranges on a phylogeny and with inferring the history of events that led to present-day distributions. The field has been dominated for decades by debates about whether vicariance or dispersal is the dominant process. This talk will demonstrate, using BioGeoBEARS, that assumptions about the processes can be subjected to statistical inference from the data, and show that founder-event speciation is a crucial process that has been left out of the current biogeography programs DIVA, LAGRANGE, and BayArea.

Nicholas J. Matzke is a Postdoctoral Fellow in Mathematical Biology at the National Institute for Mathematical and Biological Synthesis (NIMBioS, www.nimbios.org) at UT Knoxville, and a member of Brian O’Meara’s lab in the Department of Ecology and Evolutionary Biology. He is also the author of the BioGeoBEARS package.

Elevating R to Supercomputers (2:40-3:00)

The biggest supercomputing platforms in the world are distributed memory machines, but the overwhelming majority of the development of parallel R infrastructure has been devoted to small shared memory machines. Additionally, most of this development focuses on task parallelism rather than data parallelism. But as big data analytics becomes ever more attractive to both users and developers, it becomes increasingly necessary for R to add the distributed computing infrastructure needed to support analytics that utilize large distributed resources.  The Programming with Big Data in R (pbdR) project aims to provide such infrastructure, elevating the R language to these massive-scale computing platforms.  This talk will cover some of the early successes of the pbdR project, benchmarks, challenges, and future plans.
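For readers curious what pbdR code looks like, here is a minimal “hello world” sketch using the project’s pbdMPI package. It assumes a working MPI installation and is launched in batch, e.g. mpiexec -np 4 Rscript hello.R, rather than interactively:

library(pbdMPI)
init()                                    # start up the MPI communicator
comm.print(comm.rank(), all.rank = TRUE)  # every process prints its rank
finalize()                                # shut MPI down cleanly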

Drew Schmidt is a researcher at the University of Tennessee’s National Institute for Computational Sciences, and is primarily interested in the intersection of mathematics, statistics, and high-performance computing.  He is co-lead developer of the Programming with Big Data in R (pbdR) project, which elevates the statistics programming language R to large distributed computing platforms.

BREAK (3:00-3:10)

Analyzing Data by Group Using R’s plyr Package (3:10-3:30)

A common data analysis task is repeating an analysis for groups within your data set. In most analytics software this is made trivial by a single statement, such as SAS’s BY statement.  However, in R you must write a function and apply it by group. That function can be simple if all you want is to print the results. However, if you wish to analyze those results further, you may need a series of functions to apply. We’ll go over an example of each case, showing why the task goes so quickly from simple to complex. This talk will use various tools from the popular plyr package to apply the functions.
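As a preview, here is a minimal sketch of the simple case, using plyr’s ddply function and R’s built-in iris data (the talk’s own examples may differ):

library(plyr)
# Data frame in, data frame out: one row of summary statistics per Species.
ddply(iris, "Species", function(df)
  data.frame(mean.sepal = mean(df$Sepal.Length),
             sd.sepal   = sd(df$Sepal.Length)))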

Bob Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob is an Accredited Professional Statistician™ with 32 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations.  He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

Quo Vadis KRUG?  (3:30-4:00)

The Knoxville R User’s Group, or KRUG, started off with a series of workshops, but it’s well past time to discuss where KRUGgers would like to take it. How often should we meet?  How long should the talks be?  Is the Friday afternoon timeslot good?  Is meeting at UT sufficient, or should we move the meeting around (does anyone have space)?  Everything is up for discussion, so we’ll devote this final session to mulling it over.

Trends in the Analytics Job Market

Tracking the job market for statistics, analytics, data mining and the like used to be a major undertaking. However, on November 10, 2011 the world’s largest web site for job postings, Indeed.com, released a tool that allows you to examine trends of your own choosing. David Smith, of Revolution Analytics, recently used this tool to compare the job markets for SAS, R, SPSS and even COBOL.

As easy as this tool is to use, some things are inherently difficult to search for. The name of the fastest growing analytics package, R, is not easy to separate from all sorts of other uses of that letter. Adding logical conditions to the search will help get a more relevant answer, but there is no perfect search for this software. For example, adding “statistics,” as David did, helps a lot, but the results still include jobs that use statistics (but not R) in these extremely popular job categories:

R&D = Research and Development
H.R. = Human Resources
A/R = Accounts Receivable

In studying the results of many types of searches previously, I settled on a very long query that depended on R appearing in sentences like, “the successful job applicant will have expertise in SAS, SPSS or R.” Commas are ignored in Indeed.com searches, so I used the strings “SAS R”, “R SAS”, “R or SAS”, or “SAS or R”. In addition to SAS, I paired R in this way with the languages Java, Minitab, Perl, Python, Ruby, SPSS, SQL and Stata. Unfortunately, Indeed’s trend tool does not allow multiple long queries. As a result, my final query is as follows:

“r sas” or “sas r” or “r or sas” or “sas or r” or “r spss” or “spss r” or “r or spss” or “spss or r” or “r stata” or “stata r” or “r or stata” or “stata or r” or “r minitab” or “minitab r” or “r or minitab” or “minitab or r”

Note the potentially confusing use of the word “or”. Outside of quotes, it’s a logical operator, as in: X or Y. Within quotes, however, it becomes part of the search string itself, matching job descriptions that include the word “or”. From this point onward, when I talk about software “used for statistical purposes,” I am referring to this precise definition (substituting the package at hand into the query, of course).

Even with this shortened search string, only two packages at a time would fit into Indeed’s search. Figure 1 shows the plot comparing R and SAS.

Figure 1. The percentage of job postings across time for SAS and R. Both are focused on statistical uses via complex query.

We see that there is an overall pattern of growth for SAS. However, the growth seems to have stagnated from January 2010 onward. At the most recent time point, the percentage of jobs for SAS is twice as high as for R. That 2-to-1 ratio is far smaller than I reported as recently as two months ago. Why the change? I had previously used complex logic to find R and simpler logic to find SAS. SAS is much easier to find, but by using simpler logic I was essentially comparing R use for statistics to SAS use for all purposes. While that may sound like an irrelevant comparison, it is one that helps to show that R competes with SAS not just for statistical use, but also for use in general data processing, report writing and related non-analytic tasks. Once a company is using SAS for report writing, it is more likely to use it for at least the fundamental statistics that come with Base SAS at no additional cost. Below is a graph (Fig. 2) comparing the search string “SAS !SATA !storage !firmware” to the complex R string from above. The exclamation point excludes terms; the letters SAS also stand for Serial Attached SCSI, a storage technology related to computer hardware, not statistics, so those jobs are excluded from the search.

Figure 2. Percentage of job postings across time for all uses of SAS compared to only statistical uses of R.

We see that the number of jobs for SAS is now far more dominant than before. It’s difficult to assess from the graph, but a direct job search shows 9 times as many jobs in this type of comparison (11,320 vs. 1,246).

How much broader is the general market for SAS compared to the one focused on statistics? A direct job search for SAS for all uses yields 4.5 times as many jobs as a search that focuses on SAS for only statistical purposes (11,162 vs. 2,456). Interestingly, a similar comparison for SPSS results in only a 1.8-fold difference (3,231 vs. 1,808), while the one for Stata is only 1.4 times higher (897 vs. 620). The ratios may reflect the breadth of use each package has in business reporting rather than statistical analysis.
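For the record, those ratios are just simple divisions of the direct search counts:

# All uses vs. statistical uses only, from the searches quoted above:
11162 / 2456   # SAS:   about 4.5
3231  / 1808   # SPSS:  about 1.8
897   / 620    # Stata: about 1.4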

Comparing job openings of R to those for SPSS, both for statistical purposes, yields the plot in Figure 3.

Figure 3. Percentage of job postings across time for SPSS and R, both for statistical purposes.

We see that both SPSS and R show an overall upward trend, with R much steeper in the more recent years. The data for the most recent time period show that SPSS is still ahead, but not by a very wide margin.

Next, let us examine the trend in jobs for R and Stata (Fig. 4).

Figure 4. Percentage of job postings across time for Stata and R, both used for statistical purposes.

We see that the jobs for Stata grew until mid-2010 and have held steady since. Jobs for R have grown at a much higher and steadier rate since around January 2009. In the most recent time period, there are roughly three times as many jobs for R as for Stata.

Given the power and ease of use of Indeed.com’s trend analyzer, I plan to switch the discussion over to it in future versions of The Popularity of Data Analysis Software. I’m very interested in hearing from people who can think of better ways to search for R using Indeed.com’s job trend tool.

If you would like to learn more about R, you might consider registering for the upcoming webinar, Managing Data with R, that I am presenting with the help of Revolution Analytics.

(Note: All graphs and data were collected on August 5, 6, and 7, 2013)

Webinar: Managing Data with R

Before you can analyze data, it must be in the right form. Join Revolution Analytics and me this June 21st for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr and lubridate.
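As a tiny taste of two of those packages, here is a sketch (not one of the workshop’s actual examples) that cleans up a messy date string and parses it:

library(stringr)
library(lubridate)
str_trim("  June 21, 2013  ")   # stringr: strip the stray spaces
mdy("June 21, 2013")            # lubridate: parse month-day-year text into a Date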


Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example which you have already seen.

At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!

Knoxville R Users Group Formed, Free Training Offered

R is popular free and open-source software for graphics and data analytics. The Knoxville R Users Group is being formed to help people learn R and improve their skills with it. Three departments of The University of Tennessee are working together to get it started: the Office of Information Technology, the National Institute for Computational Science’s RDAV group (Remote Data Analysis and Visualization) and the Department of Statistics, Operations, and Management Science. The latter’s Business Analytics program was recently ranked among the top 20 such departments in the U.S.

To start the group off I’ll teach a hands-on introductory workshop on R on April 29th from 8:00 a.m. to 5:00 p.m.  The topics covered are described at http://r4stats.com/workshops/r4sas-spss-stata/. Note that you do not need to know SAS, SPSS or Stata, but the workshop will include numerous warnings where R works very differently from those other packages. The workshop is free and open to the Knoxville area public. UT faculty, staff and students can register at: http://oit.utk.edu/training and non-UT people can register at the user group web site: http://www.meetup.com/Knoxville-R-Users-Group. The course location and materials, including slides, programs, practice data sets and exercises, will be available on http://www.meetup.com/Knoxville-R-Users-Group on Saturday, April 27 (if not before).

R Tackles Big Garbage

April 1, 2013 – Although the capabilities of the R system for data analytics have been expanding with impressive speed, it has heretofore been missing important fundamental methods. A new function works with the popular plyr package to provide these missing algorithms. Function names in plyr begin with two letters which indicate their input and output. For example, with the ddply function, the first “d” in its name indicates that a data frame will be read in, and the second “d” indicates that a data frame of results will be written out. Those two letters could also be “a” for array and “l” for list, in any combination.
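That naming convention is quite real; here is a brief sketch using the genuine plyr functions and R’s built-in iris data:

library(plyr)
# ddply: data frame in, data frame out.
ddply(iris, "Species", summarise, mean.petal = mean(Petal.Length))
# dlply: data frame in, list out (one regression per Species).
models <- dlply(iris, "Species",
                function(df) lm(Petal.Length ~ Sepal.Length, data = df))
# ldply: list in, data frame out (one row of coefficients per group).
ldply(models, coef)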

While the vast array of functions in R cover most data analysis situations, they have been completely unable to handle data that bears no actual relationship to the research questions at hand. Robert A. Muenchen, author of R for SAS and SPSS Users, has written a new ggply function, which can adroitly handle the all too popular “garbage in, garbage out” research situation. The function has only one argument, the garbage to analyze. It automatically performs the analysis strongly preferred by “gg” researchers by splitting numeric variables at the median and performing all possible cross tabulations and chi-square tests, repeated for the levels of all factors. The integration of functions from the new pbdR package allows ggply to handle even Big Garbage using 12,000 cores.

While the median split approach offers the benefit of decreasing power by 33%, further precautions are taken by applying Muenchen’s new Triple Bonferroni with Backpropagation correction. This algorithm controls the garbage-wise error rate by multiplying the p-values by 3k, where k is the number of tests performed. While most experiment-wise adjustment calculations set the worst case p-value to the theoretical upper limit of 1.0, simulations run by Muenchen indicate that this is far too liberal for this type of analysis. “By removing this artificial constraint, I have already found cases where the final p-value was as high as 3,287 indicating a very, very, very non-significant result” reported Muenchen. The “backpropagation” part of the method re-scales any p-values that might have survived the initial correction by setting them automatically to 0.06. As Muenchen states, “this level was chosen to protect the researcher from believing an actual useful result was found, while offering hope that achieving tenure might still be possible.”

Reaction from the R community was swift and enthusiastic. Bill Venables, co-author of the popular book Modern Applied Statistics with S said, “Muenchen’s new approach for calculating Type III Sums of Squares from chi-squared tests finally puts my mind at ease about using R for statistical analysis.” R programmer extraordinaire Patrick Burns said, “The ggply function is good, but what really excites me is the VBA plugin Bob wrote for Excel. Now I can fully integrate ggply into my workflow.” Graphics guru Hadley Wickham, author of ggplot2: Elegant Graphics for Data Analysis grumbled, “After writing ggplot and ddply, I’m stunned that I didn’t think of ggply myself. That Muenchen fellow is constantly bugging me to add irritating new features to my packages. I have to admit though that this is a breakthrough of epic proportions. As they say in Muenchen’s neck of the woods, even a blind squirrel finds a nut now and then.”

The SAS Institute, already concerned with competition from R, reacted swiftly. SAS CEO Jim Goodnight said, “SAS is the leader in Big Data, and we’ll soon catch up to R and become the leader in Big Garbage as well. PROC GGPLY is already in development. It will be included in SAS/GG, which is, of course, an additional-cost product.”

Comparing Transformation Styles: attach, transform, mutate and within

There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!

Here’s a data frame with one variable:

> mydata <- data.frame(x = 1:5)
> mydata

  x
1 1
2 2
3 3
4 4
5 5

Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. The simplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:

> mydata.new <- mydata

> mydata.new$x2 <- mydata.new$x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

That works, but I had to type more characters for the “mydata.new” part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use the attach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x” instead of “mydata$x”. However, the attach function is tricky to use. Here’s the most common mistake made by beginners:

> mydata.new <- mydata
> attach(mydata.new)
> x2 <- x  ^ 2 
> x3 <- x2 + 100

> mydata.new
  x
1 1
2 2
3 3
4 4
5 5

There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:

> ls()
[1] "mydata" "mydata.new" "x2" "x3"

> x2; x3

[1]  1  4  9 16 25
[1] 101 104 109 116 125

I’ll fix that, but first I’ll remove x2 and x3 from the workspace and detach mydata.new so we can start fresh.

> rm(x2, x3)
> detach(mydata.new)

We can fix this problem by directing new variables into the data frame using dollar format. So here’s the next thing a beginner is likely to try:

> mydata.new <- mydata
> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- x2 + 100
Error: object 'x2' not found

> detach(mydata.new)

The variable x2 got created and put into mydata.new. However, when the attempt to create x3 ran, variable x2 could not be found. This is because the attached version of the data is a copy made in the past; it is not a live connection. Therefore, to refer to it simply as “x2” you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:

> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

> detach(mydata.new)

That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.

The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.

> mydata.new <- transform(mydata, x2 = x ^ 2)

> mydata.new

  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-”.

Eliminating the tedious repetition of “mydata$…” makes the formulas easier to enter, read and debug. However, the transform function has a problem: it is unable to use a variable that it just created. For example:

> mydata.new <- transform(mydata,
+                 x2 = x  ^ 2,
+                 x3 = x2 + 100
+                 )

Error in eval(expr, envir, enclos) : object 'x2' not found

We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.

In the above code, note the comma between the two equations. Since the equations are the values of transform’s arguments, each equation must be followed by a comma, except for the last one, which is followed by the final close parenthesis.
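In code, the two-step workaround mentioned above is simply:

> mydata.new <- transform(mydata,     x2 = x ^ 2)
> mydata.new <- transform(mydata.new, x3 = x2 + 100)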

Hadley Wickham’s dplyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:

> library("dplyr")
> mydata.new <- mutate(mydata,
+                 x2 = x ^ 2,
+                 x3 = x2 + 100
+                 )

> mydata.new 

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

However, mutate does have a limitation: it cannot re-create a variable that it just created. So you can use its new variables only on the right-hand side of your equations. In this next example, rather than create x3, I’ll continue to use the name x2:

> mydata.new <- mutate(mydata,
+                 x2 = x  ^ 2, 
+                 x2 = x2 + 100
+                 )

> mydata.new

  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you are recoding a variable using the ifelse function (albeit inefficiently) this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)
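For example, a recode along these lines (the cutoffs are arbitrary) creates the variable only once by nesting the ifelse calls:

> mydata.new <- mutate(mydata,
+                 size = ifelse(x <= 2, "small",
+                        ifelse(x <= 4, "medium", "large"))
+                 )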

Finally, we come to the within function. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:

> mydata.new <- within(mydata, {
+              x2 <- x  ^ 2
+              x3 <- x2 + 100
+              } )

> mydata.new

  x  x3 x2
1 1 101  1
2 2 104  4
3 3 109  9
4 4 116 16
5 5 125 25

Notice that we’re back to using the assignment operator “<-” and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.

When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:

> mydata.new <- within(mydata, {
+               x2 <- x  ^ 2
+               x2 <- x2 + 100
+               } )

> mydata.new

  x  x2
1 1 101
2 2 104
3 3 109
4 4 116
5 5 125

Since the within function does this example so well, why use anything else? The mutate function shares syntax with dplyr’s summarise function, and their combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate for this type of task and remember not to transform a variable that I just created!
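Here is a minimal sketch of that pairing; the tiny data frame and the grouping variable g are hypothetical:

> library("dplyr")
> mydata2 <- data.frame(g = c("a", "a", "b", "b"), x = 1:4)
> by.g <- group_by(mydata2, g)
> mutate(by.g, x.centered = x - mean(x))  # keeps every row, adds a column
> summarise(by.g, x.mean = mean(x))       # collapses to one row per group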

That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.

R for SAS, SPSS, Stata Users Workshop Redesigned

My workshop R for SAS, SPSS and Stata Users has been popular over the years, but it’s time for an overhaul. A common request has been to simplify it, so I have moved data management to a separate 4-hour workshop, Managing Data with R. This makes it much easier to absorb the basics in the remaining two 4-hour sessions. When you’re ready for more, you can take the other workshop, which I’ll be offering several times per year. Detailed course outlines are available at the workshop links above and at the Revolution Analytics web site.

Specifying Variables in R

R has several ways to specify which variables to use in an analysis. Some of the most frustrating errors can result from not understanding the order in which R searches for variables. This post demonstrates that order, hopefully smoothing your future use of R.

If all your variables are vectors in your workspace, using them in an analysis is easy: simply name them. For example, you could build a linear model (regression) using the lm function like this:

lm(y ~ x)

However, data frames exist for a good reason. They help organize variables and keep the values of each observation (the rows) locked together. For example, when you sort a data frame, all the rows of a data frame are moved, not just the single variable you’re sorting on. Once variables are stored in a data frame, however, referring to them gets more complicated. R can include variables from multiple places (e.g. two data frames, or a data frame and the workspace), so it becomes important to know your options and how R views them.

You can specify the names of both a data frame and a variable using the compound forms mydata$myvar or mydata["myvar"]. However, that often means that you have to type the name of the data frame quite a lot.

If you use the form “with(mydata,…” then R will look in that data frame for the “short” variable names before it looks elsewhere, like in your workspace. That allows you to type the data frame name only once per function call, but in a long program you would still end up typing it a lot.

Modeling functions in R often let you specify “data = mydata” allowing you to use short variable names in formulas like “y ~ x”. The result is like the “with” function, you must type the data frame name once per function call. (SAS users take note: variables used outside of formulas will not be found with this approach!)

Finally, you can attach the data frame with “attach(mydata)”. This copies the variables into a temporary space that lets you then refer to them by their short names. This has the big advantage of allowing all the following function calls to use short variable names. Unfortunately, it has the big disadvantage of being confusing. Confusion #1 is that people feel that variables they create will go into the data frame automatically; they will not. Unless you specify a data frame using either mydata$newvar or mydata["newvar"], new variables are created in your workspace. Confusion #2 is that R will look in your workspace before it looks at the attached versions of variables. So if variables with the same names exist there, those will be used instead. Confusion #3 is that even though detach(mydata) will reverse the process, if you run your program multiple times, you may have attached the data multiple times, and detaching once does not fully undo the attached state. As confusing as that is, I use attach frequently and rarely get burned by it.

For example, with variables x and y stored in mydata (and nowhere else) you could do a linear regression model using any one of these approaches:

lm(mydata$y ~ mydata$x)

lm(mydata["y"] ~ mydata["x"])

with(mydata, lm(y ~ x))

lm(y ~ x, data = mydata)

attach(mydata)
lm(y ~ x)

As if that weren’t complicated enough, x and y do not both have to be in the same data frame! The x variable could be in mydata while the y variable could be in the workspace, or in an attached version of mydata, or in some other data frame. That would be dangerous, of course, since it would be up to you to ensure that the values of each observation match, or the resulting model would be nonsense. However, this kind of flexibility can also be very useful.

With all this flexibility, it’s important to know the order in which R chooses variables. A simple example can show us the order R uses. Here I am creating four data frames whose x and y variables will have a slope that is indicated by the data frame name. For example, the variables in df10 have a slope of 10. This will make it easy for us to see which version of the variables R is using.

> y <- c(1,2,3,4,5,6,7,8,9,10)
> x <- c(1,2,5,5,5,5,5,8,9,10)
> df1    <- data.frame(x, y)     
> df10   <- data.frame(x, y = y*10  )
> df100  <- data.frame(x, y = y*100 )
> df1000 <- data.frame(x, y = y*1000)
> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Notice that I have deleted the original x and y variables, so at the moment the variables x and y exist only within the data frames. Running a regression with lm(y ~ x) will not work, since R does not look inside data frames unless you tell it to. Even if it did, it would have no way to know which set of x’s and y’s to use. Next I will take two different approaches to “selecting” a data frame: I attach df1, and I copy the variables from df10 into the workspace.

> attach(df1)
> y <- df10$y
> x <- df10$x

Next, I do something rarely useful, calling a linear model using both “with” and “data=”. Which will dominate?

> with(df100, lm(y ~ x, data = df1000))

Call:
lm(formula = y ~ x, data = df1000)

Coefficients:
(Intercept)            x  
          0         1000

Since the slope is 1000, it’s clear that the “data=” argument was dominant, so R looks there first. If it found both x and y, it would stop looking. But if it found only one variable, it would continue to look elsewhere for the other. If the other variable were in the “with” data frame, it would then use it.
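You can check that one-variable fallback yourself with a data frame containing only x (df.partial is a hypothetical name, not part of the example above):

> df.partial <- data.frame(x = df1000$x)
> with(df100, lm(y ~ x, data = df.partial))

Since df.partial supplies only x, R should fall back to the “with” data frame for y, yielding a slope of 100.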

Next I’ll remove the “data” argument and see what happens.

> with(df100, lm(y ~ x))

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0          100

This time the “with” data frame was used for both variables. If either variable had not been in that data frame, R would have continued looking in the workspace and in the attached copy. But which would it use first? Next, I’m not specifying a data frame at all.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0           10

The slope of 10 tells us that it found the copies of x and y that I copied from df10 into the workspace. Let’s delete those variables and list the objects in our workspace to ensure that they’re gone.

> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Both x and y are clearly gone. So let’s see if we can still use them.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0            1

We deleted x and y but we can still use them! However, we see from the slope of 1 that R has used a different pair of x and y variables. They’re the ones that were copied to my search path when I used “attach(df1)”. I had to remember that I had attached them. It’s this kind of confusion that makes many R users avoid using attach. Finally, I’ll detach df1 and see what happens.

> detach(df1)
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object 'y' not found

Now, even though all the data frames in our workspace contain x and y variables, R does not look inside them to find any. Even if it did, it would have no way of knowing which to choose.

We have seen that R looks in various places for variables. In order, they are: what you specify in “data=”, the data frame named in “with(mydata,…”, your workspace, and finally attached copies of your data frame. The most recently attached copies are the ones it will use first. I hope this will help you use R with both less typing and less confusion.

Things I’ve Learned About WordPress.com

I’m almost done moving this site from Google Sites to WordPress. This post describes some of the things I’ve learned about WordPress.com.

By default, WordPress.com makes your site look like a blog. I prefer it to look like a website that contains a blog. You can change that in Site Admin under Settings > Reading. The Front Page Displays box determines what people will see when they arrive. By default, that’s your latest blog entry. You can change that to any page you like.

WordPress allows a very limited set of files to download. My book support files are in R, SAS, SPSS, Stata, sas7bdat, etc., so I zip them up into a single file. Since WordPress.com does not allow you to distribute zip files, I had to put them in my DropBox public folder and link to them from WordPress.com.

You can organize your menus by either parent/child relationships among pages or by using custom menus. Custom menus have the advantage of allowing all pages to be at the root of your site, keeping nice short URLs like “https://r4stats.com/popularity”. However, many site templates do not support custom menus, so you are very constrained in your choice of templates. Using the “popularity” article as an example, I created a page called “Articles” and let it be the parent. So now the URL is “https://r4stats.com/articles/popularity”. That’s too bad, since there are old links out there that use the short version. I’ll put notes on them to redirect people. I could use “redirects,” but I would prefer people to see the links and note the changes. That will address the many links that used “https://sites.google.com/site/r4statistics/” rather than the shorter equivalent, “http://r4stats.com”.

One of the most frustrating problems I saw resulted from WordPress.com not letting you change a URL to the one you wanted. For example, I wanted my Miscellaneous page to have the URL http://r4stats.wordpress.com/misc, but it insisted on adding a “-2” to the end, as in http://r4stats.wordpress.com/misc-2. I looked all over for a page that already used the “misc” link but found none. Then it occurred to me that it might be in the trash. It was. Once I deleted it from the trash, WordPress allowed me to reuse the simpler link.

I spent a crazy amount of time figuring out how to get my program code examples to display in Courier or any similar monospaced font. I paid the $30 fee to edit my cascading style sheet (CSS), only to find that it let me change the font of the whole page, not just the code. I finally found that if you click the upper-right-most button on the toolbar, labeled Show/Hide Kitchen Sink, a style menu appears. It contains a style named “Preformatted,” which is monospaced. Tal Galili helpfully pointed me to a wonderful article he had written on how to make R code look nice on WordPress.

So that’s the sum of my lessons so far. On the whole, I preferred the website tool at Google Sites, but I do like having WordPress blogs built right into the site. That makes it easy to hook the blogs into R-Bloggers.com and PROC-X.com.