Bob Muenchen | r4stats.com

Forecast Update: Will 2014 be the Beginning of the End for SAS and SPSS?

[Since this was originally published in 2013, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]

I recently updated my plots of the data analysis tools used in academia in my ongoing article, The Popularity of Data Analysis Software. I repeat those here and update my previous forecast of data analysis software usage.

Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. As you can see in Fig. 1, the use of most analytic software is growing rapidly in academia. The only one growing slowly, very slowly, is Statistica.

Figure 1. The growth of data analysis packages with SAS and SPSS removed.

While they remain dominant, the use of SAS and SPSS has been declining rapidly in recent years. Figure 2 plots the same data, adding SAS and SPSS and dropping JMP and Statistica (and changing all colors and symbols!).

Figure 2. Scholarly use of data analysis software with SAS and SPSS added, JMP and Statistica removed.

Since Google changes its search algorithm, I re-collect all the data every year. Last year’s plot (below, Fig. 3) ended with the data from 2011 and contained some notable differences. For SPSS, the 2003 data value is quite a bit lower than the value collected in the current year. If the data were not collected by a computer program, I would suspect a data entry error. In addition, the old 2011 data value in Fig. 3 for SPSS showed a marked slowing in the rate of usage decline. In the 2012 plot (above, Fig. 2), not only does the decline not slow in 2011, but both the 2011 and 2012 points continue the sharp decline of the previous few years.

Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.

Here is the data from Google Scholar:

         R   SAS   SPSS Stata
1995     7  9120   7310    24
1996     4  9130   8560    92
1997     9 10600  11400   214
1998    16 11400  17900   333
1999    25 13100  29000   512
2000    51 17300  50500   785
2001   155 20900  78300   969
2002   286 26400  66200  1260
2003   639 36300  43500  1720
2004  1220 45700 156000  2350
2005  2210 55100 171000  2980
2006  3420 60400 169000  3940
2007  5070 61900 167000  4900
2008  7000 63100 155000  6150
2009  9320 60400 136000  7530
2010 11500 52000 109000  8890
2011 13600 44800  74900 10900
2012 17000 33500  49400 14700
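
For readers who want to work with these counts directly, here is a minimal sketch that types the table into an R data frame and finds the year in which each package peaked. The object names scholar, packages and peak_year are illustrative assumptions, not part of the original analysis.

```r
# Google Scholar hit counts, typed in from the table above (1995-2012).
scholar <- data.frame(
  year  = 1995:2012,
  R     = c(7, 4, 9, 16, 25, 51, 155, 286, 639, 1220, 2210, 3420,
            5070, 7000, 9320, 11500, 13600, 17000),
  SAS   = c(9120, 9130, 10600, 11400, 13100, 17300, 20900, 26400, 36300,
            45700, 55100, 60400, 61900, 63100, 60400, 52000, 44800, 33500),
  SPSS  = c(7310, 8560, 11400, 17900, 29000, 50500, 78300, 66200, 43500,
            156000, 171000, 169000, 167000, 155000, 136000, 109000,
            74900, 49400),
  Stata = c(24, 92, 214, 333, 512, 785, 969, 1260, 1720, 2350, 2980, 3940,
            4900, 6150, 7530, 8890, 10900, 14700)
)

# Year in which each package's count was highest.
packages  <- c("R", "SAS", "SPSS", "Stata")
peak_year <- sapply(scholar[packages], function(v) scholar$year[which.max(v)])
peak_year
```

In this table R and Stata are still at their highest in 2012, while SAS peaks in 2008 and SPSS in 2005.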

ARIMA Forecasting

I forecast the use of R, SAS, SPSS and Stata five years into the future using Rob Hyndman’s forecast package and the default settings of its auto.arima function. The dip in SPSS use in 2002-2003 drove the function a bit crazy as it tried to see a repetitive up-down cycle, so I modeled the SPSS data only from its 2005 peak onward. Figure 4 shows the resulting predictions.

The forecast shows R and Stata surpassing SPSS and SAS this year (2013), with Stata coming out on top. It also shows all scholarly use of SPSS and SAS stopping in 2014 and 2015, respectively. Any forecasting book will warn you of the dangers of looking too far beyond the data, and the above forecast does just that.
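
As a rough sketch of the approach described above, the code below fits auto.arima with its default settings to the SPSS counts from the 2005 peak onward and forecasts five years ahead. It assumes the forecast package is installed. The object names spss, fit and fc are illustrative, and this is not necessarily the exact code behind the figure.

```r
# SPSS counts from the 2005 peak through 2012, taken from the data table.
spss <- ts(c(171000, 169000, 167000, 155000, 136000, 109000, 74900, 49400),
           start = 2005)

# Assumption: the forecast package is installed; skip quietly if it is not.
if (requireNamespace("forecast", quietly = TRUE)) {
  fit <- forecast::auto.arima(spss)       # default settings
  fc  <- forecast::forecast(fit, h = 5)   # five years: 2013 through 2017
  print(fc)
}
```

Nothing in an ARIMA model keeps the point forecast above zero, which is how a forecast like this can show usage "stopping" altogether.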

Guesstimate Forecasting

So what will happen? Each reader probably has his or her own opinion; here’s mine. The growth in R’s use in scholarly work will continue for three more years, at which point it will level off at around 25,000 articles in 2015. This growth will be driven by:

The continued rapid growth in add-on packages
The attraction of R’s powerful language
The near monopoly R has on the latest analytic methods
Its free price
The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (IBM is loosening up on this a bit)

What will slow R’s growth is its lack of a graphical user interface that:

Is powerful
Is easy to use
Provides direct cut/paste access to journal style output in word processor format
Is standard, i.e. widely accepted as The One to Use
Is open source

While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its full range of capabilities and its speed of use. Regardless of which is best, GUI users far outnumber programmers and, until resolved, this will limit R’s long-term growth. There are GUIs for R, but there are so many to choose from that none has become the clear leader (Deducer, R Commander, Rattle, at least two from commercial companies and still more here). If from this “GUI chaos” a clear leader were to emerge, then R could continue its rapid growth and end up as the most used software.

The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata) but also by SAS Institute’s self-inflicted GUI chaos. For years they have offered too many GUIs such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use are not sure which GUI to teach, so they continue teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a respectable GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.

The use of SPSS for scholarly work will decline less sharply in 2013 and will level off in 2015 at around 27,000 articles because:

Many of the people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
Many of the people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
Many of the people who needed more interactive visualization have already switched to JMP

The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.

Although Stata is currently the fastest growing package, its growth will slow in 2013 and level off by 2015 at around 23,000 articles, leaving it in fourth place. The main cause of this will be inertia of users of the established leaders, SPSS and SAS, as well as the competition from all the other packages, most notably R. R and Stata share many strengths and with one being free, I doubt Stata will be able to beat R in the long run.

The other packages shown in Fig. 1 will also level off around 2015, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata for fourth place.

The futures of SAS Enterprise Miner and IBM SPSS Modeler are tied to the success of each company’s more mainstream products, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes. Both companies could significantly shift their future by combining their two main GUIs. Imagine a menu & dialog-box system that draws a simple flowchart as you do things. It would be easy to learn and users would quickly get the idea that you could manipulate the flowchart directly, increasing its window size to make more room. The flowchart GUI lets you see the big picture at a glance and lets you re-use the analysis without switching from GUI to programming, as all other GUI methods require. Such a merger could give SAS and SPSS a game-changing edge in this competitive marketplace.

So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to do your own forecasts and add links to them in the comment section below. You can use my data or follow the detailed blog at Librestats to collect your own. One thing is certain: the coming decade in the field of analytics will be interesting indeed!

SAS, SPSS, Stata Users: Learn R from Home June 17

Has learning R been driving you a bit crazy? If so, it may be that you’re “lost in translation.” On June 17 and 19, I’ll be teaching a webinar, R for SAS, SPSS and Stata Users. With each R concept, I’ll introduce it using terminology that you already know, then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way if you need more explanation later or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology so you can use either to find what you need.

A complete outline of the workshop plus a registration link is here. I have no artistic skills, but I’ve always been amazed at what artists can do. I taught this workshop in Knoxville on April 29, and pro photographer Steve Chastain made it look way more exciting than I recall! His view of it is here; turn your speakers up and get ready to boogie!

Knoxville R Users Group Formed, Free Training Offered

R is popular free and open-source software for graphics and data analytics. The Knoxville R Users Group is being formed to help people learn R and improve their skills with it. Three departments of The University of Tennessee are working together to get it started: the Office of Information Technology, the National Institute for Computational Science’s RDAV group (Remote Data Analysis and Visualization) and the Department of Statistics, Operations, and Management Science. The latter’s Business Analytics program was recently ranked among the top 20 such departments in the U.S.

To start the group off I’ll teach a hands-on introductory workshop on R on April 29th from 8 a.m. to 5:00 p.m. The topics covered are described at http://r4stats.com/workshops/r4sas-spss-stata/. Note that you do not need to know SAS, SPSS or Stata, but the workshop will include numerous warnings where R works very differently from those other packages. The workshop is free and open to the Knoxville area public. UT faculty, staff and students can register at: http://oit.utk.edu/training and non-UT people can register at the user group web site: http://www.meetup.com/Knoxville-R-Users-Group. Course location, materials including slides, programs, practice data sets and exercises will be available on http://www.meetup.com/Knoxville-R-Users-Group on Saturday, April 27 (if not before).

R Tackles Big Garbage

April 1, 2013 – Although the capabilities of the R system for data analytics have been expanding with impressive speed, it has heretofore been missing important fundamental methods. A new function works with the popular plyr package to provide these missing algorithms. Function names in plyr begin with two letters which indicate their input and output. For example, with the ddply function, the first “d” in its name indicates that a data frame will be read in, and the second “d” indicates that a data frame of results will be written out. Those two letters could also be “a” for array and “l” for list, in any combination.

While the vast array of functions in R cover most data analysis situations, they have been completely unable to handle data that bears no actual relationship to the research questions at hand. Robert A. Muenchen, author of R for SAS and SPSS Users, has written a new ggply function, which can adroitly handle the all too popular “garbage in, garbage out” research situation. The function has only one argument, the garbage to analyze. It automatically performs the analysis strongly preferred by “gg” researchers by splitting numeric variables at the median and performing all possible cross tabulations and chi-square tests, repeated for the levels of all factors. The integration of functions from the new pbdR package allows ggply to handle even Big Garbage using 12,000 cores.

While the median split approach offers the benefit of decreasing power by 33%, further precautions are taken by applying Muenchen’s new Triple Bonferroni with Backpropagation correction. This algorithm controls the garbage-wise error rate by multiplying the p-values by 3k, where k is the number of tests performed. While most experiment-wise adjustment calculations set the worst case p-value to the theoretical upper limit of 1.0, simulations run by Muenchen indicate that this is far too liberal for this type of analysis. “By removing this artificial constraint, I have already found cases where the final p-value was as high as 3,287 indicating a very, very, very non-significant result” reported Muenchen. The “backpropagation” part of the method re-scales any p-values that might have survived the initial correction by setting them automatically to 0.06. As Muenchen states, “this level was chosen to protect the researcher from believing an actual useful result was found, while offering hope that achieving tenure might still be possible.”

Reaction from the R community was swift and enthusiastic. Bill Venables, co-author of the popular book Modern Applied Statistics in S said, “Muenchen’s new approach for calculating Type III Sums of Squares from chi-squared tests finally puts my mind at ease about using R for statistical analysis.” R programmer extraordinaire Patrick Burns said, “The ggply function is good, but what really excites me is the VBA plugin Bob wrote for Excel. Now I can fully integrate ggply into my workflow.” Graphics guru Hadley Wickham, author of ggplot2: Elegant Graphics for Data Analysis grumbled, “After writing ggplot and ddply, I’m stunned that I didn’t think of ggply myself. That Muenchen fellow is constantly bugging me to add irritating new features to my packages. I have to admit though that this is a breakthrough of epic proportions. As they say in Muenchen’s neck of the woods, even a blind squirrel finds a nut now and then.”

The SAS Institute, already concerned with competition from R, reacted swiftly. SAS CEO Jim Goodnight said, “SAS is the leader in Big Data, and we’ll soon catch up to R and become the leader in Big Garbage as well. PROC GGPLY is already in development. It will be included in SAS/GG, which is, of course, an additional cost product.”

R’s 2012 Growth in Capability Exceeds SAS’ All Time Total

by Robert A. Muenchen

I’m slowly gathering all the data needed to update my ongoing article, The Popularity of Data Analysis Software. The section below is the latest installment.

Growth in Capability

The capability of all the software in this article has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.

Figure 10 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.995). Early version numbers of R increased by 0.10 while more recent ones increased by 0.01. To make the x-axis consistent, the graph displays simply the numerical order in which the versions were released. The right-most point is for version 2.15.2, the last version released in 2012.

Figure 10. Number of R packages plotted for each major release of R. The last value on the x-axis represents version 2.15.2, the final release in 2012.

As rapid as this growth has been, the data in Figure 10 represents only the main CRAN repository. R does have eight other software repositories, such as the one at http://www.bioconductor.org/, that are not included in this graph. A program run on 3/19/2013 counted 6,275 R packages at all major repositories, 4,315 of which were at CRAN. So the growth curve for the software at all repositories would be roughly 45% higher on the y-axis than the one shown in Figure 10 (6,275 / 4,315 ≈ 1.45). As with any analysis software, individuals also maintain their own separate collections, typically available on their web sites.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In its most recent version, 9.3, SAS offers 100 programming statements, 258 procedures (Base, STAT, ETS, Graph, HP Forecasting, Macro, OR, QC) and 520 SAS functions and call routines, and 314 IML statements, functions and subroutines for a total of 1,192 items that are roughly equivalent to R functions. R packages contain a median of 5 functions (Rasmus Bååth, 12/2012 personal communication). Therefore R has approximately 31,375 functions compared to SAS’ 1,192. In fact, during 2012 alone, R added more functions/procs than SAS Institute has provided in its entire history! That’s 701 packages, counting only CRAN, or around 3,505 new functions in 2012.
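
The arithmetic behind these totals is easy to check in R. This is only a sketch of the calculation as stated above; the variable names are illustrative. Multiplying a package count by the median number of functions per package gives a rough estimate, not an exact count.

```r
# SAS 9.3 counts quoted above.
sas_items <- c(statements = 100, procedures = 258,
               functions_and_call_routines = 520, iml = 314)
sas_total <- sum(sas_items)                    # 1192

# R estimates: number of packages times the median of 5 functions each.
median_functions <- 5
r_functions_all  <- 6275 * median_functions    # all major repositories: 31375
r_functions_2012 <- 701 * median_functions     # CRAN additions in 2012: 3505

# Did R's estimated 2012 additions exceed the SAS all-time total?
r_functions_2012 > sas_total                   # TRUE
```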

Of course these R functions and SAS procedures / functions are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, giving them potentially more output per command. However, R functions can nest inside one another, creating nearly infinite combinations of output. While the comparison is not perfect, it is certainly an eye opener.

Stay tuned for future updates which will include what employers are now advertising for and recent trends in academic use of analytic software.

What Analytic Software are People Discussing?

by Robert A. Muenchen

How can we measure the popularity or market share of analytic software? One way is to see what people are discussing. I’m in the process of updating my annual article, The Popularity of Data Analysis Software. Below is the newly updated Internet Discussion section. Don’t bother to read the rest of the main article unless you’re in a hurry. I’ve been collecting data for several of the other more interesting plots and will have more to report in following posts. As always, I’m very interested in getting feedback. If you know of other discussion forums that I can collect data on without too much effort, please let me know.

Internet Discussion

There are some stable and objective measures regarding analytic software. Schwartz (2009) suggested estimating relative popularity by plotting the amount of email discussion devoted to each. The most widely used packages all have discussion lists, or “listservs” devoted to them. The less popular ones either do not have such discussions or, like the lists for Minitab or S-PLUS, may have only a dozen or so emails per year.

Some software packages have multiple discussion lists. For example, there are 21 devoted to using R for various focused areas such as graphics, mapping, ecology, epidemiology, etc. (http://www.r-project.org/mail.html). A broader list, including a version of R-Help in Spanish, lists 49 discussions (https://stat.ethz.ch/mailman/listinfo).

Figure 1a shows the level of activity on only the main discussion listserv for each package (i.e. forums, news groups and Google groups are excluded). Each point represents the sum of the 12 monthly counts that occurred in that year. This plot contains data through the end of 2012. If you read this article in previous years, this plot used to display the mean number of emails per month rather than the sum. Therefore the scale of the y-axis is different but the relative locations of the points are virtually identical. I made this change to enable a better comparison to discussion forums (e.g. Fig. 1b).

Figure 1a. Sum of monthly email traffic on each software’s main listserv discussion list.

We can see that discussion of R has grown the most rapidly and, for the past few years, R is the most discussed software by an almost two-to-one margin. In recent years, it is followed by Stata, SAS and SPSS, respectively. Stata showed steady discussion growth until it passed SAS in 2010. SAS saw rapid growth in its discussion until 2006 when it leveled off and then declined. That decline coincided with the strong growth of both R and Stata, offering competition to SAS. SPSS held steady at a low rate across the time frame, which may be attributable to its great ease of use relative to the other packages. With both the interface and the documentation aimed at people who prefer GUIs over programming, there’s less need to ask how to do variations on an analysis. In fact, there’s less ability to do such variations. As a result, I doubt SPSS’ low showing in this graph is indicative of its popularity or market share.

It would be interesting to see what topics were most discussed on each list. The only such analysis of which I am aware was done by Arthur Tabachnek (2010) for the SAS list. The most popular topic in 2009 turned out to be…R! You can read his full analysis here under slides from the 2010 session.

In the last year or two, R and Stata joined SAS in the decline in listserv discussion. Given the sharp increase in the popularity of business analytics, Big Data, and so on, it is unlikely that people are using or talking about these tools less. Instead, alternative forums of discussion have appeared. The site Stack Overflow (http://stackoverflow.com) covers a wide range of programming and statistical topics, while its sister site, Cross Validated (http://stats.stackexchange.com/), focuses only on statistical analysis. A third site, Talk Stats (http://www.talkstats.com), also focuses on statistical analysis. At all three sites, users tag their topics making it particularly easy to focus searches. Figure 1b shows the software people are discussing there.

Figure 1b. Number of posts per software on each forum on 2/10/2013.

We can see that the discussion of R is dramatically higher than the other packages, which don’t differ very much. Much of this difference is due to the influence of Stack Overflow, reflecting the vastly greater popularity of R as a programming language. However, even removing that effect, it is easy to see that R still dominates the discussions on the more statistically-oriented forums. This data is cumulative, but it would be very interesting to see how it grew by year. Without access to such data, at least we have the data in Fig. 1a to give us a feel for history.

Other popular discussion forum sites are LinkedIn.com and Quora.com. Neither of these sites makes it easy to count the number of posts, but they do display the number of people who have joined discussion groups (Figure 1c).

Figure 1c. Number of people registered in the main discussion group for each software.

In Figure 1c we get a better view of corporate software use. I do not know the ratio of corporate to academic use of LinkedIn, but the academics I know (quite a few) use it very little. In this world, SAS is the leader with R close behind. It’s interesting to see SPSS with a 50% lead over Stata; it was also slightly higher in Fig. 1b. Remember these are people who have joined a group, not necessarily people who are talking, as in the previous two figures. Still, group membership should be a reasonable proxy for popularity or market share. In the coming weeks, I’ll be updating the data on which software scholars are using, the growth of R packages and what skills employers are seeking in their new hires.

Comparing Transformation Styles: attach, transform, mutate and within

There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!

Here’s a data frame with one variable:

> mydata <- data.frame(x = 1:5)
> mydata

  x
1 1
2 2
3 3
4 4
5 5

Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. The simplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:

> mydata.new <- mydata

> mydata.new$x2 <- mydata.new$x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

That works, but I had to type more characters for the “mydata.new” part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use the attach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x” instead of “mydata$x”. However, the attach function is tricky to use. Here’s the most common mistake made by beginners:

> mydata.new <- mydata
> attach(mydata.new)
> x2 <- x  ^ 2 
> x3 <- x2 + 100

> mydata.new
  x
1 1
2 2
3 3
4 4
5 5

There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:

> ls()
[1] "mydata" "mydata.new" "x2" "x3"

> x2; x3

[1]  1  4  9 16 25
[1] 101 104 109 116 125

I’ll fix that, but first I’ll remove x2 and x3 from the workspace and detach mydata.new so we can start fresh.

> rm(x2, x3)
> detach(mydata.new)

We can fix this problem by directing new variables into the data frame using dollar format. So here’s the next thing a beginner is likely to try:

> mydata.new <- mydata
> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- x2 + 100
Error: object 'x2' not found

> detach(mydata.new)

The variable x2 got created and put into mydata.new. However, when the attempt to create x3 was run, variable x2 could not be found. This is because the attached version of the data is a copy that was made in the past; it is not a live connection. Therefore, to refer to simply “x2” you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:

> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

> detach(mydata.new)

That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.

The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.

> mydata.new <- transform(mydata, x2 = x ^ 2)

> mydata.new

  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-”.

Eliminating the tedious repetition of “mydata$…” makes the formulas easier to enter, read and debug. However, the transform function has a problem: it is unable to use a variable that it just created. For example:

> mydata.new <- transform(mydata,
+                 x2 = x  ^ 2,
+                 x3 = x2 + 100
+                 )

Error in eval(expr, envir, enclos) : object 'x2' not found

We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.
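
Running transform twice, as just described, might look like the following minimal sketch. It uses the same mydata as above and only base R.

```r
mydata <- data.frame(x = 1:5)

# The first call creates x2; the second call can then see it.
mydata.new <- transform(mydata, x2 = x ^ 2)
mydata.new <- transform(mydata.new, x3 = x2 + 100)
mydata.new

# The same two steps written as one nested expression.
mydata.nested <- transform(transform(mydata, x2 = x ^ 2), x3 = x2 + 100)
```

Both forms keep x2 in the result, at the cost of naming the data frame an extra time.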

In the above code, note the comma between the two equations. Since transform uses equations as the values of transform’s arguments, all equations must be followed by commas, except for the last one, which is followed by the final close parenthesis.

Hadley Wickham’s dplyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:

> library("dplyr")
> mydata.new <- mutate(mydata,
+                 x2 = x ^ 2,
+                 x3 = x2 + 100
+                 )

> mydata.new 

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

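However, depending on the version you run, mutate can have a limitation: it may not be able to re-create a variable that it just created, so you can use its new variables only on the right-hand side of your equations. The output below shows that behavior, which is how mutate works in Hadley’s older plyr package (it looks up each argument by name and so finds the first x2 formula both times). Recent dplyr releases evaluate the arguments in order and apply both steps, so test this on your own installation. In this next example, rather than create x3, I’ll continue to use the name x2: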

> mydata.new <- mutate(mydata,
+                 x2 = x  ^ 2, 
+                 x2 = x2 + 100
+                 )

> mydata.new

  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you are recoding a variable using the ifelse function (albeit inefficiently) this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)
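
The nesting mentioned above might look like the following sketch, which recodes x into three groups with a single assignment. The variable name xgroup, the cut points and the labels are hypothetical, chosen only for illustration.

```r
mydata <- data.frame(x = 1:5)

# One nested ifelse call assigns the new variable exactly once,
# so no variable has to be re-created in a later step.
mydata$xgroup <- ifelse(mydata$x <= 2, "low",
                 ifelse(mydata$x <= 4, "medium", "high"))
mydata
```

The same nested expression can be used as the value of a single argument to transform, mutate or within.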

Finally, we come to the within function. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:

> mydata.new <- within(mydata, {
+              x2 <- x  ^ 2
+              x3 <- x2 + 100
+              } )

> mydata.new

  x  x3 x2
1 1 101  1
2 2 104  4
3 3 109  9
4 4 116 16
5 5 125 25

Notice that we’re back to using the assignment operator “<-” and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.
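
If the reversed column order bothers you, one simple remedy is to select the columns by name afterwards. This is a minimal base R sketch using the same example.

```r
mydata <- data.frame(x = 1:5)

mydata.new <- within(mydata, {
  x2 <- x ^ 2
  x3 <- x2 + 100
})

# Put the columns back in the order the formulas were written.
mydata.new <- mydata.new[c("x", "x2", "x3")]
mydata.new
```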

When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:

> mydata.new <- within(mydata, {
+               x2 <- x  ^ 2
+               x2 <- x2 + 100
+               } )

> mydata.new

  x  x2
1 1 101
2 2 104
3 3 109
4 4 116
5 5 125

Since the within function does this example so well, why use anything else? The mutate function shares syntax with dplyr’s summarise function and their combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate to do this type of task and remember to not transform a variable that I just created!
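
Here is a small sketch of that by-group combination. It assumes the dplyr package is installed and adds a made-up grouping variable g to the example data. The names by_g, with_means and group_stats are illustrative.

```r
# Hypothetical data: x as before, plus a made-up grouping variable g.
mydata <- data.frame(g = c("a", "a", "b", "b", "b"), x = 1:5)

# Assumption: dplyr is installed; skip quietly if it is not.
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_g <- dplyr::group_by(mydata, g)

  # mutate adds the group mean to every row, keeping all five rows.
  with_means <- dplyr::mutate(by_g, group_mean = mean(x))

  # summarise collapses the data to one row per group.
  group_stats <- dplyr::summarise(by_g, mean_x = mean(x), n = dplyr::n())

  print(with_means)
  print(group_stats)
}
```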

That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.

R for SAS, SPSS, Stata Users Workshop Redesigned

My workshop R for SAS, SPSS and Stata Users has been popular over the years, but it’s time for an overhaul. A common request has been to simplify it, so I have moved data management to a separate 4-hour workshop, Managing Data with R. This makes it much easier to absorb the basics in the remaining two 4-hour sessions. When you’re ready for more, you can take the other workshop which I’ll be offering several times per year. Detailed course outlines are available at the workshop links above and at the Revolution Analytics web site.

Specifying Variables in R

R has several ways to specify which variables to use in an analysis. Some of the most frustrating errors can result from not understanding the order in which R searches for variables. This post demonstrates that order, hopefully smoothing your future use of R.

If all your variables are vectors in your workspace, using them in an analysis is easy: simply name them. For example, you could build a linear model (regression) using the lm function like this:

lm(y ~ x)

However, data frames exist for a good reason. They help organize variables and keep the values of each observation (the rows) locked together. For example, when you sort a data frame, entire rows are moved, not just the single variable you’re sorting on. Once variables are stored in a data frame, however, referring to them gets more complicated. R can include variables from multiple places (e.g. two data frames or a data frame and the workspace) so it becomes important to know your options and how R views them.

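You can specify the names of both a data frame and a variable using the compound forms mydata$myvar or mydata["myvar"]. (The single-bracket form returns a one-column data frame; mydata[["myvar"]] returns the variable itself as a vector.) However, that often means that you have to type the name of the data frame quite a lot.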
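The compound forms do not all return the same kind of object, which matters to functions that expect a plain vector. A small sketch, using a made-up data frame named mydata with one variable, myvar:

```r
# Hypothetical example data, not from the original post.
mydata <- data.frame(myvar = c(1, 2, 3))

v_dollar <- mydata$myvar       # a numeric vector
v_double <- mydata[["myvar"]]  # the same numeric vector
d_single <- mydata["myvar"]    # a data frame with one column

is.numeric(v_dollar)
is.data.frame(d_single)
```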

If you use the form “with(mydata,…” then R will look in that data frame for the “short” variable names before it looks elsewhere, like in your workspace. That allows you to type the data frame name only once per function call, but in a long program you would still end up typing it a lot.

Modeling functions in R often let you specify “data = mydata” allowing you to use short variable names in formulas like “y ~ x”. As with the “with” function, you must type the data frame name once per function call. (SAS users take note: variables used outside of formulas will not be found with this approach!)

Finally, you can attach the data frame with “attach(mydata)”. This copies the variables into a temporary space that lets you then refer to them by their short names. This has the big advantage of allowing all the following function calls to use short variable names. Unfortunately, it has the big disadvantage of being confusing. Confusion #1 is that people feel that variables they create will go into the data frame automatically; they will not. Unless you specify a data frame using either mydata$newvar or mydata["newvar"], new variables are created in your workspace. Confusion #2 is that R will look in your workspace before it looks at the attached versions of variables. So if variables with the same names exist there, those will be used instead. Confusion #3 is that even though detach(mydata) will reverse the process, if you run your program multiple times, you may have attached the data multiple times and detaching once does not fully undo the attached state. As confusing as that is, I use attach frequently and rarely get burned by it.
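The three confusions are easy to reproduce. The sketch below uses a made-up data frame, mydata, with a single variable x; the object names holding the results are only for illustration:

```r
# Hypothetical example data, not from the original post.
mydata <- data.frame(x = 1:5)

attach(mydata)

# Confusion #1: the new variable lands in the workspace, not in mydata.
x2 <- x ^ 2
in_data_frame <- "x2" %in% names(mydata)     # FALSE

# Confusion #2: a workspace variable masks the attached one.
x <- rep(0, 5)
mean_used <- mean(x)                         # uses the workspace x
rm(x)

# Confusion #3: attaching twice (as when re-running a program)
# means a single detach leaves one copy on the search path.
attach(mydata, warn.conflicts = FALSE)
detach(mydata)
still_attached <- "mydata" %in% search()     # TRUE

# Detach until no copy remains.
while ("mydata" %in% search()) detach(mydata)
```

Checking search() before and after is a quick way to see what is currently attached.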

For example, with variables x and y stored in mydata (and nowhere else) you could do a linear regression model using any one of these approaches:

lm(mydata$y ~ mydata$x)

lm(mydata[["y"]] ~ mydata[["x"]])

with(mydata, lm(y ~ x))

lm(y ~ x, data = mydata)

attach(mydata)
lm(y ~ x)

As if that weren’t complicated enough, x and y do not both have to be in the same data frame! The x variable could be in mydata and the y variable could be in the workspace or in an attached version of mydata or some other data frame. That would be dangerous, of course, since it would be up to you to ensure that the values of each observation match or the resulting model would be nonsense. However, this kind of flexibility can also be very useful.
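As a small illustration of mixing sources, here x lives in a data frame while y exists only in the workspace. The data values are made up for the sketch, and nothing in R checks that the rows of the two actually line up:

```r
# Hypothetical example data, not from the original post.
mydata <- data.frame(x = c(1, 2, 3, 4, 5))
y <- c(2, 4, 6, 8, 10)    # exists only in the workspace

# x is found in mydata; y is not there, so R finds it in the workspace.
fit <- lm(y ~ x, data = mydata)
coef(fit)
```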

With all this flexibility, it’s important to know the order in which R chooses variables. A simple example can show us the order R uses. Here I am creating four data frames whose x and y variables will have a slope that is indicated by the data frame name. For example, the variables in df10 have a slope of 10. This will make it easy for us to see which version of the variables R is using.

> y <- c(1,2,3,4,5,6,7,8,9,10)
> x <- c(1,2,5,5,5,5,5,8,9,10)
> df1    <- data.frame(x, y)     
> df10   <- data.frame(x, y = y*10  )
> df100  <- data.frame(x, y = y*100 )
> df1000 <- data.frame(x, y = y*1000)
> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Notice that I have deleted the original x and y variables so at the moment, variables x and y exist only within the data frames. Running a regression with lm(y ~ x) will not work since R does not look into data frames unless you tell it to. Even if it did, it would have no way to know which set of x’s and y’s to use. Next I will take two different approaches to “selecting” a data frame. I attach df1 and copy the variables from df10 into the workspace.

> attach(df1)
> y <- df10$y
> x <- df10$x

Next, I do something rarely useful, calling a linear model using both “with” and “data=”. Which will dominate?

> with(df100, lm(y ~ x, data = df1000))

Call:
lm(formula = y ~ x, data = df1000)

Coefficients:
(Intercept)            x  
          0         1000

Since the slope is 1000, it’s clear that the “data=” argument was dominant. So R would look there first. If it found both x and y, it would stop looking. But if it only found one variable, it would continue to look elsewhere for the other. If the other variable were in the “with” data frame, it would then use it.
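The fall-through case can be checked directly. In this sketch, df1000_y is a hypothetical data frame that holds only y, so R has to find x in the “with” data frame:

```r
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x <- c(1, 2, 5, 5, 5, 5, 5, 8, 9, 10)
df100    <- data.frame(x, y = y * 100)
df1000_y <- data.frame(y = y * 1000)   # hypothetical: contains y but no x
rm(y, x)

# y comes from "data=", while x falls through to the "with" data frame.
fit <- with(df100, lm(y ~ x, data = df1000_y))
slope <- round(unname(coef(fit)["x"]))
slope
```

The slope reflects the y values from df1000_y, even though df100 also contains a y.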

Next I’ll remove the “data” argument and see what happens.

> with(df100, lm(y ~ x))

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0          100

This time the “with” data frame was used for both variables. If either variable had not been in that data frame, R would have continued to look in the workspace and in the attached copy. But which would it use first? Next, I’m not specifying a data frame at all.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0           10

The slope of 10 tells us that it found the copies of x and y that I copied from df10 into the workspace. Let’s delete those variables and list the objects in our workspace to ensure that they’re gone.

> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Both x and y are clearly gone. So let’s see if we can still use them.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0            1

We deleted x and y but we can still use them! However, we see from the slope of 1 that R has used a different pair of x and y variables. They’re the ones that were copied to my search path when I used “attach(df1)”. I had to remember that I had attached them. It’s this kind of confusion that makes many R users avoid using attach. Finally, I’ll detach df1 and see what happens.

> detach(df1)
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object 'y' not found

Now, even though all the data frames in our workspace contain an x and y variable, R does not look inside to find any of them. Even if it did, it would have no way of knowing which to choose.

We have seen that R looks in various places for variables. In order, they are: the data frame you specify in “data=”, the data frame named in “with(mydata,…”, your workspace, and finally attached copies of your data frames. Among attached copies, the most recently attached is used first. I hope this will help you use R with both less typing and less confusion.
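The whole search order can be confirmed in one short script. This sketch rebuilds the data frames from the example above; the slope helper and the s_ result names are hypothetical and exist only to keep the output compact:

```r
# Rebuild the four data frames used in the example.
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x <- c(1, 2, 5, 5, 5, 5, 5, 8, 9, 10)
df1    <- data.frame(x, y)
df10   <- data.frame(x, y = y * 10)
df100  <- data.frame(x, y = y * 100)
df1000 <- data.frame(x, y = y * 1000)
rm(y, x)

# Hypothetical helper: return only the rounded slope of a fitted model.
slope <- function(fit) round(unname(coef(fit)["x"]))

attach(df1)     # attached copy
y <- df10$y     # workspace copies
x <- df10$x

s_data      <- slope(with(df100, lm(y ~ x, data = df1000)))  # "data=" first
s_with      <- slope(with(df100, lm(y ~ x)))                 # then "with"
s_workspace <- slope(lm(y ~ x))                              # then workspace
rm(y, x)
s_attached  <- slope(lm(y ~ x))                              # then attached copy
detach(df1)

c(data = s_data, with = s_with, workspace = s_workspace, attached = s_attached)
```

Each slope identifies the data frame that supplied the variables, so the four values trace the search order from first to last.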

SAS Beats R on July 2012 TIOBE Rankings

The TIOBE Community Programming Index ranks the popularity of programming languages, but from a programming language perspective rather than as analytical software (http://www.tiobe.com). It extracts measurements from blogs, entries in Wikipedia, books on Amazon, search engine results, etc. and combines them into a single index. The July 2012 rankings place SAS in 24th place and R in 28th. This is a reversal from the January rankings, which had R in 24th place and SAS at 31st.

The Transparent Language Popularity Index is very similar to the TIOBE Index except that, as you might guess, its ranking software, algorithm and data are published for all to see. I didn’t find this index until July of 2012, at which time it ranked R in 12th place and SAS in 25th.

I have updated this information in my ongoing article, The Popularity of Data Analysis Software.