There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!
Here’s a data frame with one variable:
> mydata <- data.frame(x = 1:5)
> mydata
  x
1 1
2 2
3 3
4 4
5 5
Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. The simplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:
> mydata.new <- mydata
> mydata.new$x2 <- mydata.new$x ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100
> mydata.new
  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125
That works, but I had to type more characters for the “mydata.new” part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use the attach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x” instead of “mydata$x”. However, the attach function is tricky to use. Here’s the most common mistake made by beginners:
> mydata.new <- mydata
> attach(mydata.new)
> x2 <- x ^ 2
> x3 <- x2 + 100
> mydata.new
  x
1 1
2 2
3 3
4 4
5 5
There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:
> ls()
[1] "mydata"     "mydata.new" "x2"         "x3"
> x2; x3
[1]  1  4  9 16 25
[1] 101 104 109 116 125
I’ll fix that, but first I’ll remove x2 and x3 from the workspace and detach mydata.new so we can start fresh.
> rm(x2, x3)
> detach(mydata.new)
We can fix this problem by directing new variables into the data frame using dollar format. So here’s the next thing a beginner is likely to try:
> mydata.new <- mydata
> attach(mydata.new)
> mydata.new$x2 <- x ^ 2
> mydata.new$x3 <- x2 + 100
Error: object 'x2' not found
> detach(mydata.new)
The variable x2 was created and stored in mydata.new. However, when R tried to create x3, it could not find x2. That is because the attached version of the data is a copy made at the moment attach was called; it is not a live connection to the data frame. To refer to x2 by its short name, you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:
> attach(mydata.new)
> mydata.new$x2 <- x ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100
> mydata.new
  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125
> detach(mydata.new)
That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.
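For completeness, the other workaround mentioned earlier, attaching the data frame again so the short name x2 becomes visible, might look like the sketch below. It is written as a script rather than a console session, and it assumes no other object named x or x2 is sitting in your workspace.

```r
# Sketch: re-attach after adding x2 so the attached copy is refreshed.
mydata <- data.frame(x = 1:5)
mydata.new <- mydata

attach(mydata.new)
mydata.new$x2 <- x ^ 2       # x is found in the attached copy
detach(mydata.new)           # drop the stale copy, which lacks x2

attach(mydata.new)           # the new copy now includes x2
mydata.new$x3 <- x2 + 100    # the short name x2 works this time
detach(mydata.new)

mydata.new
```

Every new variable needs another detach and attach, so this only adds to the bookkeeping.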
The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.
> mydata.new <- transform(mydata, x2 = x ^ 2)
> mydata.new
  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25
Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-”.
Eliminating the tedious repetition of “mydata$…” makes the formulas easier to enter, read and debug. However, the transform function has a problem: it is unable to use a variable that it just created. For example:
> mydata.new <- transform(mydata,
+ x2 = x ^ 2,
+ x3 = x2 + 100
+ )
Error in eval(expr, envir, enclos) : object 'x2' not found
We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.
In the above code, note the comma between the two equations. Since the equations are the values of transform’s arguments, every equation must be followed by a comma, except for the last one, which is followed by the final close parenthesis.
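The two-step workaround described above, running transform once to create x2 and again to create x3 from it, could be sketched as follows. It is shown as a script rather than a console session.

```r
# Sketch: call transform twice so x2 exists before x3 is computed.
mydata <- data.frame(x = 1:5)

mydata.new <- transform(mydata, x2 = x ^ 2)            # first pass creates x2
mydata.new <- transform(mydata.new, x3 = x2 + 100)     # second pass can see x2

mydata.new
```

The second call takes mydata.new, not mydata, as its input; otherwise x2 would still be missing.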
Hadley Wickham’s dplyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:
> library("dplyr")
> mydata.new <- mutate(mydata,
+ x2 = x ^ 2,
+ x3 = x2 + 100
+ )
> mydata.new
  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125
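However, mutate does have a limitation, at least in the version shown here: it cannot re-create a variable that it just created. (Recent dplyr releases evaluate the arguments in order, so the output may differ on your installation.) So you can use its new variables only on the right-hand side of your equations. In this next example, rather than create x3, I’ll continue to use the name x2: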
> mydata.new <- mutate(mydata,
+ x2 = x ^ 2,
+ x2 = x2 + 100
+ )
> mydata.new
  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25
As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you are recoding a variable using the ifelse function (albeit inefficiently) this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)
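Here is a minimal sketch of the nesting idea, using base transform so it runs without extra packages. The variable name group and the cut points are made up for illustration. Because the whole recode is one nested expression, the new variable is assigned only once and never needs to be overwritten.

```r
# Sketch: recode x in a single assignment by nesting calls to ifelse.
mydata <- data.frame(x = 1:5)

# Hypothetical cut points: 1-2 is "low", 3-4 is "mid", 5 is "high".
mydata.new <- transform(mydata,
  group = ifelse(x <= 2, "low",
          ifelse(x <= 4, "mid", "high"))
)

mydata.new
```

Depending on your R version, group may be stored as a character vector or as a factor.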
Finally, we come to the within function. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:
> mydata.new <- within(mydata, {
+ x2 <- x ^ 2
+ x3 <- x2 + 100
+ } )
> mydata.new
  x  x3 x2
1 1 101  1
2 2 104  4
3 3 109  9
4 4 116 16
5 5 125 25
Notice that we’re back to using the assignment operator “<-” and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.
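If the column order matters to you, one way to restore it is to select the columns by name afterwards. This is only a sketch of one base R option; it assumes you know the names of all the columns you want to keep.

```r
# Sketch: put the columns back in the order the formulas were written.
mydata <- data.frame(x = 1:5)

mydata.new <- within(mydata, {
  x2 <- x ^ 2
  x3 <- x2 + 100
})

# Index by name so the result does not depend on the order within produced.
mydata.new <- mydata.new[, c("x", "x2", "x3")]

mydata.new
```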
When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:
> mydata.new <- within(mydata, {
+ x2 <- x ^ 2
+ x2 <- x2 + 100
+ } )
> mydata.new
  x  x2
1 1 101
2 2 104
3 3 109
4 4 116
5 5 125
Since the within function does this example so well, why use anything else? The mutate function shares its syntax with dplyr’s summarise function, and the combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate for this type of task and remember not to transform a variable that I just created!
That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.
Thanks Bob. Your post is a great reminder to look out for the base functions in R, which I keep forgetting, such as ‘within’.
I used R for several years before I fully grasped its usefulness. Another really useful thing to do with either “within” or “mutate” is set your factors, assigning value labels, etc.
Cheers, Bob
One problem with within is that the order of the newly added variables is unspecified – in fact I believe it has changed in the last few versions when by default environments now use hash tables. By using named arguments to pass the expressions you get an explicit ordering, but have the limitation you mentioned that you can’t assign to the same name twice.
I like the fact that with “within” all your R syntax is exactly what it would be if all variables were separate vectors outside the data frame. It’s too bad the order gets flipped like that though. Thanks for the explanation!
Bob
Bless you and posts like this.
I’m glad you found it useful. You might also like “Specifying Variables in R” at http://r4stats.com/2012/09/25/specifying-variables-in-r/. The title sounds like you already know the topic, but I’ll bet some of it surprises you. It’s amazing how such basic topics as these can trip you up.
Cheers,
Bob
This is very useful, because if you struggle with the supposedly easy stuff, the more difficult stuff can be daunting.
Someone said, “R makes hard things easy, and easy things hard” and there is definitely some truth to that. When I started in R, complex analyses seemed relatively easy. I kept getting tripped up on the “easy” parts like this.
Cheers,
Bob
How do you write “between 0 and 15” in R?
Also, how do you write “between 15 and 50” in R?
Thank you.
Thank you, Bob, for such a nice explanation.
Hi Loneharoon,
I’m glad you found it useful. Your comment reminds me to update it to mention that mutate is also in Hadley’s newer (and much better) dplyr package.
Cheers,
Bob