Learn R and/or Data Management from Home January or April

R--67

If you want to learn R, or improve your current R skills, join me for two workshops that I’m offering through Revolution Analytics in January and April.

If you already know another analytics package, the workshop, Intro to R for SAS, SPSS and Stata Users may be for you. With each R concept, I’ll introduce it using terminology that you already know,  then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way if you need more explanation later, or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology so you can use either to find what you need. You can see a complete out line and register for the workshop starting January 13 (click here) or April 21 (click here).

If you already know R, but want to learn more about how you can use R to get your data ready to analyze, the workshop Managing Data with R will demonstrate how to use the 15 most widely used data management tasks. The course outline and registration is available here for January and here for April.

If you have questions about any of these courses, drop me a line a muenchen.bob@gmail.com. I’m always available to answer questions regarding any of my books or workshops.

Learn R and/or Data Management from Home October 7-11

R--67

If you want to learn R, or improve your current R skills, join me for two workshops that I’m offering through Revolution Analytics in October.

If you already know another analytics package, the workshop, Intro to R for SAS, SPSS and Stata Users may be for you. With each R concept, I’ll introduce it using terminology that you already know,  then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way if you need more explanation later, or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology so you can use either to find what you need. You can see a complete out line and register for the workshop here.

If you already know R, but want to learn more about data management, the workshop Managing Data with R will demonstrate how to use the 15 most widely used data management tasks. That course outline and registration is here.

If you have questions about any of these courses, drop me a line a muenchen.bob@gmail.com. I’m always available to answer questions regarding any of my books or workshops.

 

Webinar: Managing Data with R

Before you can analyze data, it must be in the right form. Join Revolution Analytics and me this June 21st for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr and lubridate.

R--143

Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example which you have already seen.

At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!

Comparing Transformation Styles: attach, transform, mutate and within

There are several ways to perform data transformations in R. Each has its own set of advantages and disadvantages. Let’s take one variable, square it and add 100. How many ways might an R beginner screw up such a simple computation? Quite a few!

Here’s a data frame with one variable:

> mydata <- data.frame(x = 1:5)
> mydata

  x
1 1
2 2
3 3
4 4
5 5

Since the variable x exists only in mydata, to transform x, I must somehow tell R it is stored in mydata. The simplest way to do that is using dollar format: mydata$x. I’ll make a copy of the data first so we can do the transformation several ways:

> mydata.new <- mydata

> mydata.new$x2 <- mydata.new$x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

That works, but I had to type more characters for the “mydata.new” part than I did for the transformation itself. So let’s look at approaches that save us that trouble. One widely used approach is to use the attach function. This function makes a copy of a data frame’s variables in a temporary area that is attached to your search path as separate variables or vectors. That’s nice because you can refer to them simply by their names like “x” instead of “mydata$x”. However, the attach function is tricky to use. Here’s the most common mistake made by beginners:

> mydata.new <- mydata
> attach(mydata.new)
> x2 <- x  ^ 2 
> x3 <- x2 + 100

> mydata.new
  x
1 1
2 2
3 3
4 4
5 5

There are no error messages, but the variables are not in the data frame! The attach function allows you to use short names to refer to variables in a data frame, but it does not change where new variables are written. So x2 and x3 are simply in my workspace:

> ls()
[1] "mydata" "mydata.new" "x2" "x3"

> x2; x3

[1]  1  4  9 16 25
[1] 101 104 109 116 125

I’ll fix that, but first I’ll remove x2 and x3 from the workspace and detach mydata.new so we can start fresh.

> rm(x2, x3)
> detach(mydata.new)

We can fix this problem by directing new variables into the data frame using dollar format. So here’s the next thing a beginner is likely to try:

> mydata.new <- mydata
> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- x2 + 100
Error: object 'x2' not found

> detach(mydata.new)

The variable x2 got created and put into mydata.new. However, when the attempt to create x3 was run, variable x2 could not be found. This is due to the fact that the attached version of the data is a copy that was done in the past, it is not a live connection. Therefore, to refer to simply “x2” you would have to attach mydata.new again. You could also get around this problem by using dollar format in the second equation:

> attach(mydata.new)

> mydata.new$x2 <- x  ^ 2
> mydata.new$x3 <- mydata.new$x2 + 100

> mydata.new

  x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

> detach(mydata.new)

That worked, but having to keep track of when you do and don’t need dollar format seems more trouble than it’s worth. In addition, the fact that attach actually makes a copy of the data means that it wastes both time and memory.

The transform function lets you use short variable names on both sides of the equation, and it does not need to make a copy of the data set. Let’s just square x to see how it works.

> mydata.new <- transform(mydata, x2 = x ^ 2)

> mydata.new

  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

Notice that when calling the transform function, new variable names like x2 are actually the names of arguments, and the formulas are the values of those arguments. As a result, the equals sign is used instead of the assignment operator “<-”.

Eliminating the tedious repetition of “mydata$…” makes the formulas easier to enter, read and debug. However, the transform function has a problem: it is unable to use a variable that it just created. For example:

> mydata.new <- transform(mydata,
+                 x2 = x  ^ 2,
+                 x3 = x2 + 100
+                 )

Error in eval(expr, envir, enclos) : object 'x2' not found

We see that when attempting to create x3 from x2, the variable x2 is not found. It will not exist until the call to transform is complete. In our simple example, x2 may be merely an intermediate step, and we could avoid this problem by calculating x3 directly with one formula: x3 = (x ^ 2) + 100. However, if we really need x2 to exist later as a variable, we would have to run transform twice, once to create x2 and again to create x3 from it.

In the above code, note the comma between the two equations. Since transform uses equations as the values of tranform’s arguments, all equations must be followed by commas, except for the last one, which is followed by the final close parenthesis.

Hadley Wickham’s dplyr package has a very useful function, mutate. It’s very similar to the base transform function but it can use variables that it just created:

> library("dplyr")
> mydata.new <- mutate(mydata,
+                 x2 = x ^ 2,
+                 x3 = x2 + 100
+                 )

> mydata.new 

x x2  x3
1 1  1 101
2 2  4 104
3 3  9 109
4 4 16 116
5 5 25 125

However, mutate does have a limitation: it cannot re-create a variable that it just created. So you can use its new variables only on the right-hand side of your equations. In this next example, rather than create x3, I’ll continue to use the name x2:

> mydata.new <- mutate(mydata,
+                 x2 = x  ^ 2, 
+                 x2 = x2 + 100
+                 )

> mydata.new

  x x2
1 1  1
2 2  4
3 3  9
4 4 16
5 5 25

As you can see, mutate kept only the first transformation to x2, ignoring the addition of 100. You might think that reusing the same variable name would be a rare occurrence, but if you are recoding a variable using the ifelse function (albeit inefficiently) this situation can arise often. (Avoid that by nesting multiple calls to ifelse, which is also more efficient.)

Finally, we come to the within function. It uses variables by their short names, saves new variables inside the data frame using short names, and it allows you to use new variables anywhere in calculations. It is built into base R, and it works like this:

> mydata.new <- within(mydata, {
+              x2 <- x  ^ 2
+              x3 <- x2 + 100
+              } )

> mydata.new

  x  x3 x2
1 1 101  1
2 2 104  4
3 3 109  9
4 4 116 16
5 5 125 25

Notice that we’re back to using the assignment operator “<-” and commas are not used between formulas. Multiple formulas must be enclosed in {braces}. Also note that the variables appear in the data frame in reverse order. Variable x3 appears before x2, even though the formula for x2 appeared first.

When I reuse the variable name x2 rather than create a new variable, x3, I still get the right answer:

> mydata.new <- within(mydata, {
+               x2 <- x  ^ 2
+               x2 <- x2 + 100
+               } )

> mydata.new

  x  x2
1 1 101
2 2 104
3 3 109
4 4 116
5 5 125

Since the within function does this example so well, why use anything else? The mutate function shares syntax with dplyr’s summarise function and their combination provides great flexibility when doing transformations or getting summary statistics by groups. Because of this, I use mutate to do this type of task and remember to not transform a variable that I just created!

That covers the main ways to transform variables in R. I hope that by understanding the limitations of each, you’ll avoid common pitfalls and be a more productive R user.