R workshop series presented at a major pharmaceutical company. Photography by Stephen Bernard.

Two new R packages are quickly becoming standards in the R community: Hadley Wickham’s dplyr and tidyr. The dplyr package almost completely replaces his popular plyr package for data manipulation. Most importantly for general R use, it makes it much easier to select variables. For example, if your data included variables for race, gender, pretest, posttest, and four survey items q1 through q4, you could select various sets of variables using:

    library("dplyr")
    select(mydata, race, gender)         # Just those two variables.
    select(mydata, gender:posttest)      # From gender through posttest.
    select(mydata, contains("test"))     # Gets pretest & posttest.
    select(mydata, starts_with("q"))     # Gets all vars starting with "q".
    select(mydata, ends_with("test"))    # All vars ending with "test".
    select(mydata, num_range("q", 1:4))  # q1 thru q4 regardless of location.
    select(mydata, matches("^q"))        # Matches any regular expression.

As I show in my books, these were all possible in R before, but they required much more programming.
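
To make that comparison concrete, here is a minimal sketch that sets a few dplyr selections next to one possible base R equivalent. The toy `mydata` data frame and its values are hypothetical, made up only to match the variable names described above, and the base R lines are just one of several ways to do the same thing.

```r
library(dplyr)

# Hypothetical toy data with the variables described in the post.
mydata <- data.frame(
  race     = c("A", "B", "A"),
  gender   = c("f", "m", "f"),
  pretest  = c(70, 65, 80),
  posttest = c(75, 72, 88),
  q1 = c(1, 2, 3), q2 = c(2, 3, 4), q3 = c(3, 4, 5), q4 = c(4, 5, 1)
)

# dplyr: variables whose names start with "q".
q_dplyr <- select(mydata, starts_with("q"))

# Base R: search the column names with a regular expression.
q_base <- mydata[, grep("^q", names(mydata)), drop = FALSE]

# dplyr: a range of adjacent variables, by name.
range_dplyr <- select(mydata, gender:posttest)

# Base R: look up the positions of the endpoints, then index by position.
first <- which(names(mydata) == "gender")
last  <- which(names(mydata) == "posttest")
range_base <- mydata[, first:last]

# dplyr: variables whose names contain "test" (case-insensitive by default).
test_dplyr <- select(mydata, contains("test"))

# Base R: fixed-string match on the names (case-sensitive here).
test_base <- mydata[, grepl("test", names(mydata), fixed = TRUE), drop = FALSE]

# Each pair picks the same columns.
identical(names(q_dplyr), names(q_base))
identical(names(range_dplyr), names(range_base))
identical(names(test_dplyr), names(test_base))
```

The base R versions work, but each one needs an explicit detour through `names()`, which is the extra programming the dplyr helpers hide.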

The tidyr package replaces Hadley’s popular reshape and reshape2 packages with a data reshaping approach that is simpler and more focused just on the reshaping process, especially converting from “wide” to “long” form and back.
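
As a rough illustration of that wide-to-long round trip, the sketch below uses tidyr's `gather()` and `spread()` on a small hypothetical data frame (the `wide` data and its values are invented for the example). Newer tidyr releases also provide `pivot_longer()` and `pivot_wider()` for the same job.

```r
library(tidyr)

# Hypothetical "wide" data: one row per subject, one column per test.
wide <- data.frame(
  id       = 1:3,
  pretest  = c(70, 65, 80),
  posttest = c(75, 72, 88)
)

# Wide to long: stack the two test columns into a test/score pair.
long <- gather(wide, test, score, pretest, posttest)

# Long back to wide: spread the test/score pair into columns again.
wide_again <- spread(long, test, score)

long
wide_again
```

Note that `spread()` orders the new columns by the values of the key, so the columns of `wide_again` may come back in a different order than in `wide`.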

I’ve integrated dplyr into my workshop R for SAS, SPSS and Stata Users, and both tidyr and dplyr now play extensive roles in my Managing Data with R workshop. I’m presenting the next Virtual Instructor-led Classroom (webinar) versions of those workshops in partnership with Revolution Analytics during the week of October 6, 2014. I’m also available to teach them at your organization’s site in partnership with RStudio.com (contact me at Muenchen.bob@gmail.com to schedule a visit). These workshops will also soon be available 24/7 at Datacamp.com. “You’ll be able to take Bob’s popular workshops using an interactive combination of video and live exercises in the comfort of your own browser,” said Jonathan Cornelissen, CEO of Datacamp.com.


Tidyr is not a replacement for reshape2. Its sole purpose is tidying your data. There are no aggregation calculations in tidyr as there are in reshape2. I initially found it confusing to know when to use each of his packages. After using them for a while, it starts to make sense.

Hi Markus,

Good point. That’s what I meant by saying tidyr is “more focused just on the reshaping process”. I just didn’t want to go into how tidyr and dplyr work together to do the aggregation you referred to using reshape2.
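
For anyone curious, here is one way that combination might look: tidyr reshapes, then dplyr aggregates. This is only a sketch on hypothetical data (the `wide` data frame and the `mean_score` name are made up for illustration), roughly covering what reshape2 could do in a single `dcast()` call with an aggregation function.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two test scores per person, plus a grouping variable.
wide <- data.frame(
  gender   = c("f", "m", "f", "m"),
  pretest  = c(70, 65, 80, 75),
  posttest = c(75, 72, 88, 80)
)

# tidyr does only the reshaping: wide to long.
long <- gather(wide, test, score, pretest, posttest)

# dplyr does the aggregation: mean score for each gender and test.
means <- long %>%
  group_by(gender, test) %>%
  summarise(mean_score = mean(score)) %>%
  ungroup()

means
```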

Cheers,

Bob

Hi Bob,

I have a large volume of data to deal with (around 1 or 2 GB). Are these two packages optimized for big data in terms of computational efficiency?

Cheers

Yu

Hi Yu,

If your data is stored in a relational database, dplyr can translate its commands to SQL and send them to the database for execution. If what the commands return will fit into your computer’s memory, you’ll be all set. If not, you might want to work on a random sample of the data. For example, a sample of 10,000 records would give you extremely tight confidence intervals while fitting into the memory of most computers. I’m not sure if tidyr offers that same type of in-database processing or not.
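
Here is a small sketch of that in-database approach. It assumes a current dplyr release, where the SQL translation lives in the companion dbplyr package, and that the DBI, RSQLite, and dbplyr packages are installed. An in-memory SQLite database and a made-up `mydata` table stand in for a real database server.

```r
library(dplyr)

# Assumption: DBI, RSQLite, and dbplyr are installed. The in-memory SQLite
# database and the tiny table below are hypothetical stand-ins.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mydata", data.frame(
  gender   = c("f", "m", "f", "m"),
  posttest = c(75, 72, 88, 80)
))

# A lazy reference to the table; no rows are pulled into R yet.
remote <- tbl(con, "mydata")

# Ordinary dplyr verbs; nothing runs until the result is requested.
query <- remote %>%
  group_by(gender) %>%
  summarise(mean_post = mean(posttest, na.rm = TRUE))

show_query(query)         # Prints the SQL that dplyr generated.
result <- collect(query)  # Runs it in the database, returns only the result.

DBI::dbDisconnect(con)
result
```

Only the two summary rows travel back to R, which is why this works even when the full table would not fit in memory.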

Cheers,

Bob