R Tackles Big Garbage

April 1, 2013 – Although the capabilities of the R system for data analytics have been expanding with impressive speed, it has heretofore been missing important fundamental methods. A new function works with the popular plyr package to provide these missing algorithms. Function names in plyr begin with two letters which indicate their input and output. For example, with the ddply function, the first “d” in its name indicates that a data frame will be read in, and the second “d” indicates that a data frame of results will be written out. Those two letters could also be “a” for array and “l” for list, in any combination.

While the vast array of functions in R cover most data analysis situations, they have been completely unable to handle data that bears no actual relationship to the research questions at hand. Robert A. Muenchen, author of R for SAS and SPSS Users, has written a new ggply function, which can adroitly handle the all too popular “garbage in, garbage out” research situation. The function has only one argument, the garbage to analyze. It automatically performs the analysis strongly preferred by “gg” researchers by splitting numeric variables at the median and performing all possible cross tabulations and chi-square tests, repeated for the levels of all factors. The integration of functions from the new pbdR package allows ggply to handle even Big Garbage using 12,000 cores.

While the median split approach offers the benefit of decreasing power by 33%, further precautions are taken by applying Muenchen’s new Triple Bonferroni with Backpropagation correction. This algorithm controls the garbage-wise error rate by multiplying the p-values by 3k, where k is the number of tests performed. While most experiment-wise adjustment calculations set the worst case p-value to the theoretical upper limit of 1.0, simulations run by Muenchen indicate that this is far too liberal for this type analysis. “By removing this artificial constraint, I have already found cases where the final p-value was as high as 3,287 indicating a very, very, very non-significant result” reported Muenchen. The “backpropogation” part of the method re-scales any p-values that might have survived the initial correction by setting them automatically to 0.06. As Muenchen states, “this level was chosen to protect the researcher from believing an actual useful result was found, while offering hope that achieving tenure might still be possible.”

Reaction from the R community was swift and enthusiastic. Bill Venables, co-author the popular book Modern Applied Statistics in S said, “Muenchen’s new approach for calculating Type III Sums of Squares from chi-squared tests finally puts my mind at ease about using R for statistical analysis.” R programmer extraordinaire Patrick Burns said, “The ggply function is good, but what really excites me is the VBA plugin Bob wrote for Excel. Now I can fully integrate ggply into my workflow.” Graphics guru Hadley Wickham, author of ggplot2: Elegant Graphics for Data Analysis grumbled, “After writing ggplot and ddply, I’m stunned that I didn’t think of ggply myself. That Muenchen fellow is constantly bugging me to add irritating new features to my packages. I have to admit though that this is breakthrough of epic proportions. As they say in Muenchen’s neck of the woods, even a blind squirrel finds a nut now and then.”

The SAS Institute, already concerned with competition from R, reacted swiftly. SAS CEO Jim Goodnight said, “SAS is the leader in Big Data, and we’ll soon catch up to R and become the leader in Big Garbage as well. PROC GGPLY, is already in development. It will be included in SAS/GG, which is, of course, an additional cost product.”

About these ads

About Bob Muenchen

I help researchers analyze their data and write books about research computing.
This entry was posted in Analytics, R, SAS, Statistics, Uncategorized. Bookmark the permalink.

10 Responses to R Tackles Big Garbage

  1. Robert Young says:

    Well, it’s about time!

    • Bob Muenchen says:

      Hi Robert,

      Seeing your blog reminds me of the old saying about third normal form depending on “the key, the whole key and nothing but the key, so help me Codd.”

      Cheers,
      Bob

  2. highgamma says:

    It’s a good day for such an announcement.

  3. carlwitthoft says:

    Everyone’s a comedian :-)

  4. David Carlson says:

    I’m sure Patrick Burns is already working on a 10th Circle devoted to ggply.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s