Knoxville R Users Group Formed, Free Training Offered

R is popular free and open-source software for graphics and data analytics. The Knoxville R Users Group is being formed to help people learn R and improve their skills with it. Three departments of The University of Tennessee are working together to get it started: the Office of Information Technology, the National Institute for Computational Sciences’ RDAV (Remote Data Analysis and Visualization) group, and the Department of Statistics, Operations, and Management Science. The latter’s Business Analytics program was recently ranked among the top 20 such programs in the U.S.

To start the group off, I’ll teach a hands-on introductory R workshop on April 29th from 8:00 a.m. to 5:00 p.m. The topics covered are described at http://r4stats.com/workshops/r4sas-spss-stata/. Note that you do not need to know SAS, SPSS, or Stata, but the workshop will include numerous warnings about where R works very differently from those packages. The workshop is free and open to the Knoxville-area public. UT faculty, staff, and students can register at http://oit.utk.edu/training, and non-UT people can register at the user group web site: http://www.meetup.com/Knoxville-R-Users-Group. The course location and materials, including slides, programs, practice data sets, and exercises, will be available at http://www.meetup.com/Knoxville-R-Users-Group on Saturday, April 27 (if not before).

R Tackles Big Garbage

April 1, 2013 – Although the capabilities of the R system for data analytics have been expanding with impressive speed, it has heretofore been missing important fundamental methods. A new function works with the popular plyr package to provide these missing algorithms. Function names in plyr begin with two letters that indicate their input and output. For example, in the ddply function, the first “d” indicates that a data frame will be read in, and the second “d” indicates that a data frame of results will be written out. Those two letters can also be “a” for array or “l” for list, in any combination.
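For readers who have not used plyr, here is a brief, illustrative example of that naming convention, using the mtcars data set that ships with R (the summary computed is purely for demonstration):

    library(plyr)

    # ddply: data frame in, data frame out.
    # Mean miles per gallon for each number of cylinders.
    ddply(mtcars, "cyl", summarise, mean_mpg = mean(mpg))

    # llply: list in, list out, following the same two-letter logic.
    llply(list(a = 1:3, b = 4:6), mean)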

While the vast array of functions in R covers most data analysis situations, those functions have been completely unable to handle data that bears no actual relationship to the research questions at hand. Robert A. Muenchen, author of R for SAS and SPSS Users, has written a new ggply function, which can adroitly handle the all-too-popular “garbage in, garbage out” research situation. The function has only one argument: the garbage to analyze. It automatically performs the analysis strongly preferred by “gg” researchers, splitting numeric variables at the median and performing all possible cross tabulations and chi-square tests, repeated for the levels of all factors. The integration of functions from the new pbdR package allows ggply to handle even Big Garbage using 12,000 cores.
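The ggply function itself is not on CRAN, but a minimal sketch of the median-split-and-cross-tabulate procedure described above might look like the following toy version (the function name and its handling of ties and missing values are my own inventions):

    # Toy rendering of the described procedure: split every numeric
    # variable at its median, then run a chi-square test on every
    # pairwise cross tabulation.
    ggply_sketch <- function(garbage) {
      is_num <- sapply(garbage, is.numeric)
      garbage[is_num] <- lapply(garbage[is_num], function(x)
        ifelse(x > median(x, na.rm = TRUE), "high", "low"))
      vars <- names(garbage)
      results <- list()
      for (i in seq_along(vars)) {
        for (j in seq_along(vars)) {
          if (i < j) {
            tab <- table(garbage[[vars[i]]], garbage[[vars[j]]])
            results[[paste(vars[i], vars[j], sep = " x ")]] <-
              suppressWarnings(chisq.test(tab))
          }
        }
      }
      results
    }

    # Example: 11 variables in mtcars yield 55 pairwise tests.
    length(ggply_sketch(mtcars))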

While the median split approach offers the benefit of decreasing power by 33%, further precautions are taken by applying Muenchen’s new Triple Bonferroni with Backpropagation correction. This algorithm controls the garbage-wise error rate by multiplying the p-values by 3k, where k is the number of tests performed. While most experiment-wise adjustment calculations set the worst-case p-value to the theoretical upper limit of 1.0, simulations run by Muenchen indicate that this is far too liberal for this type of analysis. “By removing this artificial constraint, I have already found cases where the final p-value was as high as 3,287, indicating a very, very, very non-significant result,” reported Muenchen. The “backpropagation” part of the method re-scales any p-values that might have survived the initial correction by setting them automatically to 0.06. As Muenchen states, “this level was chosen to protect the researcher from believing an actual useful result was found, while offering hope that achieving tenure might still be possible.”
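Taken literally, the correction fits in a few lines of R. The sketch below reads “3k” as three times the number of tests and treats p < 0.05 as “surviving” the initial correction; both readings, and the function name, are my own assumptions:

    # Toy version of the described correction: multiply each p-value by
    # 3 * k with no cap at 1.0, then set any survivor to 0.06.
    triple_bonferroni_bp <- function(p, alpha = 0.05) {
      k <- length(p)
      p_adj <- p * 3 * k
      p_adj[p_adj < alpha] <- 0.06
      p_adj
    }

    triple_bonferroni_bp(c(0.0001, 0.04, 0.2, 0.9))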

Reaction from the R community was swift and enthusiastic. Bill Venables, co-author of the popular book Modern Applied Statistics with S, said, “Muenchen’s new approach for calculating Type III Sums of Squares from chi-squared tests finally puts my mind at ease about using R for statistical analysis.” R programmer extraordinaire Patrick Burns said, “The ggply function is good, but what really excites me is the VBA plugin Bob wrote for Excel. Now I can fully integrate ggply into my workflow.” Graphics guru Hadley Wickham, author of ggplot2: Elegant Graphics for Data Analysis, grumbled, “After writing ggplot and ddply, I’m stunned that I didn’t think of ggply myself. That Muenchen fellow is constantly bugging me to add irritating new features to my packages. I have to admit, though, that this is a breakthrough of epic proportions. As they say in Muenchen’s neck of the woods, even a blind squirrel finds a nut now and then.”

The SAS Institute, already concerned about competition from R, reacted swiftly. SAS CEO Jim Goodnight said, “SAS is the leader in Big Data, and we’ll soon catch up to R and become the leader in Big Garbage as well. PROC GGPLY is already in development. It will be included in SAS/GG, which is, of course, an additional-cost product.”