Knoxville R User’s Group Meeting November 1

The next meeting of the Knoxville R User’s Group will consist of four 20-minute talks followed by an open planning session. It will take place on Friday, November 1, from 2:00 p.m. to 4:00 p.m. at The University of Tennessee, Haslam Business Administration Building, room 403 (1000 Volunteer Blvd., Knoxville, TN). RSVP at http://www.meetup.com/Knoxville-R-Users-Group. The topics and speaker biographies are listed below.

Automated Forecasting using R: A Stock Market Example (2:00-2:20)

R’s forecast package can be used to generate automated ARIMA forecasts in a manner comparable to SAS Forecast Server. This talk will demonstrate how to use the quantmod package to query financial data from Yahoo Finance and then feed those data to the forecast package to automatically produce point forecasts and prediction intervals. Examples of how to use each package, including diagnostic plots and results, will be included.
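
As a rough illustration of the workflow the talk describes (the ticker, forecast horizon, and plotting choices here are arbitrary examples, not necessarily those from the talk):

    library(quantmod)   # financial data retrieval
    library(forecast)   # automated ARIMA modeling

    # Download daily prices from Yahoo Finance (ticker chosen arbitrarily)
    getSymbols("SPY", src = "yahoo")
    prices <- Cl(SPY)             # extract the closing prices

    # Fit an ARIMA model automatically, then forecast 30 periods ahead
    fit <- auto.arima(prices)
    fc  <- forecast(fit, h = 30)

    plot(fc)      # point forecasts with 80% and 95% prediction intervals
    tsdiag(fit)   # residual diagnostic plots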

Josh Price earned a BS and an MS in statistics, both from the University of Tennessee. While working on his master’s, he was a graduate assistant for Research Computing Support. After graduating, Josh spent seven years in industry as a consultant in both business and engineering. In January 2013, he returned to UT as a statistical consultant, where he assists students, faculty, and staff with the statistical aspects of their theses, dissertations, and research projects. Josh’s current interests include programming, forecasting methods, and quantitative finance.

BioGeoBEARS: An R package for inference and model testing in historical biogeography (2:20-2:40)

Phylogenetic biogeography is traditionally concerned with inferring ancestral geographic ranges on a phylogeny and with reconstructing the history of events that led to present-day distributions. The field has been dominated for decades by debates about whether vicariance or dispersal is the dominant process. This talk will demonstrate, using BioGeoBEARS, that assumptions about the processes can be subjected to statistical inference from the data, and show that founder-event speciation is a crucial process that has been left out of the current biogeography programs DIVA, LAGRANGE, and BayArea.
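
For readers who want a feel for the package, a minimal sketch of a BioGeoBEARS run might look like the following (the file names are placeholders, and the model setup is simplified from the package’s example scripts):

    library(BioGeoBEARS)

    # Set up a dispersal-extinction-cladogenesis (DEC) analysis
    run <- define_BioGeoBEARS_run()
    run$trfn   <- "my_tree.newick"   # time-calibrated phylogeny (placeholder)
    run$geogfn <- "my_ranges.data"   # species range data (placeholder)
    res_DEC <- bears_optim_run(run)

    # Free the founder-event speciation parameter "j" and refit (DEC+J)
    run$BioGeoBEARS_model_object@params_table["j", "type"] <- "free"
    run$BioGeoBEARS_model_object@params_table["j", "init"] <- 0.01
    run$BioGeoBEARS_model_object@params_table["j", "est"]  <- 0.01
    res_DECJ <- bears_optim_run(run)

    # The two fits can then be compared via a likelihood-ratio test or AIC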

Nicholas J. Matzke is a Postdoctoral Fellow in Mathematical Biology at the National Institute for Mathematical and Biological Synthesis (NIMBioS, www.nimbios.org) at UT Knoxville, and a member of Brian O’Meara’s lab in the Department of Ecology and Evolutionary Biology. He is also the author of the BioGeoBEARS package.

Elevating R to Supercomputers (2:40-3:00)

The biggest supercomputing platforms in the world are distributed memory machines, yet the overwhelming majority of development on parallel R infrastructure has been devoted to small shared memory machines. Moreover, most of that development focuses on task parallelism rather than data parallelism. As big data analytics becomes ever more attractive to both users and developers, R increasingly needs distributed computing infrastructure that can utilize these large resources. The Programming with Big Data in R (pbdR) project aims to provide such infrastructure, elevating the R language to massive-scale computing platforms. This talk will cover some of the pbdR project’s early successes, benchmarks, challenges, and future plans.
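
To give a flavor of the single-program/multiple-data (SPMD) style that pbdR uses, here is a minimal sketch with the pbdMPI package (the data and the reduction operation are arbitrary examples):

    # Save as hello_pbd.R and launch with, e.g.: mpiexec -np 4 Rscript hello_pbd.R
    library(pbdMPI)
    init()

    # Each MPI rank computes on its own chunk of data
    local_sum <- sum(rnorm(1000))

    # Combine the partial results across all ranks
    total <- allreduce(local_sum, op = "sum")
    comm.print(total)   # printed once, by rank 0

    finalize()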

Drew Schmidt is a researcher at the University of Tennessee’s National Institute for Computational Sciences, and is primarily interested in the intersection of mathematics, statistics, and high-performance computing.  He is co-lead developer of the Programming with Big Data in R (pbdR) project, which elevates the statistics programming language R to large distributed computing platforms.

BREAK (3:00-3:10)

Analyzing Data by Group Using R’s plyr Package (3:10-3:30)

A common data analysis task is repeating an analysis for each group within a data set. In most analytics software this is trivial, requiring only a single statement, such as SAS’s BY statement. In R, however, you must write a function and apply it by group. That function can be simple if you only want to print the results, but if you wish to analyze those results further, you may need a series of functions to apply. We’ll go over an example of each case, showing why it goes so quickly from simple to complex. This talk will use various tools from the popular plyr package to apply the functions.
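
As a preview, here is a minimal plyr sketch of both cases, using R’s built-in mtcars data (the talk’s own examples may differ):

    library(plyr)

    # Simple case: summarize mpg within each cylinder group
    ddply(mtcars, "cyl", summarise,
          mean_mpg = mean(mpg),
          sd_mpg   = sd(mpg))

    # More involved case: fit one regression per group, then collect results
    models <- dlply(mtcars, "cyl", function(df) lm(mpg ~ wt, data = df))
    ldply(models, coef)   # one row of coefficients per cylinder group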

Bob Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob is an Accredited Professional Statistician™ with 32 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations.  He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

Quo Vadis KRUG?  (3:30-4:00)

The Knoxville R User’s Group, or KRUG, started off with a series of workshops, but it’s well past time to discuss where KRUGgers would like to take it. How often should we meet? How long should the talks be? Is the Friday afternoon time slot good? Is meeting at UT sufficient, or should we move the meeting around (anyone have space)? Everything is up for discussion, so we’ll devote this final session to mulling it over.

What R Has Been Missing

While R has more methods than any other analytics software, it has been missing a crucial feature found in most other packages. SPSS Modeler had it first, way back when they still called it Clementine. Then SAS Institute realized how crucial it was to productivity and added it to Enterprise Miner. As its reputation spread, it was added to RapidMiner, Knime, Statistica, Weka, and others. An early valiant attempt was made to add it to R. What is it?  It’s the flowchart-style graphical user interface. And it will soon be available in R at last.

While menu-driven interfaces such as R Commander, Deducer, or SPSS are somewhat easier to learn, the flowchart interface has two important advantages. First, you can often grasp the big picture at a glance as you see steps such as separate files merging into one, or several analyses flowing out of a single data set (see figure). Second, and more important, you have a precise record of every step in your analysis, which lets you repeat an analysis simply by changing the data inputs. Menu-driven interfaces, by contrast, require that you switch to the programs they create in the background if you need to automatically re-run many previous steps. That’s fine if you’re a programmer, but if you were a good programmer, you probably would not have been using that type of interface in the first place!

[Figure: Alteryx’s flowchart user interface, soon to be added to Revolution R Enterprise.]

This week Revolution Analytics and Alteryx announced that future versions of Revolution R Enterprise will include Alteryx’s flowchart-style graphical user interface. Alteryx has traditionally focused on the analysis of spatial data, only adding predictive analytics in 2012 (skip 37 minutes into this presentation). This partnership will also allow them to add Revolution’s big data features to various Alteryx products. Both companies are likely to get a significant boost in sales as a result.

While I expect both companies will benefit from this partnership, they could do much better. How? By making the Alteryx interface available for the community (free) version of R. If most R users were familiar with this interface, they would be much more likely to choose Alteryx’s tools when they needed them, instead of a competitor’s. When people needed big data tools for R, they’d be more likely to turn to Revolution Analytics. I am convinced that as great as R’s success has been, it could be greater still with a top-quality flowchart user interface freely available to all R users. Given the great advantages this type of interface offers, it’s only a matter of time until a free version appears. The only question is: who will offer it?

[Update: It turns out that Alteryx is already offering a free version that works with the community version of R! See the comment from Dan Putler, product manager and one of the primary developers of Alteryx’s R-based predictive analytics and business data mining tools. I’ll be trying this out and will report my experiences in a future blog post.]