by Bob Muenchen
The recently-created R Consortium consists of companies that are deeply involved in R such as RStudio, Microsoft/Revolution Analytics, Tibco, and others. The Consortium’s goals include advancing R’s worldwide promotion and support, encouraging user adoption, and improving documentation and tools. Those are admirable goals and below I suggest a few specific examples that the consortium might consider tackling.
As I work with various organizations to help them consider migrating to R, common concerns are often raised. With thousands of packages to choose from, where do I start? Do packages go through any reliability testing? What if I start using a package and its developer abandons it? These, and others, are valid concerns that the R Consortium could address.
Choosing Packages
New R users face a daunting selection of thousands of packages. Some guidance is provided by CRAN’s Task Views. In R’s early years, this area was quite helpful in narrowing down a package search. However, R’s success has decreased the usefulness of Task Views. For example, say a professor asks a grad student to look into doing a cluster analysis. In SAS, she’ll have to choose among seven procedures. When considering the Task View on the subject, she’ll be presented with 105 choices in six categories! The greater selection is one of R’s strengths, but to encourage the adoption of R by a wider community it would be helpful to list the popularity of each package. The more popular packages are likely to be the most useful.
R functions are integrated into other software such as Alteryx, IBM SPSS Statistics, KNIME, and RapidMiner. Some are also called from R user interfaces such as Deducer, R Commander, and RATTLE. Within R, some packages depend on others, adding another vote of confidence. The R Consortium could help R users by documenting these various measures of popularity, perhaps creating an overall composite score.
Accuracy
People often ask how they can trust the accuracy (or reliability) of software that is written by a loosely knit group of volunteers, when there have even been notable lapses in the accuracy of commercial software developed by corporate teams [1]. Base R and its “recommended packages” are very well tested, and the details of the procedures are documented in The R Software Development Life Cycle. That set of software is substantial, the equivalent of Base SAS + GRAPH + STAT + ETS + IML + Enterprise Miner (excluding GUIs, Structural Equation Modeling, and Multiple Imputation, which are in add-on packages). Compared to SPSS, it’s the rough equivalent to IBM SPSS Base + Statistics + Advanced Stat. + Regression + Forecasting + Decision Trees + Neural Networks + Bootstrapping.
While that set is very capable, it still leaves one wondering about all the add-on packages. Performing accuracy tests is very time consuming work [2-5] and even changing the options on the same routine can affect accuracy [6]. Increasing the confidence that potential users have in R’s accuracy would help to increase the use of the software, one of the Consortium’s goals. So I suggest that they consider ways to increase the reliability testing of functions that are outside the main R packages.
Given the vast number of R packages available, it would be impossible for the Consortium to perform such testing on all packages. However, for widely used packages, it might behoove the Consortium to use its resources to develop such tests themselves. A web page that referenced Consortium testing, as well as testing from any source, would be helpful.
Package Longevity
If enough of a package’s developers got bored and moved on or, more dramatically, were hit by the proverbial bus, development would halt. Base R plus recommended packages has the whole R Development Core Team backing them up. Other packages are written by employees of companies. In such cases, it is unclear whether the packages are supported by the company or by the individual developer(s).
Using the citation function will list a package’s developers. The more there are, the better chance there is of someone taking over if the lead developer moves on. The Consortium could develop a rating system that would provide guidance along these lines. Nothing lasts forever, but knowing the support level a package has would be of great help when choosing which to use.
Encourage Support and Use of Key Generic Functions
Some fairly new generic functions play a key role in making R easier to use. For example, David Robinson’s broom package contains functions that translate the output of modeling functions from list form into data frames, making output management much easier. Other packages, including David Dahl’s xtable and Philip Leifeld’s texreg, do a similar translation to present the output in nicely formatted forms for publishing. Those developers have made major contributions to R by writing all the methods themselves. The R Consortium could develop a list of such functions and encourage other developers to add methods to them, when appropriate. Such widely applicable functions could also benefit from having the R Consortium support their development, assuring longer package longevity and wider use.
Output to Microsoft Word
R has the ability to create beautiful output in almost any format you would like, but it takes additional work. Its competition, notably SAS and SPSS, let you choose the font and full formatting of your output tables at installation. From then on, any time you want to save output to a word processor, it’s a simple cut & paste operation. SPSS even formats R output to fully formatted tables, unlike any current R IDEs. Perhaps the R Consortium could pool the resources needed to develop this kind of output. If so, it would be a key aspect of their goal of speeding R’s adoption. (I do appreciate the greater power of LaTeX and the ease of use of knitr and Rmarkdown, but they’ll never match the widespread use of Word.)
Graphical User Interface
Programming offers the greatest control over an analysis, but many researchers don’t analyze data often enough to become good programmers; many simply don’t like programming. Graphical User Interfaces (GUIs) help such people get their work done more easily. The traditional menu-based systems, such as R Commander or Deducer, make one-time work easy, but they don’t offer a way to do repetitive projects without relying on the code that non-programmers wish to avoid.
Workflow-based GUIs are also easy to use and, more importantly, they save all the steps as a flowchart. This allows you to check your work and repeat it on another data set simply by updating the data import node(s) and clicking “execute.” To take advantage of this approach, Microsoft’s Revolution R Enterprise integrates into Alteryx and KNIME, and Tibco’s Enterprise Runtime for R integrates into KNIME as well. Alteryx is a commercial package, and KNIME is free and open source on the desktop. While both have commercial partners, each can work with the standard community version of R as well.
Both packages contain many R functions that you can control with a dialog box. Both also allow R programmers to add a programming node in the middle of the workflow. Those nodes can be shared, enabling an organization to get the most out of both their programming and non-programming analysts. Both systems need to add more R nodes to be considered general-purpose R GUIs, but they’re making fairly rapid progress on that front. In each system, it takes less than an hour to add a node to control a typical R function.
The R Consortium could develop a list of recommended steps for developers to consider. One of these steps could be adding nodes to such GUIs. Given the open source nature of R, encouraging the use of the open source version of KNIME would make the most sense. That would not just speed the adoption of R, it would enable its adoption by the large proportion of analysts who prefer not to program. For the more popular packages, the Consortium could consider using their own resources to write such nodes.
Conclusion
The creation of the R Consortium offers an intriguing opportunity to expand the use of R around the world. I’ve suggested several potential goals for the Consortium, including ways to help people choose packages, increase reliability testing, rating package support levels, increasing visibility of key generic functions, adding support for Word, and making R more accessible through stronger GUI support. What else should the R Consortium consider? Let’s hear your ideas in the comments section below.
Is your organization still learning R? I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.
Acknowledgements
Thanks to Drew Schmidt and Michael Berthold for their suggestions that improved this post.
References
- Micah Altman (2002), A Review of JMP 4.03 With Special Attention to its Numerical Accuracy, The American Statistician, 56:1, 72-75, DOI: 10.1198/000313002753631402
- D. McCullough (1998), Assessing the Reliability of Statistical Software: Part I, The American Statistician, 52:4, 358-366
- D. McCullough (1999), Assessing the Reliability of Statistical Software: Part II, The American Statistician, 53:2, 149-159
- Kellie B. Keeling, Robert J. Pavur (2007), A comparative study of the reliability of nine statistical software packages, Computational Statistics & Data Analysis, Vol. 51, Issue 8, pp. 3811-3831
- Oluwartotimi O. Odeh, Allen M. Featherstone and Jason S. Bergtold (2010), Reliability of Statistical Software, American Journal of Agricultural Economics,doi: 1093/ajae/aaq068
- Jason S. Bergtold, Krishna Pokharel and Allen Featherstone (2015), Selected Paper prepared for presentation at the 2015 Agricultural & Applied Economics Association and Western Agricultural Economics Association Annual Meeting, San Francisco, CA, July 26-28