Goals for the New R Consortium

by Bob Muenchen

The recently created R Consortium consists of companies that are deeply involved in R, such as RStudio, Microsoft/Revolution Analytics, Tibco, and others. The Consortium’s goals include advancing R’s worldwide promotion and support, encouraging user adoption, and improving documentation and tools. Those are admirable goals, and below I suggest a few specific projects the Consortium might consider tackling.

As I work with various organizations to help them consider migrating to R, the same concerns are raised again and again. With thousands of packages to choose from, where do I start? Do packages go through any reliability testing? What if I start using a package and its developer abandons it? These and other concerns are valid ones that the R Consortium could address.

Choosing Packages

New R users face a daunting selection of thousands of packages. CRAN’s Task Views provide some guidance, and in R’s early years they were quite helpful in narrowing down a package search. However, R’s success has decreased their usefulness. For example, say a professor asks a grad student to look into doing a cluster analysis. In SAS, she’ll have to choose among seven procedures. When she consults the Task View on the subject, she’ll be presented with 105 choices in six categories! The greater selection is one of R’s strengths, but to encourage the adoption of R by a wider community, it would be helpful to list the popularity of each package. The more popular packages are likely to be the most useful.

R functions are integrated into other software such as Alteryx, IBM SPSS Statistics, KNIME, and RapidMiner. Some are also called from R user interfaces such as Deducer, R Commander, and RATTLE. Within R, some packages depend on others, adding another vote of confidence. The R Consortium could help R users by documenting these various measures of popularity, perhaps creating an overall composite score.
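
As a rough illustration, here is a minimal sketch of such a composite score in R. It assumes the cranlogs package for download counts, and the log-scale weighting is purely hypothetical; the Consortium would need to calibrate any real score.

# A minimal sketch of a composite popularity score, assuming the
# 'cranlogs' package (download counts from the RStudio CRAN mirror).
library(cranlogs)

popularity_score <- function(pkg, db = available.packages()) {
  # Downloads over the last month
  downloads <- sum(cran_downloads(pkg, when = "last-month")$count)
  # Reverse dependencies: CRAN packages that depend on or import pkg
  rev_deps <- tools::package_dependencies(pkg, db = db, reverse = TRUE)[[pkg]]
  # Combine on a log scale so neither measure dominates (hypothetical weights)
  log10(downloads + 1) + log10(length(rev_deps) + 1)
}

popularity_score("cluster")  # e.g., one of the many clustering packages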

Accuracy

People often ask how they can trust the accuracy (or reliability) of software written by a loosely knit group of volunteers, especially when there have been notable lapses in the accuracy of commercial software developed by corporate teams [1]. Base R and its “recommended packages” are very well tested, and the testing procedures are documented in The R Software Development Life Cycle. That set of software is substantial, the equivalent of Base SAS + GRAPH + STAT + ETS + IML + Enterprise Miner (excluding GUIs, structural equation modeling, and multiple imputation, which are in add-on packages). Compared to SPSS, it’s roughly equivalent to IBM SPSS Base + Statistics + Advanced Statistics + Regression + Forecasting + Decision Trees + Neural Networks + Bootstrapping.

While that set is very capable, it still leaves one wondering about all the add-on packages. Performing accuracy tests is very time-consuming work [2-5], and even changing the options on the same routine can affect accuracy [6]. Increasing the confidence that potential users have in R’s accuracy would help increase the use of the software, one of the Consortium’s goals. So I suggest that the Consortium consider ways to increase the reliability testing of functions outside the main R packages.
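
To make the flavor of such testing concrete, here is a tiny example in the style McCullough describes [2-3]: checking that a routine remains accurate on poorly scaled data. The tolerance is an arbitrary choice for illustration.

# A classic reliability check: variance should be unchanged by a large
# additive shift, which defeats naive one-pass algorithms. R's var()
# uses a numerically stable method, so this test passes.
x <- c(1, 2, 3, 4, 5)
shifted <- x + 1e9  # same variance, much larger magnitude
stopifnot(isTRUE(all.equal(var(x), var(shifted), tolerance = 1e-8)))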

Given the vast number of R packages available, it would be impossible for the Consortium to test them all. However, for widely used packages, it might behoove the Consortium to use its resources to develop such tests itself. A web page that collected the results of Consortium testing, as well as testing from any other source, would be helpful.

Package Longevity

If enough of a package’s developers got bored and moved on or, more dramatically, were hit by the proverbial bus, development would halt. Base R and the recommended packages have the entire R Development Core Team backing them up. Other packages are written by company employees; in such cases, it is often unclear whether a package is supported by the company or by the individual developer(s).

Running the citation function lists a package’s developers. The more there are, the better the chance of someone taking over if the lead developer moves on. The Consortium could develop a rating system that would provide guidance along these lines. Nothing lasts forever, but knowing the level of support behind a package would be a great help when choosing which to use.
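
For example, a quick look at who stands behind a package (using Matrix purely as an example):

# citation() and the DESCRIPTION file both reveal a package's developers.
citation("Matrix")                       # developers and how to cite them
packageDescription("Matrix")$Author      # the Author field of DESCRIPTION
packageDescription("Matrix")$Maintainer  # who currently maintains it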

Encourage Support and Use of Key Generic Functions

Some fairly new generic functions play a key role in making R easier to use. For example, David Robinson’s broom package contains functions that translate the output of modeling functions from list form into data frames, making output management much easier. Other packages, including David Dahl’s xtable and Philip Leifeld’s texreg, do a similar translation to present the output in nicely formatted form for publishing. Those developers have made major contributions to R by writing all the methods themselves. The R Consortium could compile a list of such functions and encourage other developers to add methods to them where appropriate. Such widely applicable functions would also benefit from having the R Consortium support their development, ensuring greater package longevity and wider use.
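
A brief example of why these generics matter, using R’s built-in mtcars data:

# broom's generics convert list-style model output into data frames
# that you can filter, sort, and pass to other functions.
library(broom)
fit <- lm(mpg ~ wt + hp, data = mtcars)
tidy(fit)    # one row per coefficient: term, estimate, std.error, ...
glance(fit)  # a one-row model summary: r.squared, AIC, and so on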

Output to Microsoft Word

R can create beautiful output in almost any format you would like, but it takes additional work. Its competition, notably SAS and SPSS, lets you choose the font and full formatting of your output tables at installation. From then on, any time you want to move output into a word processor, it’s a simple cut & paste operation. SPSS even converts R output into fully formatted tables, something no current R IDE does. Perhaps the R Consortium could pool the resources needed to develop this kind of output. If so, it would be a key contribution to its goal of speeding R’s adoption. (I do appreciate the greater power of LaTeX and the ease of use of knitr and R Markdown, but they’ll never match the widespread use of Word.)
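
For comparison, here is what the current R Markdown route to Word looks like. It works, but it is still a far cry from SPSS’s install-once, cut-and-paste simplicity.

# Contents of a minimal report.Rmd file:
#   ---
#   output: word_document
#   ---
#   ```{r}
#   knitr::kable(head(mtcars))  # kable() produces a formatted table
#   ```
# Render it to report.docx from the R console:
rmarkdown::render("report.Rmd")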

Graphical User Interface

Programming offers the greatest control over an analysis, but many researchers don’t analyze data often enough to become good programmers; many simply don’t like programming. Graphical User Interfaces (GUIs) help such people get their work done more easily. The traditional menu-based systems, such as R Commander or Deducer, make one-time work easy, but they don’t offer a way to do repetitive projects without relying on the code that non-programmers wish to avoid.

Workflow-based GUIs are also easy to use and, more importantly, they save all the steps as a flowchart. This allows you to check your work and repeat it on another data set simply by updating the data import node(s) and clicking “execute.” To take advantage of this approach, Microsoft’s Revolution R Enterprise integrates into Alteryx and KNIME, and Tibco’s Enterprise Runtime for R integrates into KNIME as well. Alteryx is a commercial package, and KNIME is free and open source on the desktop. While both have commercial partners, each can work with the standard community version of R as well.

Both packages contain many R functions that you can control with a dialog box. Both also allow R programmers to add a programming node in the middle of the workflow.  Those nodes can be shared, enabling an organization to get the most out of both their programming and non-programming analysts. Both systems need to add more R nodes to be considered general-purpose R GUIs, but they’re making fairly rapid progress on that front. In each system, it takes less than an hour to add a node to control a typical R function.
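
As a sketch of what such a node involves, here is a small example for KNIME’s R Snippet node, where the incoming table is exposed as the data frame knime.in and whatever you assign to knime.out flows on to the next node (the clustering shown is just an illustration and assumes all-numeric columns):

# Inside a KNIME R Snippet node: knime.in is the incoming table;
# knime.out continues downstream to the next node.
library(cluster)
fit <- pam(scale(knime.in), k = 3)  # partitioning around medoids
knime.out <- cbind(knime.in, Cluster = factor(fit$clustering))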

The R Consortium could develop a list of recommended steps for package developers to consider, one of which could be adding nodes to such GUIs. Given the open source nature of R, encouraging the use of the open source version of KNIME would make the most sense. That would not just speed the adoption of R; it would open R up to the large proportion of analysts who prefer not to program. For the more popular packages, the Consortium could consider using its own resources to write such nodes.

Conclusion

The creation of the R Consortium offers an intriguing opportunity to expand the use of R around the world. I’ve suggested several potential goals for the Consortium, including helping people choose packages, increasing reliability testing, rating package support levels, raising the visibility of key generic functions, adding better support for Word output, and making R more accessible through stronger GUI support. What else should the R Consortium consider? Let’s hear your ideas in the comments section below.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Acknowledgements

Thanks to Drew Schmidt and Michael Berthold for their suggestions that improved this post.

References

  1. Micah Altman (2002), A Review of JMP 4.03 With Special Attention to its Numerical Accuracy, The American Statistician, 56:1, 72-75, DOI: 10.1198/000313002753631402
  2. B. D. McCullough (1998), Assessing the Reliability of Statistical Software: Part I, The American Statistician, 52:4, 358-366
  3. B. D. McCullough (1999), Assessing the Reliability of Statistical Software: Part II, The American Statistician, 53:2, 149-159
  4. Kellie B. Keeling and Robert J. Pavur (2007), A Comparative Study of the Reliability of Nine Statistical Software Packages, Computational Statistics & Data Analysis, 51:8, 3811-3831
  5. Oluwartotimi O. Odeh, Allen M. Featherstone, and Jason S. Bergtold (2010), Reliability of Statistical Software, American Journal of Agricultural Economics, DOI: 10.1093/ajae/aaq068
  6. Jason S. Bergtold, Krishna Pokharel, and Allen Featherstone (2015), Selected Paper prepared for presentation at the 2015 Agricultural & Applied Economics Association and Western Agricultural Economics Association Annual Meeting, San Francisco, CA, July 26-28

Comments

  1. Hi Carson,

    Thanks for the link! Yep, there are lots of ways to get to Word, I’m just longing for a way that’s as easy as SPSS, i.e. I do nothing & the output is perfectly formatted.

    Cheers,
    Bob

  2. With the rise of Rcpp, R is creeping toward the SAS paradigm: one syntax for running stat commands and another, closer to the metal, for creating such commands. It would help to know, in some systematic way, which packages are (mostly) executed in C/C++; perhaps a fork of the Task Views listing only such packages. As CRAN stands, one has to look up a package to see its imports, but that, in and of itself, doesn’t tell one how Rcpp was used to build the package.

    Some might consider a package built on C/C++ more robust, since the author “must be” a better coder. Or not.

  3. Hi Robert,

    I really like that idea. Drew Schmidt showed that it’s fairly easy to calculate the percent of each package that’s coded in a particular language. It would be ideal to perform benchmarking, but that would be quite a lot of work. A shortcut would be to simply list the percent of compiled code in a package as a surrogate benchmark. Here’s Drew’s post on it:
    http://www.r-bloggers.com/how-much-of-r-is-written-in-r-part-2-contributed-packages/
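
    A rough sketch of that calculation, assuming an unpacked package source directory (‘pkg_dir’ is a hypothetical path and the file extensions are approximate):

    # Estimate the share of compiled code in a package source tree by
    # counting lines per file extension; a crude surrogate benchmark.
    compiled_share <- function(pkg_dir) {
      count_lines <- function(files) {
        sum(vapply(files, function(f) length(readLines(f, warn = FALSE)),
                   integer(1)))
      }
      r_files <- list.files(file.path(pkg_dir, "R"),
                            pattern = "\\.[Rr]$", full.names = TRUE)
      c_files <- list.files(file.path(pkg_dir, "src"),
                            pattern = "\\.(c|cc|cpp|h|f)$", full.names = TRUE)
      c_lines <- count_lines(c_files)
      c_lines / (c_lines + count_lines(r_files))
    }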

    Cheers,
    Bob

  4. Here’s a suggestion I think many R developers would like to see: a service for building packages on all architectures supported by CRAN. Having access to errors and warnings that do not occur on your native architecture would be a great asset for the developer community. Moreover, I’m pretty sure this would prevent a lot of back-and-forth communication between CRAN maintainers and package submitters (volunteer time _is_ CRAN’s most valuable resource).

  5. Regarding package longevity: by hosting a package in a repository like GitHub, if the maintainer disappears or simply loses interest, anyone can fork the repo and continue developing the package.

  6. In terms of formatted tables in Word, if you’re on Windows, the R package R2wd does a pretty decent job.

  7. Daily snapshots of R; well… it’s OK to have them, but that whole Subversion stuff is so outdated and awkward; development should switch to git… let the people / community look at the code, make their own local branches, and contribute their own changes. This would make it much easier to provide patches and enhancements. The more fine-grained the development is, the easier it is to include small changes. Branch + merge, so wonderful with git. If the cathedral asks for help, well… there is a way…

    1. Hi Oliver,

      GitHub certainly has several advantages over the way CRAN is currently implemented, but at least CRAN has checks and balances that GitHub lacks. I suspect this topic falls outside the purview of the R Consortium and is more in the hands of the R Core Team.

      At the very least, I’d prefer not to have to install devtools to load a package from GitHub.

      Cheers,
      Bob

  8. Hi Bob,

Working in the insurance industry, I’d consider integration with Excel (2010 onwards) far more important than Word. I know there are a number of Excel-related packages, but they all seem to have certain limitations. In particular, Excel PowerBI and R are still totally disconnected.

    From my point of view, there’s no need for a GUI. My colleagues have written VBA or SAS code for many years without any GUI. They would be equally happy to work with an IDE like RStudio.

On accuracy, I often hear that argument too, but I don’t see how the Consortium could achieve this goal, e.g., testing the accuracy of highly specialized actuarial packages. I also wonder what kind of guarantees SAS gives.

    Btw, I always read your articles with great interest!

    Regards
    Wolfgang

    1. Hi Wolfgang,

      I agree that integrating R into Excel would be a very good goal. I assumed that since Microsoft has already announced R’s integration into SQL Server, Excel would be next. Here’s hoping!

      Cheers,
      Bob
