A Comparative Review of the Rattle GUI for R

Introduction

Rattle is a popular free and open source Graphical User Interface (GUI) for the R software, one that focuses on beginners looking to point-and-click their way through data mining tasks. Such tasks are also referred to as machine learning or predictive analytics.  Rattle’s name is an acronym for “R Analytical Tool To Learn Easily.” Rattle is available on Windows, Mac, and Linux systems.

This post is one of a series of reviews which aim to help non-programmers choose the GUI that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

Figure 1. The Rattle interface with the “Data” tab chosen, showing which file I’m reading and the roles the variables will play in analyses. The role assigned to each variable is critically important. Note the all-important “Execute” button in the upper left of the screen. Nothing happens until it’s clicked.

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment, which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The Help Desks at most universities are flooded with such calls at the beginning of each semester!

The steps to install Rattle are:

  1. Install R
  2. In R, install the toolkit that Rattle is written in by executing the command: install.packages("RGtk2")
  3. Also in R, install Rattle itself by executing the command:
    install.packages("rattle", dependencies=TRUE)
    The very latest development version is available here.
    Note that while Rattle’s name is capitalized, the name of the rattle package is spelled in all lower-case letters! (The commands from steps 2 and 3 are collected into a single block after this list.)
  4. If you wish to take advantage of interactive visualization (highly recommended) then install the GGobi software from: http://www.ggobi.org/downloads/.
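
For convenience, here are the R commands from steps 2 and 3 collected into one block. Note that the quotation marks must be plain straight quotes; the curly quotes that word processors substitute will cause an error in R:

  # Run these commands at the R console:
  install.packages("RGtk2")                        # the GTK2 toolkit that Rattle is built on
  install.packages("rattle", dependencies = TRUE)  # Rattle itself (note the lower-case package name)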

 

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

Rattle was designed and programmed entirely by Graham Williams of Togaware. As a result, it has no plug-ins, but it does include a comprehensive set of data mining tools.

 

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

Rattle is run as a part of R itself, so the steps to start it begin with starting R:

  1. Start R.
  2. Load Rattle from your library by executing the command: library("rattle")
  3. Start Rattle by executing the command: rattle() (both commands are shown together below)
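
Putting steps 2 and 3 together, a Rattle session starts with these two lines; Rattle then opens in its own window, separate from the R console:

  library("rattle")   # load the rattle package from your library
  rattle()            # launch the Rattle GUI in its own window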

 

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

Rattle’s data editor is unique for a GUI in that it does not offer a way to create a data set. It lets you edit any data set you open using R’s built-in edit function, but that function offers very few features. Clicking on a variable name will cause a dialog to open, offering to change the variable’s name or to switch its type between numeric and character (see Figure 2). Rattle automatically converts variables that have fewer than 10 values into “categorical” ones; R would call these factors. You can always recode variables from numeric to categorical (or vice versa) in the “Transform” tab (see the Data Management section).
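
For readers curious about what Rattle is doing underneath, here is a rough R sketch of the same steps. The file and variable names are hypothetical, and this is not Rattle’s own generated code:

  weather <- read.csv("weather.csv")    # a hypothetical data file
  weather <- edit(weather)              # R's minimal spreadsheet-style editor, which Rattle calls for you
  # Recoding between numeric and categorical (factor), as on the Transform tab:
  weather$RainToday <- as.factor(weather$RainToday)   # numeric -> categorical
  weather$RainToday <- as.numeric(weather$RainToday)  # categorical -> underlying numeric codes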

Figure 2. Rattle uses R’s built-in edit function as its data editor. Here I clicked on the name of the variable “Rainfall” to show how you might rename it or change its data type.

 

Data Import

Since R GUIs are using R to do the work behind the scenes, they often include the ability to read a wide range of files, including SAS, SPSS, and Stata. Some, like BlueSky Statistics, also include the ability to read directly from SQL databases. Of course you can always use R code to import data from any source and then continue to analyze it using any GUI, but the point of GUIs is to avoid programming.

Rattle skips many common statistical data formats, but it includes a couple that few other GUIs offer, such as the Attribute-Relation File Format (ARFF) used by other data mining tools. It also includes “corpus,” which reads in text documents and then performs the popular tf-idf calculation to prepare them for analysis using the other, numerically-based analysis methods. (A sketch of reading two of these formats with R code follows the list below.)

On its “Data” tab, Rattle offers several formats:

  1. File: CSV
  2. File: TXT
  3. File: Excel
  4. Attribute-Relation File Format (ARFF)
  5. Open Database Connectivity (ODBC)
  6. R Dataset
  7. RData File
  8. Library
  9. Corpus (for text analysis)
  10. Script
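
As a rough illustration, here is how you might read two of these formats with plain R code rather than through the GUI. The file names are hypothetical; the foreign package, one of R’s recommended packages, provides read.arff:

  library(foreign)                          # provides read.arff()
  csv_data  <- read.csv("weather.csv")      # hypothetical CSV file
  arff_data <- read.arff("weather.arff")    # hypothetical ARFF file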

 

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, and biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard, handle only a few of these functions. Others, such as BlueSky Statistics or the R Commander, can handle all, or nearly all, of these tasks.

Rattle provides minimal data management tools. Its designer chose to focus on reading a single data set and on making the transformations that are common in data mining projects quick and easy. More complex data management tasks are left to other tools, such as SQL in a database before the data set is read in, or to R programming.

Rattle’s “Transform” tab cycles through various data management “types,” and the way it works is unusual. As you can see in Figure 3, I selected the Transform tab by clicking on it. I then held the CTRL key down to select several variables, which are highlighted in blue. If the variables had been next to one another, I could have clicked on the first one, then shift-clicked on the last to select them all. Next I chose my transformation by selecting “Recode” and then “Recenter.” Finally, I clicked the “Execute” button (or pressed F2) to complete the process, adding three new recoded variables to the data set. Original variables are never changed, and you never have the ability to choose the name of the new variable(s). A prefix is appended to the variable name(s) automatically to speed the process. In this case, my Rainfall variable was transformed into “RRC_Rainfall”. The RRC prefix stands for “Recoded, Re-Centered.”

Whenever a variable is transformed, its status in the “Data” tab switches from “Input” to “Ignore”, while the transformed version of the variable enters the data set with an “Input” role.
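
In plain R, the recentering described above amounts to a z-score. Here is a sketch using a hypothetical data frame named dat; this is not the code Rattle itself generates:

  # Recenter (z-score): subtract the mean and divide by the standard deviation
  dat$RRC_Rainfall <- as.numeric(scale(dat$Rainfall))
  # Rattle keeps the original Rainfall variable and simply marks it "Ignore" on the Data tab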

Figure 3. Rattle’s “Transform” tab with three variables selected. The “Recode” sub-tab is also selected and the “Recenter” transformation is chosen. When the “Execute” button is clicked, the newly transformed variables will be appended to the data set with a prefix indicating the type of transformation performed.

 

As easy as some transformations are, others are impossible. For example, if you had a formula to calculate recommended daily allowances of vitamins, there’s no way to apply it. Conditional transformations, those which use different formulas for different subsets of the observations (e.g. daily allowances of vitamins calculated differently for men and women), are also not possible (a code sketch of such a transformation follows the list below). Here are the available transformations:

Transform> Rescale> Normalize

  • Recenter (Z-score)
  • Scale 0 to 1
  • (Var – Median)/Mean Absolute Deviation (MAD)
  • Natural Log
  • Log 10
  • Matrix (divide all by a constant)

Transform> Impute

  • Replace missing with zeros (e.g. requesting nothing gets you nothing)
  • Mean
  • Median
  • Mode
  • Constant

Transform> Recode

  • Binning> Quantiles
  • Binning> KMeans clusters
  • Binning> Equal width intervals
  • Binning> N Equally spaced intervals
  • Indicator variables
  • Join Categorics
  • As Categoric
  • As Numeric
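
As noted above, conditional transformations are beyond what the Transform tab can do, but they take only a line or two of R. The variables and formulas here are made up purely for illustration:

  # Different formulas for different subsets of the observations:
  dat$vitamin_rda <- ifelse(dat$sex == "male",
                            0.9 * dat$weight_kg,   # hypothetical formula for men
                            0.8 * dat$weight_kg)   # hypothetical formula for women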

Continued here…

A Comparative Review of the BlueSky Statistics GUI for R

Introduction

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available which includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment, which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as the R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The Help Desks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

 

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky’s capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

 

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

 

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

Figure 1. The main BlueSky Application screen.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=strongly disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

To add rows, the Data tab is clearly labeled, “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end.”

To enter factor data, it’s best to leave it numeric, such as 1 or 2 for male and female, then set the labels (which are called values in SPSS terminology) afterwards. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developers’ future plans include the automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside, while displaying labels that may or may not be the same as the numbers they represent.

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to a date. When I reported this limitation to the developers, they said they would add date support to the “Variables” metadata tab so you could set a variable to be a date before entering the data.
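
In plain R, the character-to-date conversion described above is a one-line step. The variable name and date format here are assumptions for illustration, and BlueSky’s generated code may differ:

  # Convert character strings such as "07/15/2015" into R's Date class:
  dat$visit_date <- as.Date(dat$visit_date, format = "%m/%d/%Y")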

If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can switch between data sets simply by clicking on a tab, and that window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.

Figure 3. Output window showing standard journal-style tables. The syntax editor has been opened and is shown on the right side.

 

Data Import

The open source version of BlueSky supports the following file formats, all located under “File> Open”:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • Excel (both older .xls and newer .xlsx file types)
  • dBase DBF files (.dbf)
  • SPSS (.sav)
  • SAS binary files (sas7bdat)
  • Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

  • Microsoft Access
  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • SQLite

 

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, and biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard, handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools (a plain-R sketch of one such many-variables-at-once operation follows the list). Not shown is an extensive set of character and date/time functions which appear under “Compute.”

  1. Missing Values
  2. Compute
  3. Bin Numeric Variables
  4. Recode (able to recode many at once)
  5. Make Factor Variable (able to convert many at once)
  6. Transpose
  7. Transform (able to transform many at once)
  8. Sample Dataset
  9. Delete Variables
  10. Standardize Variables (able to standardize many at once)
  11. Aggregate (outputs results to a new dataset)
  12. Aggregate (outputs results to a printed table)
  13. Subset (outputs to a new dataset)
  14. Subset (outputs results to a printed table)
  15. Merge Datasets
  16. Sort (outputs results to a new dataset)
  17. Sort (outputs results to a printed table)
  18. Reload Dataset from File
  19. Refresh Grid
  20. Concatenate Multiple Variables (handling missing values)
  21. Legacy (does same things but using base R code)
  22. Reshape (long to wide)
  23. Reshape (wide to long)
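
To give a feel for what menu items such as Standardize Variables save you from writing, here is a plain-R sketch that standardizes several variables at once; the data frame and variable names are hypothetical:

  vars <- c("age", "income", "score")    # hypothetical variables to standardize
  dat[paste0("z_", vars)] <- lapply(dat[vars], function(x) as.numeric(scale(x)))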

Continued here…

Using Discussion Forum Activity to Estimate Analytics Software Market Share

I’m finally getting around to overhauling the Discussion Forum Activity section of The Popularity of Data Analysis Software. To save you the trouble of reading all 43 pages, I’m posting just this section below.

Discussion Forum Activity

Another way to measure software popularity is to see how many people are helping one another use each package or language. While such data is readily available, it too has its problems. Menu-driven software like SPSS or workflow-driven software such as KNIME is quite easy to use and tends to generate fewer questions. Software controlled by programming requires the memorization of many commands and so tends to require more support. Even within languages, some are harder to use than others, generating more questions (see Why R is Hard to Learn).

Another problem with this type of data is that there are many places to ask questions and each has its own focus. Some are interested in a classical statistics perspective while others have a broad view of software as general-purpose programming languages. In recent years, companies have set up support sites within their main corporate web site, further splintering the places you can go to get help. Usage data for such sites is not readily available.

Another problem is that it’s not as easy to use logic to focus in on specific types of questions as it was with the data from job advertisements and scholarly articles discussed earlier. It’s also not easy to get the data across time to allow us to study trends.  Finally, the things such sites measure include: software group members (a.k.a. followers), individual topics (a.k.a. questions or threads), and total comments across all topics (a.k.a. total posts). This makes combining counts across sites problematic.

Two of the biggest sites used to discuss software are LinkedIn and Quora. They both display the number of people who follow each software topic, so combining their figures makes sense. However, since neither site focuses on analytics, followers of general-purpose languages like Java, MATLAB, Python, or variants of C would include many non-analytics users, so I have not collected data on those languages. The results of data collected on 10/17/2015 are shown here:

Figure: Number of people following each software topic on LinkedIn and Quora combined, 10/17/2015.

We see that R is the dominant software and that moving down through SAS, SPSS, and Stata results in a loss of roughly half the number of people in each step. Lavastorm follows Stata, but I find it odd that there was absolutely zero discussion of Lavastorm on Quora. The last bar that you can even see on this plot is the 62 people who follow Minitab. All the ones below that have tiny audiences of fewer than 10.

Next let’s examine two sites that focus only on statistical questions: Talk Stats and Cross Validated. They both report the number of questions (a.k.a. threads) for a given piece of software, allowing me to total their counts:

Figure: Number of questions about each software on Cross Validated and Talk Stats combined.

We see that R has a 4-to-1 lead over the next most popular package, SPSS. Stata comes in at 3rd place, followed by SAS. SAS’s fourth-place showing may be due to its strength in data management and report writing, which are not the types of questions these two sites focus on. Although MATLAB and Python are general-purpose languages, I include them here because the questions on these sites are within the realm of analytics. Note that I collected data on as many packages as were shown in the previous graph, but those not shown have a count of zero. Julia appears to have a count of zero due to the scale of the graph, but it actually had 5 questions on Cross Validated.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Rexer Analytics Survey Results

Rexer Analytics has released preliminary results showing the usage of various data science tools. I’ve added the results to my continuously-updated article, The Popularity of Data Analysis Software. For your convenience, the new section is repeated below.

Surveys of Use

One way to estimate the relative popularity of data analysis software is through a survey. Rexer Analytics conducts such a survey every other year, asking a wide range of questions regarding data science (previously referred to as data mining by the survey itself). Figure 6a shows the tools that the 1,220 respondents reported using in 2015.

Figure 6a. Analytics tools used by respondents to the Rexer Analytics Survey. In this view, each respondent was free to check multiple tools.

We see that R has a more than 2-to-1 lead over the next most popular packages, SPSS Statistics and SAS. Microsoft’s Excel Data Mining software is slightly less popular, but note that it is rarely used as the primary tool. Tableau comes next, also rarely used as the primary tool. That’s to be expected as Tableau is principally a visualization tool with minimal capabilities for advanced analytics.

The next batch of software appears at first to be all in the 15% to 20% range, but KNIME and RapidMiner are listed both in their free versions and, much further down, in their commercial versions. These data come from a “check all that apply” type of question, so if we add the two amounts, we may be overcounting. However, the survey also asked, “What one (my emphasis) data mining / analytic software package did you use most frequently in the past year?” Using these data, I combined the free and commercial versions and plotted the top 10 packages again in Figure 6b. Since other combinations from the same company are likely (e.g. SAS and Enterprise Miner; SPSS Statistics and SPSS Modeler), I combined a few others as well.

Figure 6b. The percent of survey respondents who checked each package as their primary tool. Note that free and commercial versions of KNIME and RapidMiner are combined. Multiple tools from the same company are also combined. Only the top 10 are shown.

In this view we see R as even more dominant, with over a 3-to-1 advantage compared to the software from IBM SPSS and SAS Institute. However, the overall ranking of the top three didn’t change. KNIME, though, rises from 9th place to 4th. RapidMiner rises as well, from 10th place to 6th. KNIME has roughly a 2-to-1 lead over RapidMiner, even though these two packages have similar capabilities and both use a workflow user interface. This may be due to RapidMiner’s move to a more commercially oriented licensing approach. For free, you can still get an older version of RapidMiner or a version of the latest release that is quite limited in the types of data files it can read. Even the academic license for RapidMiner is constrained by the fact that the company views “funded activity” (e.g. research done on government grants) the same as commercial work. The KNIME license is much more generous, as the company makes its money from add-ons that increase productivity, collaboration, and performance, rather than limiting analytic features or access to popular data formats.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Goals for the New R Consortium

by Bob Muenchen

The recently-created R Consortium consists of companies that are deeply involved in R such as RStudio, Microsoft/Revolution Analytics, Tibco, and others. The Consortium’s goals include advancing R’s worldwide promotion and support, encouraging user adoption, and improving documentation and tools. Those are admirable goals and below I suggest a few specific examples that the consortium might consider tackling.


As I work with various organizations to help them consider migrating to R, common concerns are often raised. With thousands of packages to choose from, where do I start? Do packages go through any reliability testing? What if I start using a package and its developer abandons it?  These, and others, are valid concerns that the R Consortium could address.

Choosing Packages

New R users face a daunting selection of thousands of packages. Some guidance is provided by CRAN’s Task Views. In R’s early years, this area was quite helpful in narrowing down a package search. However, R’s success has decreased the usefulness of Task Views. For example, say a professor asks a grad student to look into doing a cluster analysis. In SAS, she’ll have to choose among seven procedures. When considering the Task View on the subject, she’ll be presented with 105 choices in six categories!  The greater selection is one of R’s strengths, but to encourage the adoption of R by a wider community it would be helpful to list the popularity of each package. The more popular packages are likely to be the most useful.

R functions are integrated into other software such as Alteryx, IBM SPSS Statistics, KNIME, and RapidMiner. Some are also called from R user interfaces such as Deducer, R Commander, and Rattle. Within R, some packages depend on others, adding another vote of confidence. The R Consortium could help R users by documenting these various measures of popularity, perhaps creating an overall composite score.

Accuracy

People often ask how they can trust the accuracy (or reliability) of software that is written by a loosely knit group of volunteers, when there have even been notable lapses in the accuracy of commercial software developed by corporate teams [1]. Base R and its “recommended packages” are very well tested, and the details of the procedures are documented in The R Software Development Life Cycle. That set of software is substantial, the equivalent of Base SAS + GRAPH + STAT + ETS + IML + Enterprise Miner (excluding GUIs, Structural Equation Modeling, and Multiple Imputation, which are in add-on packages). Compared to SPSS, it’s the rough equivalent to IBM SPSS Base + Statistics + Advanced Stat. + Regression + Forecasting + Decision Trees + Neural Networks + Bootstrapping.

While that set is very capable, it still leaves one wondering about all the add-on packages. Performing accuracy tests is very time-consuming work [2-5], and even changing the options of the same routine can affect accuracy [6]. Increasing the confidence that potential users have in R’s accuracy would help to increase the use of the software, one of the Consortium’s goals. So I suggest that the Consortium consider ways to increase the reliability testing of functions that are outside the main R packages.

Given the vast number of R packages available, it would be impossible for the Consortium to perform such testing on all of them. However, for widely used packages, it might behoove the Consortium to use its resources to develop such tests itself. A web page that referenced Consortium testing, as well as testing from any other source, would be helpful.

Package Longevity

If enough of a package’s developers got bored and moved on or, more dramatically, were hit by the proverbial bus, development would halt. Base R plus the recommended packages has the whole R Development Core Team backing it up. Other packages are written by employees of companies. In such cases, it is unclear whether the packages are supported by the company or by the individual developer(s).

Using the citation function will list a package’s developers. The more there are, the better chance there is of someone taking over if the lead developer moves on. The Consortium could develop a rating system that would provide guidance along these lines. Nothing lasts forever, but knowing the support level a package has would be of great help when choosing which to use.
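
For example, a quick check of who stands behind a package can be done with base R functions (output omitted):

  citation("rattle")                     # lists the package's authors and preferred citation
  packageDescription("rattle")$Author    # the Author field from the package's DESCRIPTION file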

Encourage Support and Use of Key Generic Functions

Some fairly new generic functions play a key role in making R easier to use. For example, David Robinson’s broom package contains functions that translate the output of modeling functions from list form into data frames, making output management much easier. Other packages, including David Dahl’s xtable and Philip Leifeld’s texreg, do a similar translation to present the output in nicely formatted forms for publishing. Those developers have made major contributions to R by writing all the methods themselves. The R Consortium could develop a list of such functions and encourage other developers to add methods to them, when appropriate. Such widely applicable functions could also benefit from having the R Consortium support their development, assuring longer package longevity and wider use.
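
As a brief illustration of the kind of translation broom performs, here is a sketch using R’s built-in mtcars data:

  library(broom)
  fit <- lm(mpg ~ wt + hp, data = mtcars)
  tidy(fit)     # the coefficient table as a data frame
  glance(fit)   # a one-row data frame of model summaries (R-squared, AIC, etc.)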

Output to Microsoft Word

R has the ability to create beautiful output in almost any format you would like, but it takes additional work. Its competition, notably SAS and SPSS, lets you choose the font and full formatting of your output tables at installation. From then on, any time you want to save output to a word processor, it’s a simple cut & paste operation. SPSS even converts R output into fully formatted tables, unlike any current R IDE. Perhaps the R Consortium could pool the resources needed to develop this kind of output. If so, it would be a key aspect of its goal of speeding R’s adoption. (I do appreciate the greater power of LaTeX and the ease of use of knitr and R Markdown, but they’ll never match the widespread use of Word.)

Graphical User Interface

Programming offers the greatest control over an analysis, but many researchers don’t analyze data often enough to become good programmers; many simply don’t like programming. Graphical User Interfaces (GUIs) help such people get their work done more easily. The traditional menu-based systems, such as R Commander or Deducer, make one-time work easy, but they don’t offer a way to do repetitive projects without relying on the code that non-programmers wish to avoid.

Workflow-based GUIs are also easy to use and, more importantly, they save all the steps as a flowchart. This allows you to check your work and repeat it on another data set simply by updating the data import node(s) and clicking “execute.” To take advantage of this approach, Microsoft’s Revolution R Enterprise integrates into Alteryx and KNIME, and Tibco’s Enterprise Runtime for R integrates into KNIME as well. Alteryx is a commercial package, and KNIME is free and open source on the desktop. While both have commercial partners, each can work with the standard community version of R as well.

Both packages contain many R functions that you can control with a dialog box. Both also allow R programmers to add a programming node in the middle of the workflow.  Those nodes can be shared, enabling an organization to get the most out of both their programming and non-programming analysts. Both systems need to add more R nodes to be considered general-purpose R GUIs, but they’re making fairly rapid progress on that front. In each system, it takes less than an hour to add a node to control a typical R function.

The R Consortium could develop a list of recommended steps for developers to consider. One of these steps could be adding nodes to such GUIs. Given the open source nature of R, encouraging the use of the open source version of KNIME would make the most sense. That would not just speed the adoption of R, it would enable its adoption by the large proportion of analysts who prefer not to program. For the more popular packages, the Consortium could consider using their own resources to write such nodes.

Conclusion

The creation of the R Consortium offers an intriguing opportunity to expand the use of R around the world. I’ve suggested several potential goals for the Consortium, including helping people choose packages, increasing reliability testing, rating package support levels, increasing the visibility of key generic functions, adding support for Word output, and making R more accessible through stronger GUI support. What else should the R Consortium consider? Let’s hear your ideas in the comments section below.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Acknowledgements

Thanks to Drew Schmidt and Michael Berthold for their suggestions that improved this post.

References

  1. Micah Altman (2002), A Review of JMP 4.03 With Special Attention to its Numerical Accuracy, The American Statistician, 56:1, 72-75, DOI: 10.1198/000313002753631402
  2. B. D. McCullough (1998), Assessing the Reliability of Statistical Software: Part I, The American Statistician, 52:4, 358-366
  3. B. D. McCullough (1999), Assessing the Reliability of Statistical Software: Part II, The American Statistician, 53:2, 149-159
  4. Kellie B. Keeling, Robert J. Pavur (2007), A comparative study of the reliability of nine statistical software packages, Computational Statistics & Data Analysis, Vol. 51, Issue 8, pp. 3811-3831
  5. Oluwartotimi O. Odeh, Allen M. Featherstone and Jason S. Bergtold (2010), Reliability of Statistical Software, American Journal of Agricultural Economics, doi: 10.1093/ajae/aaq068
  6. Jason S. Bergtold, Krishna Pokharel and Allen Featherstone (2015), Selected Paper prepared for presentation at the 2015 Agricultural & Applied Economics Association and Western Agricultural Economics Association Annual Meeting, San Francisco, CA, July 26-28

Free Webinar: Intro to SparkR

Are you interested in combining the power of R and Spark? An “Intro to SparkR” webinar will take place on July 15, 2015 at 10 am California time. Everyone is welcome to attend.

Agenda:
– What is SparkR?
– Recent improvements to SparkR
– SparkR Roadmap
– Live Demo
– Q & A

Speaker:
Shivaram Venkataraman, Co-author of SparkR

Duration: 45-60 minutes

Cost: $0

Location: Internet

Registration:
https://attendee.gotowebinar.com/register/4761879673365920770

Estimating Analytics Software Market Share by Counting Books

Below is the latest update to The Popularity of Data Analysis Software.

Books

The number of books published on each software package or language reflects its relative popularity. Amazon.com offers an advanced search method which works well for all the software except R and the general-purpose languages such as Java, C, and MATLAB. I did not find a way to easily search for books on analytics that used such general purpose languages, so I’ve excluded them in this section.

The Amazon.com advanced search configuration that I used was (using SAS as an example):

Title: SAS -excerpt -chapter -changes -articles 
Subject: Computers & Technology
Condition: New
Format: All formats
Publication Date: After January, 2000

The “title” parameter allowed me to focus the search on books that included the software names in their titles. Other books may use a particular software in their examples, but they’re impossible to search for easily.  SAS has many manuals for sale as individual chapters or excerpts. They contain “chapter” or “excerpt” in their title so I excluded them using the minus sign, e.g. “-excerpt”. SAS also has short “changes and enhancements” booklets that the developers of other packages release only in the form of flyers and/or web pages, so I excluded “changes” as well. Some software listed brief “articles” which I also excluded. I did the search on June 1, 2015, and I excluded excerpts, chapters, changes, and articles from all searches.

“R” is a difficult term to search for since it’s used in book titles to indicate Registered Trademark as in “SAS(R)”. Therefore I verified all the R books manually.

The results are shown in Table 1, where it’s clear that a very small number of analytics software packages dominate the world of book publishing. SAS has a huge lead with 576 titles, followed by SPSS with 339 and R with 240. SAS and SPSS both have many versions of the same book or manual still for sale, so their numbers are both inflated as a result. JMP and Hadoop both had fewer than half of R’s count, and then Minitab and Enterprise Miner had fewer than half again as many. Although I obtained counts on all 27 of the domain-specific (i.e. not general-purpose) analytics software packages or languages shown in Figure 2a, I cut the table off at software that had 8 or fewer books to save space.

Software        Number of Books 
SAS                  576
SPSS Statistics      339
R                    240    [Corrected from: 172]
JMP                   97
Hadoop                89
Stata                 62
Minitab               33
Enterprise Miner      32

Table 1. The number of books whose titles contain the name of each software package.

[Correction: Thanks to encouragement from Bernhard Lehnert (see comments below) the count for R has been corrected from 172 to the more accurate 240.]

R #1 by Wide Margin in Latest KDnuggets Poll

The results of the latest KDnuggets Poll on software for Analytics, Big Data and Data Mining are out, and R has moved into the #1 position by a wide margin. I’ve updated the Surveys of Use section of The Popularity of Data Analysis Software to include a subset of those results, which I include here:

…The results of a similar poll done by the KDnuggets.com web site in May of 2015 are shown in Figure 6b. This one shows R in first place with 46.9% of users reporting having used it for a real project. RapidMiner, SQL, and Python follow quite a bit lower with around 30% of users each. Then at around 20% are Excel, KNIME, and Hadoop. It’s interesting to see what has happened to two very similar tools, RapidMiner and KNIME. Both used to be free and open source. RapidMiner then adopted a commercial model, with an older version still free. KNIME kept its desktop version free and, likely as a result, its use has more than tripled over the last three years. SAS Enterprise Miner uses a very similar workflow interface, and its reported use, while low, has almost doubled over the last three years. Figure 6b only shows those packages that have at least 5% market share. KDnuggets’ original graph and detailed analysis are here.

Figure 6b. Percent of respondents that used each software in KDnuggets’ 2015 poll. Only software with at least 5% market share is shown. The “% alone” figure is the percent of each tool’s users who used only that tool. For example, only 3.6% of R users have used only R, while 13.7% of RapidMiner users indicated they used that tool alone. Years are color coded, with 2015, 2014, and 2013 from top to bottom.

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it!  For students & academics, it’s $9. I also do R training on-site.

R Now Contains 150 Times as Many Commands as SAS

by Bob Muenchen

In my ongoing quest to analyze the world of analytics, I’ve updated the Growth in Capability section of The Popularity of Data Analysis Software. To save you the trouble of foraging through that tome, I’ve pasted it below.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/, and I collected the data for later versions following his method.

Figure 9 shows the number of R packages on CRAN for the last version released in each year. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995). The right-most point is for version 3.1.2, the last version released in late 2014.

Figure 9. Number of R packages available on its main distribution site for the last version released in each year.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2014, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2014 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course, while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4, and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Figure 9. A program run on 5/22/2015 counted 8,954 R packages at all major repositories, 6,663 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates of CRAN packages that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 34.4% higher on the y-axis than the one shown in Figure 9. Therefore, the estimated total growth in R functions for 2014 was approximately 27,642 * 1.344, or about 37,151.

As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 20.37 functions per package. Since a program run on 5/22/2015 counted 8,954 R packages at all major repositories except GitHub, on that date there were approximately 182,393 total functions in R. In total, R has over 150 times as many commands as SAS.
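
The package counts above can be reproduced approximately with a couple of lines of R. This is only a sketch: the total changes daily, and the 20.37 functions-per-package average comes from the Rdocumentation site as noted above:

  # Count the packages currently available on CRAN:
  cran <- available.packages(contriburl = contrib.url("https://cran.r-project.org"))
  nrow(cran)            # number of CRAN packages at the moment you run this
  nrow(cran) * 20.37    # rough estimate of the number of CRAN functions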

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it!  For students & academics, it’s $9. I also do R training on-site.

I’ve Been Replaced by an Analytics Robot

It was only a few years ago when the N.Y. Times declared my job “sexy”. My old job title of statistician had sounded dull and stodgy, but then it became filled with exciting jargon: I’m a data scientist doing predictive analytics with (occasionally) big data. Three hot buzzwords in a single job description! However, in recent years, the powerful technology that has made my job so buzzworthy has me contemplating the future of the field. Computer programs that automatically generate complex models are becoming commonplace. Rob Hyndman’s forecast package for R, SAS Institute’s Forecast Studio, and IBM’s SPSS Forecasting offer the ability to generate forecasts that used to require years of training to develop. Similar tools are now available for other types of models as well.

Countless other careers have been eliminated due to new technology. The United States once had over 70% of its population employed in farming; today fewer than 2% are farmers. Things change, and people move on to other careers. The KDnuggets web site recently asked its readers, “When will most expert-level Predictive Analytics/Data Science tasks – currently done by human Data Scientists – be automated?” Fifty-one percent of the respondents – most of them data scientists themselves – estimated that this would happen within 10 years. Not all the respondents had such a dismal view though; 19% said that this would never happen.

My brain being analyzed by the machine that replaced my brain! (Photography by Mike O’Neil)

If you had asked me in 1980 what would be the very last part of my job to be eliminated through automation, I probably would have said: brain wave analysis. It had far more steps involved than any other type of work I did. We were measuring the electrical activity of many parts of the brain, at many frequencies, thousands of times per second. An analysis that simply compared two groups would take many weeks of full-time work. Surprisingly, this was the first part of my job to be eliminated. However, our statistical consulting team supports many different departments, so I didn’t really notice when work stopped arriving from the EEG Lab. Years later I got a call from the new lab director offering to introduce me to my replacement: a “robot” named LORETA.

When I visited the lab, I was outfitted with the usual “bathing cap” full of electrodes. EEG paste (essentially K-Y jelly) was squirted into a hole in each electrode to ensure a good contact and the machine began recording my brain waves. I used bio-feedback to generate alpha waves which made a car go around a track in a simple video game. Your brain creates alpha waves when you get into a very relaxed, meditative state. Moments after I finished, LORETA had already analyzed my brain waves. “She” had done several weeks of analysis in just a few moments.

So that part of my career ended years ago, but I didn’t really notice it at the time. I was too busy using the time LORETA freed up to learn image analysis using ImageJ, text mining using WordStat and SAS Text Miner, and an endless variety of tasks using the amazing R language. I’ve never had a moment when there wasn’t plenty of interesting new work to do.

There’s another aspect to my field that’s easy to overlook. When I began my career, 90% of the time was spent “battling” computers. They were incredibly difficult to operate. Today someone may send you a data file and you’ll be able to see the data moments after receiving it. In 1980 data arrived on tapes, and every computer manufacturer used a different tape format, each in numerous incompatible variations. Unless you had a copy of the program that created a tape, it might take days of tedious programming just to get the data off of it. Even asking the computer to run a program required error-prone Job Control Language. So from that perspective, easier-to-use computing technology has already eliminated 90% of what my job used to be. It wasn’t the interesting part of the job, so it was a change for the better.

Will the burgeoning field of data science eventually put itself out of business by developing a LORETA for every problem that needs to be solved? Will we just be letting our Star-Trek-class computers and robots do our work for us while we lounge around self-actualizing? Perhaps some day, but I doubt it will happen any time soon!

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it!  For students & academics, it’s $9. I also do R training on-site.