Updated Comparison of R Graphical User Interfaces

I have just updated my detailed reviews of Graphical User Interfaces (GUIs) for R, so let’s compare them again. It’s not too difficult to rank them based on the number of features they offer, so let’s start there. I’m basing the counts on the number of dialog boxes in each category of four categories:

  • Ease of Use
  • General Usability
  • Graphics
  • Analytics

This is trickier data to collect than you might think. Some software has fewer menu choices, depending instead on more detailed dialog boxes. Studying every menu and dialog box is very time-consuming, but that is what I’ve tried to do. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own categories. If you decide to make your own graphs, I’d love to hear from you in the comments below.

Figure 1 shows how the various GUIs compare on the average rank of the four categories. R Commander is abbreviated Rcmdr, and R AnalyticFlow is abbreviated RAF. We see that BlueSky is in the lead with R-Instat close behind. As my detailed reviews of those two point out, they are extremely different pieces of software! Rather than spend more time on this summary plot, let’s examine the four categories separately.

Figure 1. Mean of each R GUI’s ranking of the four categories. To make this plot consistent with the others below, the larger the rank, the better.

For the category of ease-of-use, I’ve defined it mostly by how well each GUI does what GUI users are looking for: avoiding code. They get one point each for being able to install, start, and use the GUI to its maximum effect, including publication-quality output, without knowing anything about the R language itself. Figure two shows the result. JASP comes out on top here, with jamovi and BlueSky right behind.

Figure 2. The number of ease-of-use features that each GUI has.

Figure 3 shows the general usability features each GUI offers. This category is dominated by data-wrangling capabilities, where data scientists and statisticians spend most of their time. This category also includes various types of data input and output. BlueSky and R-Instat come out on top not just due to their excellent selection of data wrangling features but also due to their use of the rio package for importing and exporting files. The rio package combines the import/export capabilities of many other packages, and it is easy to use. I expect the other GUIs will eventually adopt it, raising their scores by around 40 points. JASP shows up at the bottom of this plot due to its philosophy of encouraging users to prepare the data elsewhere before importing it into JASP.

Figure 3. Number of general usability features for each GUI.

Figure 4 shows the number of graphics features offered by each GUI. R-Instat has a solid lead in this category. In fact, this underestimates R-Instat’s ability if you…

Continued…

Rexer Analytics Survey Results

Rexer Analytics has released preliminary results showing the usage of various data science tools. I’ve added the results to my continuously-updated article, The Popularity of Data Analysis Software. For your convenience, the new section is repeated below.

Surveys of Use

One way to estimate the relative popularity of data analysis software is though a survey. Rexer Analytics conducts such a survey every other year, asking a wide range of questions regarding data science (previously referred to as data mining by the survey itself.) Figure 6a shows the tools that the 1,220 respondents reported using in 2015.

Figure 6a. Analytics tools used.
Figure 6a. Analytics tools used by respondents to the Rexer Analytics Survey. In this view, each respondent was free to check multiple tools.

We see that R has a more than 2-to-1 lead over the next most popular packages, SPSS Statistics and SAS. Microsoft’s Excel Data Mining software is slightly less popular, but note that it is rarely used as the primary tool. Tableau comes next, also rarely used as the primary tool. That’s to be expected as Tableau is principally a visualization tool with minimal capabilities for advanced analytics.

The next batch of software appears at first to be all in the 15% to 20% range, but KNIME and RapidMiner are listed both in their free versions and, much further down, in their commercial versions. These data come from a “check all that apply” type of question, so if we add the two amounts, we may be over counting. However, the survey also asked,  “What one (my emphasis) data mining / analytic software package did you use most frequently in the past year?”  Using these data, I combined the free and commercial versions and plotted the top 10 packages again in figure 6b. Since other software combinations are likely, e.g. SAS and Enterprise Miner; SPSS Statistics and SPSS Modeler; etc. I combined a few others as well.

Figure 6b. The percent of survey respondents who checked each package as their primary tool.
Figure 6b. The percent of survey respondents who checked each package as their primary tool. Note that free and commercial versions of KNIME and RapidMiner are combined. Multiple tools from the same company are also combined. Only the top 10 are shown.

In this view we see R even more dominant, with over a 3-to-1 advantage compared to the software from IBM SPSS and SAS Institute. However, the overall ranking of the top three didn’t change. KNIME however rises from 9th place to 4th. RapidMiner rises as well, from 10th place to 6th. KNIME has roughly a 2-to-1 lead over RapidMiner, even though these two packages have similar capabilities and both use a workflow user interface. This may be due to RapidMiner’s move to a more commercially oriented licensing approach. For free, you can still get an older version of RapidMiner or a version of the latest release that is quite limited in the types of data files it can read. Even the academic license for RapidMiner is constrained by the fact that the company views “funded activity” (e.g. research done on government grants) the same as commercial work. The KNIME license is much more generous as the company makes its money from add-ons that increase productivity, collaboration and performance, rather than limiting analytic features or access to popular data formats.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Learning R: Live Webinar, Interactive Self-Paced, or Site Visit?

My recent blog post, Why R is Hard to Learn, must have hit a nerve as it was read by over 6,000 people in its first two days online.  If you’re using R to augment your work in SAS, SPSS or Stata or you’re considering switching to R, my workshops can help minimize many of those headaches by pointing out the commands and options that frustrate users of those packages the most. Also find out which of the thousands of R packages will give you the output you’re most used to.

My next two live webinars done in partnership with Revolution Analytics are in January:
R for SAS, SPSS and Stata Users
Managing Data with R (updated to include dplyr, broom, tidyr, etc.)
Course outlines and registration for both is here.

My R for SAS, SPSS and Stata Users workshop is also now available as a self-paced interactive video workshop at DataCamp.com.

I do site visits in partnership with RStudio.com, whose software I recommend and use in every form of my training.  If your company does its training through Xerox Learning Services, I also partner with them. For further details or to arrange a site visit, you can reach me at muenchen.bob@gmail.com.

Specifying Variables in R

R has several ways to specify which variables to use in an analysis. Some of the most frustrating errors can result from not understanding the order in which R searches for variables. This post demonstrates that order, hopefully smoothing your future use of R.

If all your variables are vectors in your workspace, using them in an analysis is easy: simply name them. For example, you could build a linear model (regression) using the lm function like this:

lm(y ~ x)

However, data frames exist for a good reason. They help organize variables and keep the values of each observation (the rows) locked together. For example, when you sort a data frame, all the rows of a data frame are moved, not just the single variable you’re sorting on. Once variables are stored in a data frame however, referring to them gets more complicated. R can include variables from multiple places (e.g. two data frames or a data frame and the workspace) so it becomes important to know your options and how R views them.

You can specify the names of both a data frame and a variable using the compound forms mydata$myvar or mydata[“myvar”]. However, that often means that you have to type the name of the data frame quite a lot.

If you use the form “with(mydata,…” then R will look in that data frame for the “short” variable names before it looks elsewhere, like in your workspace. That allows you to type the data frame name only once per function call, but in a long program you would still end up typing it a lot.

Modeling functions in R often let you specify “data = mydata” allowing you to use short variable names in formulas like “y ~ x”. The result is like the “with” function, you must type the data frame name once per function call. (SAS users take note: variables used outside of formulas will not be found with this approach!)

Finally, you can attach the data frame with “attach(mydata)”. This copies the variables into a temporary space that lets you then refer to them by their short names. This has the big advantage of allowing all the following function calls to use short variable names. Unfortunately, it has the big disadvantage of being confusing. Confusion #1 is that people feel that variables they create will go into the data frame automatically; they will not. Unless you specify a data frame using either mydata$newvar or mydata[“newvar”], new variables are created in your workspace. Confusion #2 is that R will look in your workspace before it looks at the attached versions of variables. So if variables with the same names exist there, those will be used instead. Confusion #3 is that even though detach(mydata) will reverse the process, if you run your program multiple times, you may have attached the data multiple times and detaching once does not fully undo the attached state. As confusing at that is, I use attach frequently and rarely get burned by it.

For example, with variables x and y stored in mydata (and nowhere else) you could do a linear regression model using any one of these approaches:

lm(mydata$y ~ mydata$x)

lm(mydata[“y”] ~ mydata[“x”])

with(mydata, lm(y ~ x))

lm(y ~ x, data = mydata)

attach(mydata)
lm(y ~ x)

As if that weren’t complicated enough, both x and y do not have to both be in the same data frame! The x variable could be in mydata and the y variable could be in the workspace or in an attached version of mydata or some other data frame. That would be dangerous, of course, since it would be up to you to ensure that the values of each observation match or the resulting model would be nonsense. However, this kind of flexibility can also be very useful.

With all this flexibility, it’s important to know the order in which R chooses variables. A simple example can show us the order R uses. Here I am creating four data frames whose x and y variables will have a slope that is indicated by the data frame name. For example, the variables in df10 have a slope of 10. This will make it easy for us to see which version of the variables R is using.

> y <- c(1,2,3,4,5,6,7,8,9,10)
> x <- c(1,2,5,5,5,5,5,8,9,10)
> df1    <- data.frame(x, y)     
> df10   <- data.frame(x, y = y*10  )
> df100  <- data.frame(x, y = y*100 )
> df1000 <- data.frame(x, y = y*1000)
> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Notice that I have deleted the original x and y variables so at the moment, varibles x and y exist only within the data frames. Running a regression with lm(y ~ x) will not work since R does not look into data frames unless you tell it to. Even if it did, it would have no way to know which set of x’s and y’s to use. Next I will take two different approaches to “selecting” a data frame. I attach df1 and copy the variables from df10 into the workspace.

> attach(df1)
> y <- df10$y
> x <- df10$x

Next, I do something rarely useful, calling a linear model using both “with” and “data=”. Which will dominate?

> with(df100, lm(y ~ x, data = df1000))

Call:
lm(formula = y ~ x, data = df1000)

Coefficients:
(Intercept)            x  
          0         1000

Since the slope is 1000, it’s clear that the “data=” argument was dominant. So R would look there first. If it found both x and y, it would stop looking. But if it only found one variable, it would continue to look elsewhere for the other. If the other variable where in the “with” data frame, it would then use it.

Next I’ll remove the “data” argument and see what happens.

> with(df100, lm(y ~ x))

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0          100

This time the “with” data frame was used for both variables. If variable either had not been in that data frame, R would have continued to look in the workspace and in the attached copy. But which would it use first? Next, I’m not specifying a data frame at all.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0           10

The slope of 10 tells us that it found the copies of x and y that I copied from df10 into the workspace. Let’s delete those variables and list the objects in our workspace to ensure that they’re gone.

> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Both x and y are clearly gone. So lets see if we can still use them.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0            1

We deleted x and y but we can still use them! However, we see from the slope of 1 that R has used a different pair of x and y variables. They’re the ones that were copied to my search path when I used “attach(myDf1)”. I had to remember that I had attached them. It’s this kind of confusion that makes many R users avoid using attach. Finally, I’ll detach df1 and see what happens.

> detach(df1)
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object 'y' not found

Now, even though all the data frames in our workspace contain an x and y variable, R does not look inside to find any of them. Even if it did, it would have no way of know which to choose.

We have seen that R looks in various places for variables. In order, they are: what you specify in “data=”, using “with(mydata,…”, your workspace and finally attached copies of your data frame. The most recently attached copies are the ones it will use first. I hope this will help you use R with both less typing and less confusion.

Why R is Hard to Learn

[An updated version of this article is here]

The open source R software for analytics has a reputation for being hard to learn. It certainly can be, especially for people who are already familiar with similar packages such as SAS, SPSS or Stata. Training and documentation that leverages their existing knowledge and points out where their previous knowledge is likely to mislead them can save much of frustration. This is the approach used in my books, R for SAS and SPSS Users and R for Stata Users as well as the workshops that are based on them.

Here is a list of complaints about R that I commonly hear from people learning it. In the comments section below, I’d like to hear about things that drive you crazy about R.

Misleading Function or Parameter Names (data=, sort, if)

The most difficult time people have learning R is when functions don’t do the “obvious” thing. For example when sorting data, SAS, SPSS and Stata users all use commands appropriately named “sort.” Turning to R they look for such a command and, sure enough, there’s one named exactly that. However, it does not sort data sets! Instead it sorts individual variables, which is often a very dangerous thing to do. In R, the “order” function sorts data sets and it does so in a somewhat convoluted way. However there are add-on packages that have sorting functions that work just as SAS/SPSS/Stata users would expect.

Perhaps the biggest shock comes when the new R user discovers that sorting is often not even needed by R. When other packages require sorting before they can do three common tasks:

  1. Summarizing / aggregating data
  2. Repeating an analysis for each group (“by” or “split file” processing)
  3. Merging files by key variables

R does not need to sort files before any of these tasks! So while sorting is a very helpful thing to be able to do for other reasons, R does not require it for these common situations.

Nonstandard Output

R’s output is often quite sparse. For example, when doing crosstabulation, other packages routinely provide counts, cell percents, row/column percents and even marginal counts and percents. R’s built-in table function (e.g. table(a,b)) provides only counts. The reason for this is that such sparse output can be readily used as input to further analysis. Getting a bar plot of a crosstabulation is as simple as barplot( table(a,b) ). This piecemeal approach is what allows R to dispense with separate output management systems such as SAS’ ODS or SPSS’ OMS. However there are add-on packages that provide more comprehensive output that is essentially identical to that provided by other packages.

Too Many Commands

Other statistics packages have relatively few analysis commands but each of them have many options to control their output. R’s approach is quite the opposite which takes some getting used to. For example, when doing a linear regression in SAS or SPSS you usually specify everything in advance and then see all the output at once: equation coefficients, ANOVA table, and so on. However, when you create a model in R, one command (summary) will provide the parameter estimates while another (anova) provides the ANOVA table. There is even a command “coefficients” that gets only that part of the model. So there are more commands to learn but fewer options are needed for each.

R’s commands are also consistent, working across all the modeling types that they might apply to. For example the “predict” function works the same way for all types of models that might make predictions.

Sloppy Control of Variables

When I learned R, it came as quite a shock that in a single analysis you can include variables from multiple data sets. That usually requires that the observations be in identical order in each data set. Over the years I have had countless clients come in to merge data sets that they thought had observations in the same order, but were not! It’s always safer to merge by key variables (like ID) if possible. So by enabling such analyses R seems to be asking for disaster. I still recommend merging files when possible by key variables before doing an analysis.

So why does R allow this “sloppiness”? It does so because it provides very useful flexibility. For example, might plot regression lines of variable X against variable Y for each of three groups on the same plot. Then you can add group labels directly onto the graph. This lets you avoid a legend that makes your readers look back and forth between the legend and lines. The label data would contain only three variables: the group labels and the coordinates at which you wish them to appear. That’s a data set of only 3 observations so merging that with the main data set makes little sense.

Loop-a-phobia

R has loops to control program flow, but people (especially beginners) are told to avoid them. Since loops are so critical to applying the same function to multiple variables, this seems strange. R instead uses the “apply” family of functions. You tell R to apply the function to either rows or columns. It’s a mental adjustment to make, but the result is the same.

Functions That Act Like Procedures

Many other packages, including SAS, SPSS and Stata have procedures or commands that do typical data analyses which go “down” through all the observations. They also have functions that usually do a single calculation across rows, such as taking the mean of some scores for each observation in the data set. But R has only functions and those functions can do both. How does it get away with that? Functions may have a preference to go down rows or across columns but for many functions you can use the “apply” family of functions to force then to go in either direction. So it’s true that in R, functions act like procedures and functions. Coming from other software, that’s a wild new idea.

Naming and Renaming Variables is Way Too Complicated

Often when people learn how R names and renames its variables they, well, freak out. There are many ways to name and rename variables because R stores the names as a character variable. Think of all the ways you know how to fiddle with character variables and you’ll realize that if you could use them all to name or rename variables, you have way more flexibility than the other data analysis packages. However, how long did it take you to learn all those tricks? Probably quite a while! So until someone needs that much flexibility, I recommend simply using R to read variable names from the same source as you read the data. When you need to rename them, use an add-on package that will let you do so in a style that is similar to SAS, SPSS or Stata. An example is here. You can convert to R’s built-in approach when you need more flexibility.

Inability to Analyze Multiple Variables

One of the first functions beginners typically learn is mean(X). As you might guess, it gets the mean of the X variable’s values. That’s simple enough. It also seems likely that to get the mean of two variables, you would just enter mean(X, Y). However that’s wrong because functions in R typically accept only single objects. The solution is to put those two variables into a single object such as a data frame: mean( data.frame(x,y) ). So the generalization you need to make isn’t from one variable to multiple variables, but rather from one object (a variable) to another (a data set). Since other software packages are not object oriented, this is a mental adjustment people have to make when coming to R from other packages. (Note to R gurus: I could have used colMeans but it does not make this example as clear.)

Poor Ability to Select Variable Sets

Most data analysis packages allow you to select variables that are next to one another in the data set (e.g. A–Z or A TO Z). R generally lacks this useful ability. It does have a “subset” function that allows the form A:Z, but that form works only in that function. There are many various work-arounds for this problem but most do seem rather convoluted compared to other software. Nothing’s perfect!

Too Much Complexity

People complain that R has too much complexity overall compared to other software. This comes from the fact that you can start learning software like SAS and SPSS with relatively few commands: the basic ones to read and analyze data. However when you start to become more productive you then have to learn whole new languages! To help reduce repitition in your programs you’ll need to learn the macro language. To use the output from one procedure in another, you’ll need to learn an output management system like SAS ODS or SPSS OMS. To add new capabilities you need to learn a matrix language like SAS IML, SPSS Matrix or Stata Mata. Each of these languages has its own commands and rules. There are also steps for tranferring data or parameters from one language to another. R has no need for that added complexity because it integrates all these capabilities into R itself. So it’s true that beginners have to see more complexity in R. Howevever, as they learn more about R, they begin to realize that there is actually less complexity and more power in R!

Lack of Graphical User Interface (GUI)

Like most other packages R’s full power is only accessible through programming. However unlike the others, it does not offer a standard GUI to help non-programmers do analyses. The two which are most like SAS, SPSS and Stata are R Commander and Deducer. While they offer enough analytic methods to make it through an undergraduate degree in statistics, they lack control when compared to a powerful GUI such as those used by SPSS or JMP. Worse, beginners must initially see a programming environment and then figure out how to find, install, and activate either GUI. Given that GUIs are aimed at people with fewer computer skills, this is a problem.

Conclusion

Most of the issues described above are misunderstandings caused by expecting R to work like other software that the person already knows. What examples like this have you come across?

Acknowledgements

Thanks to Patrick Burns and Tal Galili for their suggestions that improved this post.