What R Has Been Missing

While R has more methods than any other analytics software, it has been missing a crucial feature found in most other packages. SPSS Modeler had it first, way back when they still called it Clementine. Then SAS Institute realized how crucial it was to productivity and added it to Enterprise Miner. As its reputation spread, it was added to RapidMiner, Knime, Statistica, Weka, and others. An early valiant attempt was made to add it to R. What is it? It’s the flowchart-style graphical user interface. And it will soon be available in R at last.

While menu-driven interfaces such as R Commander, Deducer or SPSS are somewhat easier to learn, the flowchart interface has two important advantages. First, you can often get a grasp of the big picture as you see steps such as separate files merging into one, or several analyses coming out of a particular data set (see figure). Second, and more important, you have a precise record of every step in your analysis. This allows you to repeat an analysis simply by changing the data inputs. In contrast, menu-driven interfaces require that you switch to the programs that they create in the background if you need to automatically re-run many previous steps. That’s fine if you’re a programmer, but if you were a good programmer, you probably would not have been using that type of interface in the first place!

Figure: Screenshot of Alteryx’ flowchart user interface, soon to be added to Revolution R Enterprise.

This week Revolution Analytics and Alteryx announced that future versions of Revolution R Enterprise will include Alteryx’ flowchart-style graphical user interface. Alteryx has traditionally focused on the analysis of spatial data, only adding predictive analytics in 2012 (skip 37 minutes into this presentation). This partnership will also allow them to add Revolution’s big data features to various Alteryx products. Both companies are likely to get a significant boost in sales as a result.

While I expect both companies will benefit from this partnership, they could do much better. How? By making the Alteryx interface available for the community (free) version of R. If most R users were familiar with this interface, they would be much more likely to choose Alteryx’ tools when they needed them, instead of a competitor’s. When people needed big data tools for R, they’d be more likely to turn to Revolution Analytics. I am convinced that as great as R’s success has been, it could be greater still with a top-quality flowchart user interface that was freely available to all R users. Given the great advantages that this type of interface offers, it’s just a matter of time until a free version appears. The only question is: who will offer it?

[Update: It turns out that Alteryx is already offering a free version that works with the community version of R! See the comment from Dan Putler, product manager and one of the primary developers of Alteryx’s R-based predictive analytics and business data mining tools. I’ll be trying this out and will report my experiences in a future blog post.]

35 thoughts on “What R Has Been Missing”

  1. I look forward to seeing this function. I have also noticed that less is being offered as “free” and more is put into the commercial areas. This is expected, but say good-bye to open-source and free! Have you seen the release of 9.4 SAS and the tools? Enterprise Miner in SAS is fabulous and the response displays that. R is going to have to make quite the geometric leap to compete there and other aspects. I enjoy these discussions very much and the competition is good for all of us.

    1. Hi Mark,

      I haven’t seen much of 9.4 yet, but the new high performance procedures in SAS might finally open HPC to the masses. Well, the masses of SAS programmers at least. Enterprise Miner has a good interface, but I wish they’d hire an artist or two to make things look better. SPSS Modeler has a beautiful design. I definitely agree: competition helps us all.

      Cheers,
      Bob

  2. I use R every day. I have used SAS on a daily basis and I’ve also used Enterprise Miner. I’ve never felt remotely as productive with a GUI as I do by coding. It’s pretty simple: using the mouse is relatively slow compared to the keyboard. There’s no way in hell I’d hire a data scientist that relied primarily on a GUI.

    1. Hi John,

      Thanks for pointing that out. I didn’t mention AnalyticFlow as I’m under the impression that it’s a flowchart that helps you program in a more organized manner. All the other flowchart GUIs I mentioned offer programming as an option, but their main mission is to help you avoid programming altogether.

      Cheers,
      Bob

  3. This is very exciting news. Although I enjoy seeing inside the black box, there are times when having a GUI would really help me see the big picture instead of having to rely on the comments I’ve put in front of each set of code.

    1. Hi Raphael,

      I should have mentioned that you can still see the code that the Alteryx interface writes for you if you like. Then knowing the function calls, you can dig into the code of the R function itself. So full knowledge is at your fingertips, if you have time to dig into it.

      Cheers,
      Bob

  4. I hope it’s more thoughtfully implemented than things like Rapid Miner and some other GUI-based tools I’ve tried.

    Nothing more frustrating than having a file-read node that outputs 8 attributes — you can see this by mousing over the node’s output — but which gives you an error at the next node, saying that it needs to have one or more attributes to work.

    Maybe it needs particular names for attributes, or particular attribute types? Maybe it’s using “attributes” with two different meanings? Maybe there are special “attributes-only” connector lines?

    I know that these kinds of interfaces can be incredibly powerful: I’ve used them with Shake and Nuke (two high-end film/video compositing programs). I’m just a little skeptical in the data field.

    1. Hi Wayne,

      You make a good point; the integration between R and the GUI needs to be tight. This problem exists in R already and the GUI should not add to it. For example, one of the most fundamental and useful tools missing from base R is the ability to add labels to variables. The Hmisc package has functions that allow you to add variable labels to a vector and to display those labels in output from Hmisc functions. However, the vector will then have two classes: numeric and labeled. The barplot function chokes at the double-class aspect and generates an error message. I’ll join you in hoping that Alteryx’ GUI doesn’t add any bugs of its own to the process flow.
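
The double-class situation described above can be sketched in base R without installing Hmisc. This is only an illustration under stated assumptions: the class name `labelled` and the `label` attribute imitate what a labelling package might attach, the data are made up, and whether a given plotting function fails on such a vector depends on the R and package versions in use.

```r
# Hypothetical data: a plain numeric vector.
sales <- c(3, 5, 2)
class(sales)               # a single class

# Mimic a variable label by hand (assumption: this imitates, but is not, Hmisc).
labelled_sales <- sales
attr(labelled_sales, "label") <- "Units sold"
class(labelled_sales) <- c("labelled", class(sales))
class(labelled_sales)      # now two classes

# unclass() drops the extra class but keeps the label attribute,
# a possible workaround before calling a function that rejects the extra class.
plain_sales <- unclass(labelled_sales)
class(plain_sales)
attr(plain_sales, "label")
```

A GUI node that passes such vectors between steps would need to handle both forms, which is the kind of tight integration being asked for here.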

      Cheers,
      Bob

  5. Hi Bob,

    Thanks for the mention and your thoughts on Alteryx. In regards to how we at Alteryx work with and interact with R:

    1. Alteryx does integrate with open source R (and has for just over a year and a half)

    2. We will continue to support open source R integration along with Revolution R Enterprise integration going forward. Our partnership with Revolution Analytics allows us to quickly scale R to large data volumes using their Revo ScaleR technology, and allows us to take advantage of their “compute context” approach for doing within database analytics. Here is a demo of how we work with Revolution – http://www.youtube.com/watch?v=GRcyTHu5RP8

    3. There is a free (as in beer) version of Alteryx available (Alteryx Project Edition) that includes integration with open source R, along with a program to make Alteryx software freely available to academic institutions through our Educational Grant Program. You can download the Project Edition here: http://www.alteryx.com/download?src=sc105

    In addition to easy to use access to R, Alteryx offers a comprehensive set of tools for data manipulation, cleansing, and blending; and can access data from a large number of sources (ranging from traditional file formats, geospatial formats, Hadoop, Teradata, and SQL databases).

    I think Alteryx’s ability to allow non-programmers to handle data from disparate sources in a fast and efficient way will be as important (if not more important) as providing an easy to use interface to R. A series of much more recent demonstration videos of Alteryx’s capabilities is available. Two of the demos (A/B Testing and Market Basket Analysis) make use of R-based tools (with all of our R-based tools indicated by the use of an “R” in the lower left-hand corner of their icon). The demo videos on this page also show how Alteryx can be used with Tableau, including visualizing R-based results in Tableau. Here is the link: http://www.alteryx.com/solutions/analytic-solutions/tableau#demo-gallery

    While Alteryx is a commercial product, we view all of our R-related work as open source. All of the R code we have written is readily accessible by the user, and can be easily altered by the user. In addition, users can develop their own Alteryx tools, allowing new R-based functionality to be brought into the Alteryx interface (Alteryx provides a set of interface development tools that is among the easiest I’ve worked with for creating GUIs for R). Plans are also in place to make what we think will be important contributions to the open source R community. At Alteryx, we believe that the future of software development lies in a mixed open source / closed source model. We think there is a real need to move away from black box analytics to transparent box analytics.

    Dan

    1. Hi Dan,

      That’s great news! I had tried your free version briefly but I didn’t see a way to get to R with it. It’s time to dive in more deeply. It would be ideal if there were two sets of analysis nodes for each type of analysis: one that calls the community version and another that calls the Revolution Analytics version. If the nodes were controlled as identically as possible, it would be very easy to transition between the two. As fast as R is growing, we’ll never have nodes for everything, but if anyone can create a new node to go with the R packages they develop, you could have an extensive library of them fairly quickly.

      Cheers,
      Bob

      1. I believe the free version is only good for 15 data runs. To be more specific, any module that has an “output” that could be a file or “report” is cut off after 15 runs. Since Linear Regression has outputs it is cut off fairly quickly, so it isn’t actually anything more than a very short trial. And as you mentioned… you burn through a lot of the runs just getting R set up and integrated within Alteryx.

    1. Hi Wes,

      Yes, I covered that in my book, R for SAS and SPSS Users (p. 51-52). The “early valiant attempt” link in my blog post points to their old web site, but progress on it seems to have stopped. The old project web site points to the “new” one, which is dead. Also, the Google Group, RedR (https://groups.google.com/forum/#!forum/red-r) hasn’t had a comment on it since 2010. That’s a shame since it got off to such a great start. I really liked how you could pass data through a scatter plot node, select a cluster with your mouse, and only that cluster would pass through to the next step.

      Cheers,
      Bob

      1. Now that you mention it, I think it was your post that led me to look at red-r in the first place. Sorry to hear it petered out, but at least someone is taking up the challenge.

  6. I definitely agree with you. Many great products attract consumers with a free offering at the beginning. Once they have more and more clients, they can figure out how to make money.

  7. A couple of years ago, I tried the Kepler scientific workflow system. It worked OK, but passing info back and forth between R functions in different nodes ended up requiring me to write special-purpose code, and I eventually ended up moving back to plain R. Really interesting simulation platform in its own right, though, so OK if you’re mostly doing that sort of simulation and just need a little help from R here and there: https://kepler-project.org/users/downloads

  8. Is Alteryx only available on Windows? I thought I’d try following the link… but it just seems to download an .exe file. Just wondering if it just assumes Windows, or it is actually only available on Windows.

  9. Those seeking to follow a graphical/process interface will want to consider using open-source Knime (http://www.knime.org/) with R interfacing activated — Knime is a robust solution for process modelers — Knime interfaces not only with R, but just about anything else out there as well (Excel, LibreOffice Calc, Tableau, ODBC, you name it) — follow the link above for more.

    1. Hi William,

      I saw in the latest Rexer Analytics data mining poll that the use of Knime is growing. However, I’m under the impression that while you can add nodes to your flowchart that contain program code, there are no pre-defined R nodes already in Knime. Is this accurate? That is what Alteryx is offering.

      Cheers,
      Bob

  10. I miss a module for multiple response variables, which are the everyday staple for survey analysts. Unfortunately, I did not find anything. Or did I look in the wrong corners?

    1. Hi ftr,

      You picked one of my favorite topics. I don’t know if Alteryx does it but I’ll ask. Are you analyzing such data in R now and, if so, what package are you using?

      Cheers,
      Bob

      1. ftr,
        We have several tools that will allow for multinomial responses, which is what I think you mean. Specifically, the rpart function of the included rpart package, functions in the randomForest package that implement Breiman’s random forest model, and functions in the gbm package that implement Friedman’s gradient boosted models. We have not yet wrapped the multinom function of the nnet package into an easy to use graphical user interface, but likely will eventually. It can be incorporated into an Alteryx workflow at this moment, as can any other R function, but it requires users to write R code to do it.

        Dan

        1. Hi Dan,

          Our field is filled with so many similarly-named areas! I think he means “check all that apply” type questions that are popular on surveys. Like,

          “What R packages do you use (check all that apply)?”
          [ ] lubridate
          [ ] plyr
          [ ] stringr

          If you get frequencies or crosstabs on them, the total count is the number of items checked rather than the number of subjects who responded. By an interesting coincidence, this package handles them, and it was just published yesterday! (10/24/2013) http://cran.r-project.org/web/packages/MRCV/index.html
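
The counting issue can be illustrated with a minimal base R sketch. The data frame below is made up for the example (one row per respondent, one 0/1 column per checkbox), and the code uses only base functions rather than the MRCV package mentioned above.

```r
# Hypothetical survey: 4 respondents, 1 = box checked, 0 = not checked.
survey <- data.frame(
  lubridate = c(1, 0, 1, 1),
  plyr      = c(1, 1, 0, 1),
  stringr   = c(0, 0, 1, 1)
)

item_counts   <- colSums(survey)    # how many respondents checked each item
total_checks  <- sum(item_counts)   # number of boxes checked overall
n_respondents <- nrow(survey)       # number of subjects who responded

# Percentages based on respondents can sum to more than 100,
# because one respondent may check several items.
pct_of_respondents <- 100 * item_counts / n_respondents

print(item_counts)
print(c(total_checks = total_checks, n_respondents = n_respondents))
print(pct_of_respondents)
```

Here the total number of checks is larger than the number of respondents, so the two totals cannot be used interchangeably as the base of a frequency table.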

          Cheers,
          Bob

          1. Bob,

            You understood my question and provided an answer that gives more than I asked for! This is the first time that I have seen such an approach and I shall take a deep look into the manual on CRAN. The statistical difficulty with multiple response variables is that a case (a respondent) can have more than one answer. And the answers are categorical variables.

            Cordialement Frank
