R GUI Reviews Updated

I have just finished updating my reviews of graphical user interfaces for the R language. These include BlueSky Statistics, jamovi, JASP, R AnalyticFlow, R Commander, R-Instat, Rattle, and RKWard. The permanent link to the article that summarizes it all is https://r4stats.com/articles/software-reviews/r-gui-comparison/.
I list the highlights below in this post so that it reaches all the blog aggregators. If you have suggestions for improving any of the reviews, please let me know at muenchen.bob@gmail.com.

With so many detailed reviews of Graphical User Interfaces (GUIs) for R available, which should you choose? It’s not too difficult to rank them based on the number of features they offer, so I’ll start there. Then, I’ll follow with a brief overview of each.

I’m basing the counts on the number of dialog boxes in each of the following categories:

  • Ease of Use
  • General Usability
  • Graphics
  • Analytics
  • Reproducibility

This data is trickier to collect than you might think. Some software has fewer menu choices, depending on more detailed dialog boxes instead. Studying every menu and dialog box is very time-consuming, but that is what I’ve tried to do to keep this comparison trustworthy. Each development team has had a chance to look the data over and correct errors.

Perhaps the biggest flaw in this methodology is that every feature adds only one point to each GUI’s total score. I encourage you to download the full dataset to consider which features are most important to you. If you decide to make your own graphs with a different weighting system, I’d love to hear from you in the comments below.
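If you do download the data, re-weighting and re-plotting takes only a few lines of R. Here is a minimal sketch; the file name, column names, and weights are hypothetical placeholders for whatever the downloaded dataset actually contains:

# One row per feature, a Category column, and one 0/1 column per GUI
features <- read.csv("r-gui-features.csv")

# Example weighting: count Graphics features double, everything else once
weights <- ifelse(features$Category == "Graphics", 2, 1)

gui_cols <- c("BlueSky", "jamovi", "JASP", "RKWard")   # a subset, for illustration
scores   <- sapply(features[gui_cols], function(x) sum(x * weights))

barplot(sort(scores), las = 2, ylab = "Weighted feature score")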

Ease of Use

I’ve defined ease of use primarily by how well each GUI meets its main goal: avoiding code. Each GUI gets one point for each of the following abilities, which include being able to install, start, and use the GUI to its maximum effect, including publication-quality output, without knowing anything about the R language itself. Figure 1 shows the result. R Commander is abbreviated Rcmdr, and R AnalyticFlow is abbreviated RAF. The commercial BlueSky Pro comes out on top by a slim margin, followed closely by JASP and RKWard. None of the GUIs achieved the highest possible score of 14, so there is room for improvement.

  • Installs without the use of R
  • Starts without the use of R
  • Remembers recent files
  • Hides R code by default
  • Use its full capability without using R
  • Data editor included
  • Pub-quality tables w/out R code steps
  • Simple menus that grow as needed
  • Table of Contents to ease navigation
  • Variable labels ease identification in the output
  • Easy to move blocks of output
  • Ease reading columns by freezing headers of long tables
  • Accepts data pasted from the clipboard
  • Easy to move header row of pasted data into the variable name field
Figure 1. The number of ease of use features offered by each R GUI.

General Usability

This category is dominated by data-wrangling capabilities, where data scientists and statisticians spend most of their time. It also includes various types of data input and output. We see in Figure 2 that both BlueSky versions and R-Instat come out on top not just due to their excellent selection of data-wrangling features but also due to their use of the rio package for importing and exporting files. The rio package combines the import/export capabilities of many other packages, and it is easy to use. I expect the other GUIs will eventually adopt it, raising their scores by around 20 points.
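For readers who haven’t used rio, the whole import/export process reduces to a single pair of functions. A minimal sketch (the file names are made up):

library(rio)

mydata <- import("survey.sav")    # file format is inferred from the extension
export(mydata, "survey.xlsx")     # write the same data back out as an Excel file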

  • Operating systems (how many)
  • Import data file types (how many)
  • Import from databases (how many)
  • Export data file types (how many)
  • Languages displayable in UI (how many, besides English)
  • Easy to repeat any step by groups (split-file)
  • Multiple data files open at once
  • Multiple output windows
  • Multiple code windows
  • Variable metadata view
  • Variable types (how many)
  • Variable search/filter in dialogs
  • Variable sort by name
  • Variable sort by type
  • Variable move manually
  • Model Builder (how many effect types)
  • Magnify GUI for teaching
  • R code editor
  • Comment/uncomment blocks of code
  • Package management (comes with R and all packages)
  • Output: word processing features
  • Output: R Markdown
  • Output: LaTeX
  • Data wrangling (how many)
  • Transform across many variables at once (e.g., row mean)
  • Transform down many variables at once (e.g., log, sqrt)
  • Assign factor labels across many variables at once
  • Project saves/loads data, dialogs, and notes in one file
Figure 2. The number of general usability features in each R GUI.

Graphics

This category consists mainly of the number of graphics each software offers. However, the other items can be very important to completing your work. Arguably, they should add more than one point each to the graphics score, but I scored each as one point since some people will view them as very important while others might not need them at all. Be sure to see the full reviews or download the Excel file if those features are important to you. Figure 3 shows the total graphics score for each GUI. R-Instat has a solid lead in this category. In fact, this underestimates R-Instat’s ability if you include its options to layer any “geom” on top of another graph. However, that requires knowing the geoms and how to use them, which is knowledge of R code, of course.

When studying these graphs, it’s important to consider the difference between the relative and absolute performance. For example, relatively speaking, R Commander is not doing well here, but it does offer over 25 types of plots! That absolute figure might be fine for your needs.

Continued…

BlueSky Statistics Version 10 is Not Open Source

BlueSky Statistics is a graphical user interface for the powerful R language. On July 10, 2024, the BlueskyStatistics.com website said:

“…As the BlueSky Statistics version 10 product evolves, we will continue to work on orchestrating the necessary logistics to make the BlueSky Statistics version 10.x application available as an open-source project. This will be done in phases, as we did for the BlueSky Statistics 7.x version. We are currently rearchitecting its key components to allow the broader community to make effective contributions. When this work is complete, we will open-source the components for broader community participation…”

In the current statement (September 5, 2024), the sentence regarding version 10.x becoming open source is gone. This line was added:

“…Revenue from the commercial (Pro) version plays a vital role in funding the R&D needed to continue to develop and support the open-source (BlueSky Statistics 7.x) version and the free version (BlueSky Statistics 10.x Base Edition)…”

I have verified with the founders that they no longer plan to release version 10 with an open-source license. I’m disappointed by this change as I have advocated for and written about open source for many years.

There are many advantages of open-source licensing over proprietary. If the company decides to stop making version 10 free, current users will still have the right to run the currently installed version, but they will only be able to get the next version if they pay. If it were open source, its users could move the code to another repository and base new versions on that. That scenario has certainly happened before, most notably with OpenOffice. BlueSky LLC has announced no plans to charge for future versions of BlueSky Base Edition, but they could.

I have already updated the references on my website to reflect that BlueSky v10 is not open source. I wish I had been notified of this change before telling many people at the JSM 2024 conference that I was demonstrating open-source software. I apologize to them.

BlueSky Statistics Enhancements

BlueSky Statistics is a free and open-source graphical user interface for the powerful R language. There is also a commercial “Pro” version that offers tech support, priority feature requests, and many powerful additional features. The Pro version has been beefed up considerably with the new features below. These features apply to quality control, general statistics, team collaboration, project management, and scripting. Many are focused on quality control and Six Sigma as a result of requests from organizations migrating from Minitab and JMP. However, both versions of BlueSky Statistics offer a wide range of statistical, graphical, and machine-learning methods.

The free version saves every step of the analysis for full reproducibility. However, repeating the analysis is a step-by-step process. The Pro version can now rerun the entire set at once, substituting other datasets when needed.

You can obtain either version at https://BlueSkyStatistics.com. A detailed review is available at https://r4stats.com/articles/software-reviews/bluesky/. If you plan to attend the Joint Statistical Meetings (JSM) in Portland next week, stop by Booth 406 to get a demonstration. We hope to see you there!

Copy and Paste data from Excel

Copy data from Excel and paste it into the BlueSky Statistics data grid. This is in addition to the existing mechanism of bringing data into BlueSky Statistics through file import for various file formats.

Undo/Redo data grid edits

Single-item and multi-item data edits can be discarded with undo and restored with redo.

Project save/open to save/open all work (all open datasets and output analysis)

Analyses can be saved into one or more projects. Each project contains all the datasets along with all the analyses and any R code from the editor. Projects can be exported and shared with other BlueSky Statistics users as .bsp files (“bsp” is an abbreviation of BlueSky Statistics Project; the file itself is a zip archive). Users can import projects, see all the datasets and analyses stored in them, and subsequently add, modify, or rerun all the analyses.

Enhanced cleaning/adjustment of copied/imported Excel/CSV data on the data grid

Dataset > Excel Cleanup

Several enhancements add data cleanup/adjustment options (e.g., rows, columns, data types) to the existing Excel Cleanup dialog, regardless of whether the data reached the data grid through the file open option or by copying and pasting from an Excel/CSV file.

Renaming output tabs

Double-clicking an output tab opens a dialog box asking for the new name; typing a name renames the tab.

Enhanced Pie Chart and Bar Chart

Graphics > Pie Charts > Pie Chart
Graphics > Bar Chart

The pie chart and bar chart have been enhanced to show percentages and counts on the plot.

Scatterplot Matrix

Graphics > Scatterplot Matrix

A Scatter Plot Matrix dialog has been added.

Scatterplot with mean and confidence interval bar

Graphics > Scatterplot > Scatter Plot with Intervals

A Scatter Plot dialog with mean and confidence interval bars is now available, with an unlimited number of grouping variables for the X-axis to group a numeric variable on the Y-axis.

Enhanced scatterplot with both horizontal and vertical reference lines

Graphics > Scatterplot > Scatter Plot Ref Lines

The Scatterplot dialog has been enhanced so that users can add an unlimited number of horizontal and vertical reference lines to the plot.

Enhancements to BlueSky Statistics R Editor and Output Syntax/Code Editor

For R programmers, many enhancements have been made to the BlueSky R Editor and the output syntax/code editor to improve ease of use and productivity, including tooltips, find and replace, undo/redo, and comment/uncomment blocks.

Enhanced Normal Distribution Plot

Distribution > Normal > Normal Distribution Plot with Labels

• The normal distribution plot now shows the computed probability and x values on the plot for the shaded area for an x value and for quantiles, respectively
• Plot one tail (left or right), two tails, and other ranges

Automatic randomization when sampling from the normal distribution

Distribution > Normal > Sample from Normal Distribution

In addition to setting a seed value for reproducibility, the default option now randomizes the sample data generation automatically every time.
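In R terms, the difference is simply whether a seed is fixed before the draw. A tiny sketch (the sample size and parameters are arbitrary):

set.seed(123)                        # fixing a seed makes the sample reproducible
x <- rnorm(100, mean = 50, sd = 5)   # identical values on every run
# Omitting set.seed() entirely (the new randomized default) yields a
# different sample each time the dialog or script is run.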

Automatic randomization when creating all DoE designs

DOE > Create Design > ….

In addition to setting a seed value for reproducibility, the default option now randomizes the creation of any DoE design automatically every time.

Enhanced Distribution Fit analysis

Analysis > Distribution Analysis > Distribution Fit P-value

The distribution fit analysis has been enhanced to compute Anderson-Darling (AD), Kolmogorov-Smirnov (KS), and Cramér-von Mises (CVM) tests and to show the test statistics along with their p-values. These help users determine the best fit, in addition to the existing AIC and BIC values.

Moreover, an option has been introduced for users to see only the comparison of distributions and skip the display of each individual distribution fit.
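To give a feel for what such an analysis involves, here is one way to compare candidate distributions in plain R using the fitdistrplus package; this is just an illustration, not necessarily the code BlueSky generates. The vector x is assumed to hold positive measurements:

library(fitdistrplus)

fits <- list(normal    = fitdist(x, "norm"),
             lognormal = fitdist(x, "lnorm"),
             weibull   = fitdist(x, "weibull"))

# KS, CvM, and AD statistics plus AIC and BIC, side by side for all three fits
gofstat(fits, fitnames = names(fits))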

Tolerance Intervals

Six Sigma > Tolerance Intervals

A new Tolerance Intervals analysis has been introduced. A tolerance interval gives the range of values within which, with a stated level of confidence, a specified proportion of the population is expected to fall.
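For comparison, the same statistic can be computed in plain R with the tolerance package; this is only a sketch, with x again a hypothetical vector of measurements:

library(tolerance)

# Two-sided interval expected to contain 99% of the population, with 95% confidence
normtol.int(x, alpha = 0.05, P = 0.99, side = 2)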

Equivalence (and Minimal Effect) test

Analysis > Means > Equivalence test

This new feature tests for mean equivalence and minimal effects.

Nonlinear Least Squares – all-purpose nonlinear regression modeling

Model Fitting > Nonlinear Least Square

Performs nonlinear regression with flexibility and many user options to model, test, and plot.
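In plain R, this kind of model is typically fit with nls(). A minimal sketch with made-up data and starting values:

# Exponential decay example: y = a * exp(-b * t) plus noise
df  <- data.frame(t = 0:10, y = 5 * exp(-0.3 * (0:10)) + rnorm(11, sd = 0.1))
fit <- nls(y ~ a * exp(-b * t), data = df, start = list(a = 4, b = 0.2))

summary(fit)                                  # parameter estimates and tests
plot(df$t, df$y); lines(df$t, fitted(fit))    # quick look at the fitted curve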

Polynomial models with different degrees

Model Fitting > Polynomial

Fits an orthogonal polynomial model of a specified degree and can optionally compare multiple polynomial models of different degrees side by side.
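In plain R, orthogonal polynomials come from poly(), and models of different degrees can be compared with anova(). A sketch with a hypothetical data frame mydata containing x and y:

fit2 <- lm(y ~ poly(x, 2), data = mydata)   # poly() builds orthogonal polynomial terms
fit3 <- lm(y ~ poly(x, 3), data = mydata)
anova(fit2, fit3)                           # does the cubic term improve the fit?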

Enhanced Pareto Chart

Six Sigma > Pareto Chart > Pareto Chart

A new option handles data that has no count column, only raw data; the cumulative frequency is computed automatically from the raw data for plotting.
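In plain R, the same idea amounts to tabulating the raw data first; a sketch using the qcc package, which may or may not be what BlueSky calls internally:

library(qcc)

defects <- c("scratch", "dent", "scratch", "paint", "scratch", "dent", "paint")
pareto.chart(table(defects))   # table() supplies the counts; cumulative % is computed for the plot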

Frequency analysis with an option to draw a Pareto chart

Analysis > Summary > Frequency Plot

A new dialog has been introduced that can optionally plot a Pareto chart from the frequency table and, if desired, display the frequency table on the data grid.

MSA (Measurement System Analysis) Enhancements

Gage Study Design Table

Six Sigma > MSA > Design MSA Study

Users can generate a randomized experimental design table for any combination of operators, parts, and replications. This sets up a Gage study table for performing the experiments and collecting the results, which can then be analyzed for the accuracy of the gage under study with analyses such as Gage R&R and Gage Bias.

Enhanced Gage R&R

Six Sigma > MSA > Gage R&R

Many enhancements and options have been introduced to the Gage R&R dialog and the underlying analysis:

• Report header table
• Enlarged graphs
• Nested gage data analysis, in addition to crossed
• Use of a historical process standard deviation to estimate Gage Evaluation values (%StudyVar table)
• Show %Process
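For readers who want to try a crossed Gage R&R in plain R, the SixSigma package offers one open-source route, shown here with that package’s built-in example data; this is only an illustration, not the code BlueSky itself runs:

library(SixSigma)

# Crossed study: measurement (time1) by part (prototype) and operator (appr)
ss.rr(var = time1, part = prototype, appr = operator, data = ss.data.rr)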

Enhanced Gage Attribute Analysis

Six Sigma > MSA > Attribute Analysis

Many enhancements and options have been introduced to the Attribute Analysis dialog and the underlying analysis:

• Report header table
• Accuracy and classification rate calculations, in addition to agreement and disagreement
• Optional Cohen’s Kappa statistics (between each pair of raters) in addition to Fleiss’ Kappa (multiple raters)

Enhanced Gage Bias Analysis

Six Sigma > MSA > Gage Bias Analysis

Many enhancements and options have been introduced to the Gage Bias Analysis dialog and the underlying analysis:

• An efficient single dialog with options for linearity and type-1 tests for one or more reference values
• A new option, “Method to use for estimating repeatability std dev”
• Cg and Cgk calculated for different reference values in one pass
• Run charts for every reference value and an overall run chart across all reference values
• Use of a historical standard deviation to calculate RF (Reference Figure)
• %RE and %EV are introduced, and all tables show how the computed values compare to the required/cut-off values specified by the user in the dialog

PCA (Process Capability Analysis) Enhancements

Enhanced Process Capability Analysis (for normal data)

Six Sigma > Process Capability > Process Capability

• Ppl = Ppk and Ppu = Ppk are shown when a one-sided tolerance is used
• Underscores removed so the indices display simply as Ppl, Ppk, Ppu, Cp, Cpk, etc.
• A new option, “Do not use unbiasing constant to estimate std dev for overall process capability indices”, for computing the overall Ppk (Ppl)
• Underlying charts (xbar.one) renamed to MR or I Chart, depending on whether SD or MR is used
• Handling of missing values
• Customizable number of decimals to show on the plot
• Standard deviation labels on the plot marked as “Overall StdDev” and “Within StdDev”
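As a point of reference, a basic capability analysis for normal data can be run in plain R with the qcc package; the column name and spec limits below are hypothetical, and this need not match BlueSky’s output:

library(qcc)

q <- qcc(mydata$diameter, type = "xbar.one", plot = FALSE)   # individuals chart object
process.capability(q, spec.limits = c(9.8, 10.2))            # Cp, Cpk, and related indices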

Process Capability Analysis for non-normal data

Six Sigma > Process Capability > Process Capability (Non-Normal)

A new dialog has been introduced to perform process capability analysis for non-normal data.

Multi-Vari graph

Six Sigma > Multi-Vari Chart

A new option allows adjusting horizontal and vertical position offsets to place the values for the data points on the plot.

Enhanced Shewhart Charts

Six Sigma > Shewhart Charts > …….

A new option has been added to all Shewhart Charts dialogs: the ability to add any number of user-specified spec/reference lines to the chart.
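In plain R, user-specified reference lines can be layered onto a control chart with base graphics; a rough sketch with qcc and hypothetical spec values (BlueSky’s own charts are drawn differently):

library(qcc)

qcc(mydata$diameter, type = "xbar.one")           # draws an individuals (I) chart
abline(h = c(9.8, 10.2), lty = 2, col = "blue")   # add horizontal spec/reference lines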

Minitab Alternative BlueSky Statistics to Display Its Graphical Interface to R at ASQ Conference

BlueSky Statistics will exhibit at the American Society for Quality World Conference on Quality and Improvement in San Diego from 12 to 14 May, at Booth #720 (https://asq.org/conferences/wcqi/solution-partners).

BlueSky Statistics is R-based software for statistics, data science, Six Sigma, DoE, and more. An alternative to Minitab and similar packages, it is designed to help quality professionals identify areas for process and quality improvement and eliminate waste, driving continuous improvement with process capability analysis (PCA), control charts, measurement system analysis (MSA), design of experiments (DoE), distribution analysis, and many more statistical methods.

See BlueSky Statistics GUI for R at JSM 2023

Are you attending this year’s Joint Statistical Meetings in Toronto? If so, stop by Booth 404 to see the latest features of BlueSky Statistics. A menu-based graphical user interface for the R language, BlueSky lets people access the power of R without having to learn to program. Programmers can easily add code to BlueSky’s menus, sharing their expertise with non-programmers. My detailed review of BlueSky is here, a brief comparison to other R GUIs is here, and the BlueSky User Guide is here. I hope to see you in Toronto! [Epilog: at the meeting, I did not know the company had decided to keep the latest version closed-source. Sorry to those I inadvertently misled at the conference.]

Update to Data Science Software Popularity

I’ve updated The Popularity of Data Science Software‘s market share estimates based on scholarly articles. I posted it below, so you don’t have to sift through the main article to read the new section.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool or even as an object of study.

Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.

Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 4,500 articles) in the most recent complete year, 2022.

Figure 2a. The number of scholarly articles found on Google Scholar for data science software. Only those with more than 4,500 citations are shown.

SPSS is the most popular package, as it has been for over 20 years. This may be due to its balance between power and its graphical user interface’s (GUI) ease of use. R is in second place with around two-thirds as many articles. It offers extreme power, but as with all languages, it requires memorizing and typing code. GraphPad Prism, another GUI-driven package, is in third place. The packages from MATLAB through TensorFlow are roughly at the same level. Next come Python and Scikit Learn. The latter is a library for Python, so there is likely much overlap between the two. Note that the general-purpose languages C, C++, C#, FORTRAN, Java, MATLAB, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest. Old stalwart FORTRAN appears last in this plot. While its count seems close to zero, that’s due to the wide range of this scale; its count is just over the 4,500-article cutoff for this plot.

Continuing on this scale would make the remaining packages appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 4,500 rather than the 110,000 used in Figure 2a. I chose that cutoff value because it allows us to see two related sets of tools on the same plot: workflow tools and GUIs for the R language that make it work much like SPSS.

Figure 2b. Number of scholarly articles using each data science software found using Google Scholar. Only those with fewer than 4,500 citations are shown.

JASP and jamovi are both front-ends to the R language and are way out front in this category. The next R GUI is R Commander, with half as many citations. Still, that’s far more than the rest of the R GUIs: BlueSky Statistics, Rattle, RKWard, R-Instat, and R AnalyticFlow. While many of these have low counts, we’ll soon see that the use of nearly all is rapidly growing.

Workflow tools are controlled by drawing 2-dimensional flowcharts that direct the flow of data and models through the analysis process. That approach is slightly more complex to learn than SPSS’ simple menus and dialog boxes, but it gets closer to the complete flexibility of code. In order of citation count, these include RapidMiner, KNIME, Orange Data Mining, IBM SPSS Modeler, SAS Enterprise Miner, Alteryx, and R AnalyticFlow. From RapidMiner to KNIME, to SPSS Modeler, the citation rate approximately cuts in half each time. Orange Data Mining comes next, at around 30% less. KNIME, Orange, and R Analytic Flow are all free and open-source.

While Figures 2a and 2b help study market share now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each software, but collecting that much data is too time-consuming. Instead, I’ve collected data only for the years 2019 and 2022. This provides the data needed to study growth over that period.

Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side) and the declining or “cooling” ones shown in blue (left side).

Figure 2c. Change in Google Scholar citation rate from 2019 to the most recent complete year, 2022. BlueSky (2,960%) and jamovi (452%) growth figures were shrunk to make the plot more legible.

Seven of the 14 fastest-growing packages are GUI front-ends that make R easy to use. BlueSky’s actual percent growth was 2,960%, which I recoded as 220% because the original value made the rest of the plot unreadable. In 2022, the company released a Mac version, and the Mayo Clinic announced its migration from JMP to BlueSky; both likely had an impact. Similarly, jamovi’s actual growth was 452%, which I recoded to 200%. One of the reasons the R GUIs were able to obtain such high percentages of change is that they were all starting from low numbers compared to most of the other software. So be sure to look at Figure 2b to see the raw counts for all the R GUIs.

The most impressive point on this plot is the one for PyTorch. Back in Figure 2a, we saw that PyTorch was the fifth most popular tool for data science. Here we see it’s also the third fastest growing. Being big and growing fast is quite an achievement!

Of the workflow-based tools, Orange Data Mining is growing the fastest. There is a good chance that the next time I collect this data Orange will surpass SPSS Modeler.

The big losers in Figure 2c are the expensive proprietary tools: SPSS, GraphPad Prism, SAS, BMDP, Stata, Statistica, and Systat. However, open-source R is also declining, perhaps a victim of Python’s rising popularity.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d, I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of Google Scholar citations for each classic statistics package per year from 1995 through 2016.

SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009, and its use is in sharp decline. SAS never came close to SPSS’s level of dominance, and its usage peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.

Figure 2e. The number of Google Scholar citations for each classic statistics package from 1995 through 2016, with SPSS removed and SAS included only in 2014 and 2015. Removing SPSS and most of the SAS data expands the scale, making it easier to see the rapid growth of the less popular packages.

Figure 2e shows that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. The decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this graph.

These results apply to scholarly articles in general. The results in specific fields or journals are likely to differ.

You can read the entire Popularity of Data Science Software here; the above discussion is just one section.

Updated Comparison of R Graphical User Interfaces

I have just updated my detailed reviews of Graphical User Interfaces (GUIs) for R, so let’s compare them again. It’s not too difficult to rank them based on the number of features they offer, so let’s start there. I’m basing the counts on the number of dialog boxes in each of four categories:

• Ease of Use
• General Usability
• Graphics
• Analytics

This is trickier data to collect than you might think. Some software has fewer menu choices, depending instead on more detailed dialog boxes. Studying every menu and dialog box is very time-consuming, but that is what I’ve tried to do. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own categories. If you decide to make your own graphs, I’d love to hear from you in the comments below.

Figure 1 shows how the various GUIs compare on the average rank of the four categories. R Commander is abbreviated Rcmdr, and R AnalyticFlow is abbreviated RAF. We see that BlueSky is in the lead with R-Instat close behind. As my detailed reviews of those two point out, they are extremely different pieces of software! Rather than spend more time on this summary plot, let’s examine the four categories separately.

Figure 1. Mean of each R GUI’s ranking of the four categories. To make this plot consistent with the others below, the larger the rank, the better.

For the category of ease-of-use, I’ve defined it mostly by how well each GUI does what GUI users are looking for: avoiding code. They get one point each for being able to install, start, and use the GUI to its maximum effect, including publication-quality output, without knowing anything about the R language itself. Figure 2 shows the result. JASP comes out on top here, with jamovi and BlueSky right behind.

Figure 2. The number of ease-of-use features that each GUI has.

Figure 3 shows the general usability features each GUI offers. This category is dominated by data-wrangling capabilities, where data scientists and statisticians spend most of their time. This category also includes various types of data input and output. BlueSky and R-Instat come out on top not just due to their excellent selection of data wrangling features but also due to their use of the rio package for importing and exporting files. The rio package combines the import/export capabilities of many other packages, and it is easy to use. I expect the other GUIs will eventually adopt it, raising their scores by around 40 points. JASP shows up at the bottom of this plot due to its philosophy of encouraging users to prepare the data elsewhere before importing it into JASP.

Figure 3. Number of general usability features for each GUI.

Figure 4 shows the number of graphics features offered by each GUI. R-Instat has a solid lead in this category. In fact, this underestimates R-Instat’s ability if you…

Continued…

Rexer Analytics Survey Results

Rexer Analytics has released preliminary results showing the usage of various data science tools. I’ve added the results to my continuously-updated article, The Popularity of Data Analysis Software. For your convenience, the new section is repeated below.

Surveys of Use

One way to estimate the relative popularity of data analysis software is through a survey. Rexer Analytics conducts such a survey every other year, asking a wide range of questions regarding data science (previously referred to as data mining by the survey itself). Figure 6a shows the tools that the 1,220 respondents reported using in 2015.

Figure 6a. Analytics tools used by respondents to the Rexer Analytics Survey. In this view, each respondent was free to check multiple tools.

We see that R has a more than 2-to-1 lead over the next most popular packages, SPSS Statistics and SAS. Microsoft’s Excel Data Mining software is slightly less popular, but note that it is rarely used as the primary tool. Tableau comes next, also rarely used as the primary tool. That’s to be expected as Tableau is principally a visualization tool with minimal capabilities for advanced analytics.

The next batch of software appears at first to be all in the 15% to 20% range, but KNIME and RapidMiner are listed both in their free versions and, much further down, in their commercial versions. These data come from a “check all that apply” type of question, so if we add the two amounts, we may be overcounting. However, the survey also asked, “What one (my emphasis) data mining / analytic software package did you use most frequently in the past year?” Using these data, I combined the free and commercial versions and plotted the top 10 packages again in Figure 6b. Since other software combinations are likely (e.g., SAS and Enterprise Miner, or SPSS Statistics and SPSS Modeler), I combined a few others as well.

Figure 6b. The percent of survey respondents who checked each package as their primary tool. Note that free and commercial versions of KNIME and RapidMiner are combined. Multiple tools from the same company are also combined. Only the top 10 are shown.

In this view, we see R as even more dominant, with over a 3-to-1 advantage compared to the software from IBM SPSS and SAS Institute. However, the overall ranking of the top three didn’t change. KNIME, however, rises from 9th place to 4th, and RapidMiner rises as well, from 10th place to 6th. KNIME has roughly a 2-to-1 lead over RapidMiner, even though the two packages have similar capabilities and both use a workflow user interface. This may be due to RapidMiner’s move to a more commercially oriented licensing approach. For free, you can still get an older version of RapidMiner or a version of the latest release that is quite limited in the types of data files it can read. Even the academic license for RapidMiner is constrained by the fact that the company views “funded activity” (e.g., research done on government grants) the same as commercial work. The KNIME license is much more generous, as the company makes its money from add-ons that increase productivity, collaboration, and performance rather than limiting analytic features or access to popular data formats.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R? I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Learning R: Live Webinar, Interactive Self-Paced, or Site Visit?

My recent blog post, Why R is Hard to Learn, must have hit a nerve, as it was read by over 6,000 people in its first two days online. If you’re using R to augment your work in SAS, SPSS, or Stata, or you’re considering switching to R, my workshops can help minimize many of those headaches by pointing out the commands and options that frustrate users of those packages the most. You will also find out which of the thousands of R packages will give you the output you’re most used to.

My next two live webinars, done in partnership with Revolution Analytics, are in January:
R for SAS, SPSS and Stata Users
Managing Data with R (updated to include dplyr, broom, tidyr, etc.)
Course outlines and registration for both are here.

My R for SAS, SPSS and Stata Users workshop is also now available as a self-paced interactive video workshop at DataCamp.com.

I do site visits in partnership with RStudio.com, whose software I recommend and use in every form of my training. If your company does its training through Xerox Learning Services, I also partner with them. For further details or to arrange a site visit, you can reach me at muenchen.bob@gmail.com.

Specifying Variables in R

R has several ways to specify which variables to use in an analysis. Some of the most frustrating errors can result from not understanding the order in which R searches for variables. This post demonstrates that order, hopefully smoothing your future use of R.

If all your variables are vectors in your workspace, using them in an analysis is easy: simply name them. For example, you could build a linear model (regression) using the lm function like this:

lm(y ~ x)

However, data frames exist for a good reason. They help organize variables and keep the values of each observation (the rows) locked together. For example, when you sort a data frame, all the rows are moved, not just the single variable you’re sorting on. Once variables are stored in a data frame, however, referring to them gets more complicated. R can include variables from multiple places (e.g., two data frames, or a data frame and the workspace), so it becomes important to know your options and how R views them.

You can specify the names of both a data frame and a variable using the compound forms mydata$myvar or mydata["myvar"]. However, that often means that you have to type the name of the data frame quite a lot.

If you use the form "with(mydata,…", then R will look in that data frame for the “short” variable names before it looks elsewhere, like in your workspace. That allows you to type the data frame name only once per function call, but in a long program you would still end up typing it a lot.

Modeling functions in R often let you specify "data = mydata", allowing you to use short variable names in formulas like "y ~ x". The result is like the “with” function: you must type the data frame name once per function call. (SAS users take note: variables used outside of formulas will not be found with this approach!)

Finally, you can attach the data frame with "attach(mydata)". This copies the variables into a temporary space that lets you then refer to them by their short names. This has the big advantage of allowing all the following function calls to use short variable names. Unfortunately, it has the big disadvantage of being confusing. Confusion #1 is that people feel that variables they create will go into the data frame automatically; they will not. Unless you specify a data frame using either mydata$newvar or mydata["newvar"], new variables are created in your workspace. Confusion #2 is that R will look in your workspace before it looks at the attached versions of variables. So if variables with the same names exist there, those will be used instead. Confusion #3 is that even though detach(mydata) will reverse the process, if you run your program multiple times, you may have attached the data multiple times, and detaching once does not fully undo the attached state. As confusing as that is, I use attach frequently and rarely get burned by it.

For example, with variables x and y stored in mydata (and nowhere else), you could do a linear regression model using any one of these approaches:

lm(mydata$y ~ mydata$x)

lm(mydata["y"] ~ mydata["x"])

with(mydata, lm(y ~ x))

lm(y ~ x, data = mydata)

attach(mydata)
lm(y ~ x)

As if that weren’t complicated enough, x and y do not both have to be in the same data frame! The x variable could be in mydata, and the y variable could be in the workspace, in an attached version of mydata, or in some other data frame. That would be dangerous, of course, since it would be up to you to ensure that the values of each observation match, or the resulting model would be nonsense. However, this kind of flexibility can also be very useful.

With all this flexibility, it’s important to know the order in which R chooses variables. A simple example can show us the order R uses. Here I am creating four data frames whose x and y variables will have a slope that is indicated by the data frame name. For example, the variables in df10 have a slope of 10. This will make it easy for us to see which version of the variables R is using.

> y <- c(1,2,3,4,5,6,7,8,9,10)
> x <- c(1,2,5,5,5,5,5,8,9,10)
> df1    <- data.frame(x, y)
> df10   <- data.frame(x, y = y*10  )
> df100  <- data.frame(x, y = y*100 )
> df1000 <- data.frame(x, y = y*1000)
> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Notice that I have deleted the original x and y variables, so at the moment, variables x and y exist only within the data frames. Running a regression with lm(y ~ x) will not work since R does not look into data frames unless you tell it to. Even if it did, it would have no way to know which set of x’s and y’s to use. Next, I will take two different approaches to “selecting” a data frame. I attach df1 and copy the variables from df10 into the workspace.

> attach(df1)
> y <- df10$y
> x <- df10$x

Next, I do something rarely useful, calling a linear model using both “with” and “data=”. Which will dominate?

> with(df100, lm(y ~ x, data = df1000))

Call:
lm(formula = y ~ x, data = df1000)

Coefficients:
(Intercept)            x  
          0         1000

Since the slope is 1000, it’s clear that the “data=” argument was dominant, so R looks there first. If it found both x and y, it would stop looking. But if it found only one variable, it would continue to look elsewhere for the other. If the other variable were in the “with” data frame, it would then use it.

Next I’ll remove the “data” argument and see what happens.

> with(df100, lm(y ~ x))

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0          100

This time the “with” data frame was used for both variables. If either variable had not been in that data frame, R would have continued to look in the workspace and in the attached copy. But which would it use first? Next, I’m not specifying a data frame at all.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0           10

The slope of 10 tells us that it found the copies of x and y that I copied from df10 into the workspace. Let’s delete those variables and list the objects in our workspace to ensure that they’re gone.

> rm(y, x)
> ls()
[1] "df1"    "df10"   "df100"  "df1000"

Both x and y are clearly gone. So let’s see if we can still use them.

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          0            1

We deleted x and y, but we can still use them! However, we see from the slope of 1 that R has used a different pair of x and y variables. They’re the ones that were copied to my search path when I used “attach(df1)”. I had to remember that I had attached them. It’s this kind of confusion that makes many R users avoid using attach. Finally, I’ll detach df1 and see what happens.

> detach(df1)
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object 'y' not found

Now, even though all the data frames in our workspace contain an x and a y variable, R does not look inside them to find either one. Even if it did, it would have no way of knowing which to choose.

We have seen that R looks in various places for variables. In order, they are: what you specify in “data=”, the data frame named in “with(mydata,…”, your workspace, and finally attached copies of your data frame. The most recently attached copies are the ones it will use first. I hope this will help you use R with both less typing and less confusion.