Data Science Software Popularity Update

I have recently updated my extensive analysis of the popularity of data science software. This update covers perhaps the most important section, the one that measures popularity based on the number of job advertisements. I repeat it here as a blog post, so you don’t have to read the entire article.

Job Advertisements

One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what will become more popular in the future.

Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As its co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities.

Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g., scikit-learn, Apache Spark), while others are used in data science jobs and, more broadly, in report-writing jobs (e.g., SAS, Tableau). General-purpose languages (e.g., Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that require them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the results in this section use those procedures to make the required queries.

I collected the job counts discussed in this section on October 5, 2022. To measure percent change, I compare that to data collected on May 27, 2019. One might think that counts sampled on a single day would not be very stable, but they are. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002. I occasionally double-check some counts a month or so later and always get similar figures.

The number of jobs covers a very wide range from zero to 164,996, with a mean of 11,653.9 and a median of 845.0. The distribution is so skewed that placing all the counts on the same graph makes reading values difficult. Therefore, I split the graph into three, each with a different scale. A single plot with a logarithmic scale would be an alternative, but when I asked some mathematically astute people how various packages compared on such a plot, their estimates were so far off that I dropped that approach.

Figure 1a shows the most popular tools, those with at least 10,000 jobs. SQL is in the lead with 164,996 jobs, followed by Python with 150,992 and Java with 113,944. Next comes a set from C++/C# at 48,555, slowly declining to Microsoft’s Power BI at 38,125. Tableau, one of Power BI’s major competitors, is in that set. Next come R and SAS, both around 24K jobs, with R slightly in the lead. Finally, we see a set slowly declining from MATLAB at 17,736 to Scala at 11,473.

Figure 1a. Number of data science jobs for the more popular software (>= 10,000 jobs).

Figure 1b covers tools for which there are between 250 and 10,000 jobs. Alteryx and Apache Hive are at the top, both with around 8,400 jobs. There is quite a jump down to Databricks at 6,117, then much smaller drops from there to Minitab at 3,874. Then we see another big drop down to JMP at 2,693, after which things slowly decline until MLlib at 274.

Figure 1b. Number of jobs for less popular data science software tools, those with between 250 and 10,000 jobs.

The least popular set of software, those with fewer than 250 jobs, is displayed in Figure 1c. It begins with DataRobot and SAS’ Enterprise Miner, both near 182. That’s followed by Apache Mahout with 160, WEKA with 131, and Theano at 110. From RapidMiner on down, there is a slow decline until we finally hit zero at WPS Analytics. The latter is a version of the SAS language, so advertisements are likely to always list SAS as the required skill.

Figure 1c. Number of jobs for software having fewer than 250 advertisements.
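
The three-way split behind Figures 1a through 1c can be sketched in a few lines of Python. This is only an illustration: the counts are the handful quoted in the text above (not the full data set), and the function name figure_for is made up for this example.

```python
# Sketch of the three-way split used for Figures 1a-1c.
# Only the counts quoted in the text are included here.
job_counts = {
    "SQL": 164996,
    "Python": 150992,
    "Java": 113944,
    "MATLAB": 17736,
    "Scala": 11473,
    "Databricks": 6117,
    "Minitab": 3874,
    "JMP": 2693,
    "MLlib": 274,
    "Apache Mahout": 160,
    "WEKA": 131,
    "Theano": 110,
    "WPS Analytics": 0,
}


def figure_for(count):
    """Return the figure a job count would be plotted in (hypothetical helper)."""
    if count >= 10_000:
        return "1a"  # the more popular software
    if count >= 250:
        return "1b"  # between 250 and 10,000 jobs
    return "1c"      # fewer than 250 jobs


panels = {"1a": [], "1b": [], "1c": []}
for name, count in job_counts.items():
    panels[figure_for(count)].append(name)

for panel, names in panels.items():
    print(panel, names)
```

Because each figure gets its own axis, the values within a figure stay readable even though the counts span several orders of magnitude overall.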

Several tools use the powerful yet easy workflow interface: Alteryx, KNIME, Enterprise Miner, RapidMiner, and SPSS Modeler. The scale of their counts is too broad to make a decent graph, so I have compiled those values in Table 1. There we see Alteryx is extremely dominant, with 30 times as many jobs as its closest competitor, KNIME. The latter is around 50% greater than Enterprise Miner, while RapidMiner and SPSS Modeler are tiny by comparison.

Software            Jobs
Alteryx             8,566
KNIME               281
Enterprise Miner    181
RapidMiner          69
SPSS Modeler        17
Table 1. Job counts for workflow tools.
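
For readers who want to check the comparisons drawn from Table 1, here is a small Python sketch of the arithmetic. The variable and function names are my own for illustration; the counts are copied from the table.

```python
# Job counts copied from Table 1.
table1 = {
    "Alteryx": 8566,
    "KNIME": 281,
    "Enterprise Miner": 181,
    "RapidMiner": 69,
    "SPSS Modeler": 17,
}


def times_as_many(a, b):
    """How many times as many jobs tool a has compared to tool b."""
    return table1[a] / table1[b]


alteryx_vs_knime = times_as_many("Alteryx", "KNIME")
knime_vs_em = times_as_many("KNIME", "Enterprise Miner")

print(f"Alteryx has {alteryx_vs_knime:.1f} times as many jobs as KNIME")
print(f"KNIME has {(knime_vs_em - 1) * 100:.0f}% more jobs than Enterprise Miner")
```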

Let’s take a similar look at packages whose traditional focus was on statistical analysis. They have all added machine learning and artificial intelligence methods, but their reputation still lies mainly in statistics. We saw previously that when we consider the entire range of data science jobs, R was slightly ahead of SAS. Table 2 shows jobs with only the term “statistician” in their description. There we see that SAS comes out on top, though with such a tiny margin over R that you might see the reverse depending on the day you gather new data. Both are over five times as popular as Stata or SPSS, and ten times as popular as JMP. Minitab seems to be the only remaining contender in this arena.

Software      Jobs only for “Statistician”
SAS           1,040
R             1,012
Stata         176
SPSS          146
JMP           93
Minitab       55
Statistica    2
BMDP          3
Systat        0
NCSS          0
Table 2. Number of jobs for the search term “statistician” and each software.

Next, let’s look at the change in jobs from the 2019 data to now (October 2022), focusing on software that had at least 50 job listings back in 2019. Without such a limitation, software that increased from 1 job in 2019 to 5 jobs in 2022 would have a 400% increase but still would be of little interest. Percent change ranged from -64.0% to 2,479.9%, with a mean of 306.3 and a median of 213.6. There were two extreme outliers, IBM Watson, with apparent job growth of 2,479.9%, and Databricks, at 1,323%. Those two were so much greater than the rest that I left them off of Figure 1d to keep them from compressing the remaining values beyond legibility. The rapid growth of Databricks has been noted elsewhere. However, I would take IBM Watson’s figure with a grain of salt, as its growth in revenue seems nowhere near what Indeed.com’s job figure seems to indicate.

The remaining software is shown in Figure 1d, where those whose job market is “heating up” or growing are shown in red, while those that are cooling down are shown in blue. The main takeaway from this figure is that nearly the entire data science software market has grown over the last 3.5 years. At the top, we see Alteryx, with a growth of 850.7%. Splunk (702.6%) and Julia (686.2%) follow. To my surprise, FORTRAN follows, having gone from 195 jobs to 1,318, yielding growth of 575.9%! My supercomputing colleagues assure me that FORTRAN is still important in their area, but HPC is certainly not growing at that rate. If any readers have ideas on why this could occur, please leave your thoughts in the comments section below.
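
The percent change figures above follow the usual formula, (new - old) / old × 100. Here is a minimal Python sketch using the FORTRAN counts from the text; the function name and the MIN_JOBS_2019 constant are my own labels for the 50-job cutoff described earlier.

```python
MIN_JOBS_2019 = 50  # cutoff described in the text for inclusion in Figure 1d


def percent_change(old, new):
    """Percent change from the old count to the new count."""
    if old == 0:
        raise ValueError("percent change is undefined when the old count is zero")
    return (new - old) / old * 100


# FORTRAN went from 195 jobs in 2019 to 1,318 in 2022.
fortran_2019, fortran_2022 = 195, 1318

if fortran_2019 >= MIN_JOBS_2019:
    print(f"FORTRAN: {percent_change(fortran_2019, fortran_2022):.1f}%")
```

The cutoff matters because the formula divides by the 2019 count, so tiny baselines produce enormous percentages from a handful of ads.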

Figure 1d. Percent change in job listings from March 2019 to October 2022. Only software that had at least 50 jobs in 2019 is shown. IBM Watson (2,480%) and Databricks (1,323%) are excluded to maintain the legibility of the remaining values.

SQL and Java are both growing at around 537%. From Dataiku on down, the rate of growth slows steadily until we reach MLlib, which saw almost no change. Only two packages declined in job advertisements: WEKA at -29.9% and Theano at -64.1%.

This wraps up my analysis of software popularity based on jobs. You can read my ten other approaches to this task at https://r4stats.com/articles/popularity/. Many of those are based on older data, but I plan to update them in the first quarter of 2023, when much of the needed data will become available. To receive notice of such updates, subscribe to this blog, or follow me on Twitter: https://twitter.com/BobMuenchen.

Updated Comparison of R Graphical User Interfaces

I have just updated my detailed reviews of Graphical User Interfaces (GUIs) for R, so let’s compare them again. It’s not too difficult to rank them based on the number of features they offer, so let’s start there. I’m basing the counts on the number of dialog boxes in each of four categories:

  • Ease of Use
  • General Usability
  • Graphics
  • Analytics

This is trickier data to collect than you might think. Some software has fewer menu choices, depending instead on more detailed dialog boxes. Studying every menu and dialog box is very time-consuming, but that is what I’ve tried to do. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own categories. If you decide to make your own graphs, I’d love to hear from you in the comments below.

Figure 1 shows how the various GUIs compare on the average rank of the four categories. R Commander is abbreviated Rcmdr, and R AnalyticFlow is abbreviated RAF. We see that BlueSky is in the lead with R-Instat close behind. As my detailed reviews of those two point out, they are extremely different pieces of software! Rather than spend more time on this summary plot, let’s examine the four categories separately.

Figure 1. Mean of each R GUI’s ranking of the four categories. To make this plot consistent with the others below, the larger the rank, the better.
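
To make the “mean rank” idea concrete, here is a small Python sketch. The feature counts and the names gui_a, gui_b, and gui_c are placeholders invented for illustration, not the real figures from the appendix. It also assumes that tied GUIs share the average of their ranks, which may differ from how the plot was actually made.

```python
def rank_larger_is_better(scores):
    """Rank scores so the largest gets the largest rank; ties share the mean rank."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any run of tied scores.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions
        for name, _ in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


# Placeholder feature counts per category (NOT real data).
feature_counts = {
    "ease_of_use": {"gui_a": 10, "gui_b": 14, "gui_c": 12},
    "general_usability": {"gui_a": 60, "gui_b": 20, "gui_c": 45},
    "graphics": {"gui_a": 25, "gui_b": 25, "gui_c": 40},
    "analytics": {"gui_a": 80, "gui_b": 50, "gui_c": 30},
}

category_ranks = {
    category: rank_larger_is_better(scores)
    for category, scores in feature_counts.items()
}

mean_rank = {
    gui: sum(ranks[gui] for ranks in category_ranks.values()) / len(category_ranks)
    for gui in feature_counts["ease_of_use"]
}

for gui, value in sorted(mean_rank.items(), key=lambda item: -item[1]):
    print(gui, value)
```

Ranking within each category before averaging keeps a category with large raw counts, such as analytics, from swamping the others.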

For the category of ease-of-use, I’ve defined it mostly by how well each GUI does what GUI users are looking for: avoiding code. They get one point each for being able to install, start, and use the GUI to its maximum effect, including publication-quality output, without knowing anything about the R language itself. Figure 2 shows the result. JASP comes out on top here, with jamovi and BlueSky right behind.

Figure 2. The number of ease-of-use features that each GUI has.

Figure 3 shows the general usability features each GUI offers. This category is dominated by data-wrangling capabilities, where data scientists and statisticians spend most of their time. This category also includes various types of data input and output. BlueSky and R-Instat come out on top not just due to their excellent selection of data wrangling features but also due to their use of the rio package for importing and exporting files. The rio package combines the import/export capabilities of many other packages, and it is easy to use. I expect the other GUIs will eventually adopt it, raising their scores by around 40 points. JASP shows up at the bottom of this plot due to its philosophy of encouraging users to prepare the data elsewhere before importing it into JASP.

Figure 3. Number of general usability features for each GUI.

Figure 4 shows the number of graphics features offered by each GUI. R-Instat has a solid lead in this category. In fact, this underestimates R-Instat’s ability if you…

A Comparative Review of the R-Instat GUI for R

by Robert A. Muenchen

Introduction

R-Instat is a free and open source graphical user interface for the R software that focuses on people who want to point-and-click their way through data science analyses. Written in Visual Basic, it is currently only available for Microsoft Windows. However, a Linux version is in development using the cross-platform Mono implementation of the .NET framework.

This post is one of a series of reviews that aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Although I wrote the BlueSky User’s Guide, I hope to remain objective in these reviews. There is no one perfect user interface for everyone; each GUI for R has features that appeal to a different set of people.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The help desks at most universities are flooded with such calls at the beginning of each semester!

R-Instat is easy to install, requiring only a single step. It provides its own embedded copy of R. This simplifies the installation and ensures complete compatibility between R-Instat and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have R-Instat control any version of R you choose, but if the version differs too much, you may run into occasional problems.

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” that add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Rattle, Deducer) through medium (JASP 15) to high (jamovi 43, R Commander 43).

While the R-Instat project welcomes contributions from anyone, there are not any modules to add at this time. All of its capabilities are included in its initial installation.

Startup

Some user interfaces for R, such as jamovi or JASP, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and then finally call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start R-Instat directly by double-clicking its icon from your desktop or choosing it from your Start Menu (i.e., not from within R).

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

R-Instat starts up by showing its screen (Fig. 1). Under Start, I chose “New Data Frame” and it showed me the rather perplexing dialog shown in Fig. 2.

Figure 1. The R-Instat startup screen.

As an R user, I know what expressions are, but what did the R-Instat designers mean by the term?

Figure 2. The New Dataframe dialog.

Clicking the “Construct Examples” button brought up the suggestions shown in Fig. 3. These are standard R expressions, which came as quite a surprise! It seems that the R-Instat designers want people to start using R programming code immediately.

Figure 3. Examples R-Instat provides for expressions you can use to create a dataset.

Clicking the Help button brings up the advice, “the simplest option is Empty” (the developers say this will become the default in a future version). Clicking that button brings up a simple prompt for the number of rows and columns you would like to create. After that, you’re looking at a basic spreadsheet (Fig. 4) that easily lets you enter data. As you enter data, it determines if it is numeric or character. Scientific notation is accepted, but dates are saved as character variables. Logical values (TRUE, FALSE) are recognized as such and are stored appropriately.

Right-clicking on any column allows you to convert variables to be a factor, ordered factor, numeric, logical, or character. These changes are recorded as function calls to a custom “convert_column_to_type” function for reproducibility. Such interactive changes are not usually recorded by other R GUIs. Date/time conversion is not available on that menu, as that process is trickier. Those conversions are on the “Prepare> Column Date” menu item. Other things you can do from the right-click menu are: rename, duplicate, reorder, set levels/labels, sort, and filter/remove filter.

The class of each variable is indicated by a character code that follows each variable name in parentheses: (C) for character, (F) for factor, (O.F) for ordered factor, (D) for date, (L) for logical. When no code follows a variable name, it is numeric.
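
The behavior described above (numbers and scientific notation become numeric, TRUE/FALSE become logical, everything else including dates stays character) can be mimicked with a rough Python sketch. This is not R-Instat’s actual code; the rules and the function names infer_class and column_label are assumptions made purely for illustration.

```python
# Display codes as described in the text; numeric columns show no code.
CLASS_CODES = {
    "character": "(C)",
    "factor": "(F)",
    "ordered factor": "(O.F)",
    "date": "(D)",
    "logical": "(L)",
    "numeric": "",
}


def infer_class(values):
    """Guess a column class from typed-in strings (hypothetical rules)."""
    if all(v in ("TRUE", "FALSE") for v in values):
        return "logical"
    try:
        for v in values:
            float(v)  # accepts plain numbers and scientific notation
        return "numeric"
    except ValueError:
        # Anything else, including typed-in dates, stays character.
        return "character"


def column_label(name, values):
    """Build the column header: the name followed by its class code, if any."""
    code = CLASS_CODES[infer_class(values)]
    return f"{name} {code}".strip()


print(column_label("Age", ["23", "1.5e3"]))
print(column_label("Passed", ["TRUE", "FALSE"]))
print(column_label("Visit", ["2022-10-05", "2022-11-01"]))
```

Factor, ordered factor, and date never come out of this guess, which matches the text: you reach those classes only through the right-click or “Prepare” menus.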

Figure 4. The R-Instat Data View (left) and Output Window (right).

The name of the dataset appears on a tab at the bottom of the Data View window. This lets you easily manage multiple datasets, an ability that is popular among professionals, but which is rarely offered in R GUIs (BlueSky and R Commander are the only others that offer it).

Once the dataset is saved, you can choose “Prepare > Data Frame > Insert rows/columns” to add new rows or columns at any position in the data frame. New columns can be added with a specified default value, which can be a big time-saver when entering blocks of related data.

There is a quicker method that works for inserting new rows. You right-click the row numbers and a pop-up menu will allow you to insert rows above or below, and the number of rows selected is the number of rows added – like in Excel.

When editing data, R-Instat lets you type new values on top of the old. As soon as you press the Enter key, it generates R code to execute the change. For example, in a language variable, when changing the value “English” to “Spanish,” it wrote,

Replace Value in Data
data_book$replace_value_in_data(data_name="wakefield", col_name="Language", rows="78", new_value="Spanish")

This is important for reproducibility, yet R-Instat is the only GUI reviewed here that tracks such manual changes. In fact, even among expensive proprietary software, Stata is the only one that I’m aware of that keeps track of such changes using code.

If you have another data set to enter, you can restart the process by choosing “File> New Data…” again. You can switch to another data set simply by clicking on its tab, and its window will pop to the front for you to see. When doing analyses or saving data, the data set that is displayed in the editor does not influence what appears in dialog boxes. That means that you can be looking at one dataset while analyzing another! Since each dialog allows you to choose the dataset to use, that is technically not a problem, but if you have several datasets that contain the same variable names, remember that what you see may not be what you get! That’s the opposite of BlueSky Statistics, which automatically analyzes the dataset you see. R-Instat’s ability to work with multiple datasets in a single instance of the software is not a feature found in all R GUIs. For example, jamovi and JASP can only work with a single dataset at a time.

Saving the data is done with a fairly standard “File> Save As> Save Dataset As” menu. By default it will save all open datasets, filters, graphs, and models to a single file called a “data book.” That makes complex projects much easier to open and close.

Data Import

R-Instat supports the following file formats, most of which are automatically opened using “File> Import from File”. The ODK and NetCDF file formats have their own Import menus. R-Instat’s ability to open many formats related to climate science hints at what the software excels at. For details, see the Analysis Methods section below.

  1. Comma Separated Values (.csv)
  2. Plain text files (.txt)
  3. Excel (old and new xls file types)
  4. xBASE database files (dBase, etc.)
  5. SPSS (.sav)
  6. SAS binary files (sas7bdat and *.xpt)
  7. Standard R workspace files (RData, but it just opens one dataframe of its choosing)
  8. Open Data Kit (ODK)
  9. OpenRefine
  10. Network Common Data Form (NetCDF)
  11. SST Sea Surface Temperature formatted files
  12. IRI Data Library (API download)
  13. Climate Data Store (CDS) (API download)
  14. Shapefile
  15. Climsoft (Climatic database)
  16. .dly (ASCII files)
  17. .dat (ASCII files)
  18. Tab Separated Values (.tsv)
  19. Stata (.dta)
  20. JSON (.json)
  21. epiinfo (.rec)
  22. Minitab (.mtb)
  23. Systat (.syd)
  24. CSV with a YAML metadata header (.csvy)
  25. Feather R/Python interchange format (.feather)
  26. Pipe separated files (.psv)
  27. YAML (.yml)
  28. Weka Attribute-Relation File Format (.arff)
  29. Data Interchange Format (.dif)
  30. OpenDocument Spreadsheet (*.ods)
  31. Shallow XML documents (*.xml)
  32. Single-table HTML documents (*.html)

BlueSky Statistics Intro and User Guides Now Available

BlueSky Statistics is an easy-to-use menu system that uses the R language to do all its work. My detailed review of BlueSky is available here, and a brief comparison of the various menu systems for R is here. I’ve just released the BlueSky Statistics 7.1 User Guide in printed form on the world’s largest independent bookstore, Lulu.com. A description and detailed table of contents are available here.

Cover design by Kiran Rafiq.

I’ve also released the BlueSky Statistics 7.1 Intro Guide. It is a complete subset of the User Guide, and you can download it for free here (if you have trouble downloading it, your company may have security blocking Microsoft OneDrive; try it at home). Its description and table of contents are here, and soon you will also be able to purchase a printed copy of it from Lulu.com.

Cover design by Kiran Rafiq.

I’m enthusiastic about getting feedback on these books. If you have comments or suggestions, please send them to me at muenchen.bob at gmail dot com.

Other books that feature BlueSky Statistics include:
Introduction to Biomedical Data Science
Applying the Rasch Model in Social Sciences Using R
Data Preparation and Exploration, Applied to Healthcare Data

Publishing with Lulu.com has been a very pleasant experience. They put the author in complete control, making one responsible for every detail: the contents, obtaining reviewers, and creating a cover file that includes the front, back, and spine of the book to match the dimensions of the book (e.g., more pages means a wider spine). Advertising is left up to the writer as well, hence this blog post! If you are thinking about writing a book, I highly recommend both Lulu.com and getting a cover design from 99designs.com. The latter let me run a contest in which a dozen artists submitted several ideas each. Their built-in survey system let me ask many colleagues for their opinions to help me decide. Altogether, it was a very interesting experience.

To follow the progress of these and other R related books, subscribe to my blog, or follow me on Twitter.

R GUI Update: BlueSky User’s Guide, New Features

The BlueSky Statistics graphical user interface (GUI) for the R language has added quite a few new features (described below). I’m also working on a BlueSky User Guide, a draft of which you can read about and download here. [Update: don’t download that, get the full Intro Guide download instead.] Although I’m spending a lot of time on BlueSky, I still plan to be as obsessive as ever about reviewing all (or nearly all) of the R GUIs, which is summarized here.

The new data management features in BlueSky are:

  • Date Order Check – lets you quickly check across the dates stored in many variables, and it reports if it finds any rows whose dates are not always increasing from left to right.
  • Find Duplicates – generates a report of duplicates and saves a copy of the data set from which the duplicates are removed. Duplicates can be based on all variables, or a set of just ID variables.
  • Select First/Last Observation per Group – finding the first or last observation in a group can create new datasets from the “best” or “worst” case in each group, find the most current record, and so on.
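
To show the logic behind the first two features in the list above, here is a rough Python sketch. It is not BlueSky’s implementation: the function names, the sample records, and the choice to treat equal dates as “not increasing” are all assumptions made for illustration.

```python
from datetime import date


def rows_with_dates_out_of_order(rows, date_columns):
    """Return indexes of rows whose dates do not strictly increase left to right."""
    bad = []
    for index, row in enumerate(rows):
        dates = [row[col] for col in date_columns]
        if any(later <= earlier for earlier, later in zip(dates, dates[1:])):
            bad.append(index)
    return bad


def split_duplicates(rows, id_columns):
    """Separate rows into first occurrences and later duplicates, based on ID columns."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(row[col] for col in id_columns)
        (duplicates if key in seen else unique).append(row)
        seen.add(key)
    return unique, duplicates


# Made-up records for illustration only.
rows = [
    {"id": 1, "admitted": date(2022, 1, 3), "treated": date(2022, 1, 4), "discharged": date(2022, 1, 9)},
    {"id": 2, "admitted": date(2022, 2, 10), "treated": date(2022, 2, 8), "discharged": date(2022, 2, 12)},
    {"id": 1, "admitted": date(2022, 3, 1), "treated": date(2022, 3, 2), "discharged": date(2022, 3, 5)},
]

print(rows_with_dates_out_of_order(rows, ["admitted", "treated", "discharged"]))
unique, duplicates = split_duplicates(rows, ["id"])
print(len(unique), len(duplicates))
```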

Model Fitting / Tuning

One of the more interesting features in BlueSky is its offering of what they call Model Fitting and Model Tuning. Model Fitting gives you direct control over the R function that does the work. That provides precise control over every setting, and it can teach you the code that the menus create, but it also means that model tuning is up to you to do. However, it does standardize scoring so that you do not have to keep up with the wide range of parameters that each of those functions needs for scoring. Model Tuning controls models through the caret package, which lets you do things like K-fold cross-validation and model tuning. However, it does not allow control over every model setting.
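
For readers new to K-fold cross-validation, the idea is to split the rows into K groups, then hold out each group in turn for testing while training on the rest. Here is a minimal Python sketch of the splitting step only. It is not how caret or BlueSky implements it, the function name is made up, and real tools usually shuffle the rows first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) row-index lists for K-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row lands in exactly one test fold, so every observation is used for testing once and for training K - 1 times.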

New Model Fitting menu items are:

  • Cox Proportional Hazards Model: Cox Single Model
  • Cox Multiple Models
  • Cox with Formula
  • Cox Stratified Model
  • Extreme Gradient Boosting
  • KNN
  • Mixed Models
  • Neural Nets: Multi-layer Perceptron
  • NeuralNets (i.e. the package of that name)
  • Quantile Regression

There are so many Model Tuning entries that it’s easier to just paste in the list from the main BlueSky review, which I updated earlier this morning:

  • Model Tuning: Adaboost Classification Trees
  • Model Tuning: Bagged Logic Regression
  • Model Tuning: Bayesian Ridge Regression
  • Model Tuning: Boosted trees: gbm
  • Model Tuning: Boosted trees: xgbtree
  • Model Tuning: Boosted trees: C5.0
  • Model Tuning: Bootstrap Resample
  • Model Tuning: Decision trees: C5.0tree
  • Model Tuning: Decision trees: ctree
  • Model Tuning: Decision trees: rpart (CART)
  • Model Tuning: K-fold Cross-Validation
  • Model Tuning: K Nearest Neighbors
  • Model Tuning: Leave One Out Cross-Validation
  • Model Tuning: Linear Regression: lm
  • Model Tuning: Linear Regression: lmStepAIC
  • Model Tuning: Logistic Regression: glm
  • Model Tuning: Logistic Regression: glmnet
  • Model Tuning: Multi-variate Adaptive Regression Splines (MARS via earth package)
  • Model Tuning: Naive Bayes
  • Model Tuning: Neural Network: nnet
  • Model Tuning: Neural Network: neuralnet
  • Model Tuning: Neural Network: dnn (Deep Neural Net)
  • Model Tuning: Neural Network: rbf
  • Model Tuning: Neural Network: mlp
  • Model Tuning: Random Forest: rf
  • Model Tuning: Random Forest: cforest (uses ctree algorithm)
  • Model Tuning: Random Forest: ranger
  • Model Tuning: Repeated K-fold Cross-Validation
  • Model Tuning: Robust Linear Regression: rlm
  • Model Tuning: Support Vector Machines: svmLinear
  • Model Tuning: Support Vector Machines: svmRadial
  • Model Tuning: Support Vector Machines: svmPoly

You can download the free open-source version from https://BlueSkyStatistics.com.

Biomedical Data Science Textbook Available

By Bob Hoyt & Bob Muenchen

Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.

For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.

The book is available in print and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be possible through traditional publishers.

Below you will find a detailed table of contents and a list of the textbook authors.

Table of Contents

OVERVIEW OF BIOMEDICAL DATA SCIENCE

  1. Introduction
  2. Background and history
  3. Conflicting perspectives
    1. the statistician’s perspective
    2. the machine learner’s perspective
    3. the database administrator’s perspective
    4. the data visualizer’s perspective
  4. Data analytical processes
    1. raw data
    2. data pre-processing
    3. exploratory data analysis (EDA)
    4. predictive modeling approaches
    5. types of models
    6. types of software
  5. Major types of analytics
    1. descriptive analytics
    2. diagnostic analytics
    3. predictive analytics (modeling)
    4. prescriptive analytics
    5. putting it all together
  6. Biomedical data science tools
  7. Biomedical data science education
  8. Biomedical data science careers
  9. Importance of soft skills in data science
  10. Biomedical data science resources
  11. Biomedical data science challenges
  12. Future trends
  13. Conclusion
  14. References

SPREADSHEET TOOLS AND TIPS

  1. Introduction
    1. basic spreadsheet functions
    2. download the sample spreadsheet
  2. Navigating the worksheet
  3. Clinical application of spreadsheets
    1. formulas and functions
    2. filter
    3. sorting data
    4. freezing panes
    5. conditional formatting
    6. pivot tables
    7. visualization
    8. data analysis
  4. Tips and tricks
    1. Microsoft Excel shortcuts – Windows users
    2. Google Sheets tips and tricks
  5. Conclusions
  6. Exercises
  7. References

BIOSTATISTICS PRIMER

  1. Introduction
  2. Measures of central tendency & dispersion
    1. the normal and log-normal distributions
  3. Descriptive and inferential statistics
  4. Categorical data analysis
  5. Diagnostic tests
  6. Bayes’ theorem
  7. Types of research studies
    1. observational studies
    2. interventional studies
    3. meta-analysis
    4. correlation
  8. Linear regression
  9. Comparing two groups
    1. the independent-samples t-test
    2. the Wilcoxon-Mann-Whitney test
  10. Comparing more than two groups
  11. Other types of tests
    1. generalized tests
    2. exact or permutation tests
    3. bootstrap or resampling tests
  12. Stats packages and online calculators
    1. commercial packages
    2. non-commercial or open source packages
    3. online calculators
  13. Challenges
  14. Future trends
  15. Conclusion
  16. Exercises
  17. References

DATA VISUALIZATION

  1. Introduction
    1. historical data visualizations
    2. visualization frameworks
  2. Visualization basics
  3. Data visualization software
    1. Microsoft Excel
    2. Google Sheets
    3. Tableau
    4. R programming language
    5. other visualization programs
  4. Visualization options
    1. visualizing categorical data
    2. visualizing continuous data
  5. Dashboards
  6. Geographic maps
  7. Challenges
  8. Conclusion
  9. Exercises
  10. References

INTRODUCTION TO DATABASES

  1. Introduction
  2. Definitions
  3. A brief history of database models
    1. hierarchical model
    2. network model
    3. relational model
  4. Relational database structure
  5. Clinical data warehouses (CDWs)
  6. Structured query language (SQL)
  7. Learning SQL
  8. Conclusion
  9. Exercises
  10. References

BIG DATA

  1. Introduction
  2. The seven v’s of big data related to health care data
  3. Technical background
  4. Application
  5. Challenges
    1. technical
    2. organizational
    3. legal
    4. translational
  6. Future trends
  7. Conclusion
  8. References

BIOINFORMATICS AND PRECISION MEDICINE

  1. Introduction
  2. History
  3. Definitions
  4. Biological data analysis – from data to discovery
  5. Biological data types
    1. genomics
    2. transcriptomics
    3. proteomics
    4. bioinformatics data in public repositories
    5. biomedical cancer data portals
  6. Tools for analyzing bioinformatics data
    1. command line tools
    2. web-based tools
  7. Genomic data analysis
  8. Genomic data analysis workflow
    1. variant calling pipeline for whole exome sequencing data
    2. quality check
    3. alignment
    4. variant calling
    5. variant filtering and annotation
    6. downstream analysis
    7. reporting and visualization
  9. Precision medicine – from big data to patient care
  10. Examples of precision medicine
  11. Challenges
  12. Future trends
  13. Useful resources
  14. Conclusion
  15. Exercises
  16. References

PROGRAMMING LANGUAGES FOR DATA ANALYSIS

  1. Introduction
  2. History
  3. R language
    1. installing R & RStudio
    2. an example R program
    3. getting help in R
    4. user interfaces for R
    5. R’s default user interface: RGui
    6. RStudio
    7. menu & dialog GUIs
    8. some popular R GUIs
    9. R graphical user interface comparison
    10. R resources
  4. Python language
    1. installing Python
    2. an example Python program
    3. getting help in Python
    4. user interfaces for Python
  5. Reproducibility
  6. R vs. Python
  7. Future trends
  8. Conclusion
  9. Exercises
  10. References

MACHINE LEARNING

  1. Brief history
  2. Introduction
    1. data refresher
    2. training vs test data
    3. bias and variance
    4. supervised and unsupervised learning
  3. Common machine learning algorithms
  4. Supervised learning
  5. Unsupervised learning
    1. dimensionality reduction
    2. reinforcement learning
    3. semi-supervised learning
  6. Evaluation of predictive analytical performance
    1. classification model evaluation
    2. regression model evaluation
  7. Machine learning software
    1. Weka
    2. Orange
    3. RapidMiner Studio
    4. KNIME
    5. Google TensorFlow
    6. honorable mention
    7. summary
  8. Programming languages and machine learning
  9. Machine learning challenges
  10. Machine learning examples
    1. example 1 classification
    2. example 2 regression
    3. example 3 clustering
    4. example 4 association rules
  11. Conclusion
  12. Exercises
  13. References

ARTIFICIAL INTELLIGENCE

  1. Introduction
    1. definitions
  2. History
  3. AI architectures
  4. Deep learning
  5. Image analysis (computer vision)
    1. Radiology
    2. Ophthalmology
    3. Dermatology
    4. Pathology
    5. Cardiology
    6. Neurology
    7. Wearable devices
    8. Image libraries and packages
  6. Natural language processing
    1. NLP libraries and packages
    2. Text mining and medicine
    3. Speech recognition
  7. Electronic health record data and AI
  8. Genomic analysis
  9. AI platforms
    1. deep learning platforms and programs
  10. Artificial intelligence challenges
    1. General
    2. Data issues
    3. Technical
    4. Socio economic and legal
    5. Regulatory
    6. Adverse unintended consequences
    7. Need for more ML and AI education
  11. Future trends
  12. Conclusion
  13. Exercises
  14. References

Authors

Brenda Griffith
Technical Writer
Data.World
Austin, TX

Robert Hoyt MD, FACP, ABPM-CI, FAMIA
Associate Clinical Professor
Department of Internal Medicine
Virginia Commonwealth University
Richmond, VA

David Hurwitz MD, FACP, ABPM-CI
Associate CMIO
Allscripts Healthcare Solutions
Chicago, IL

Madhurima Kaushal MS
Bioinformatics
Washington University at St. Louis, School of Medicine
St. Louis, MO

Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA
Assistant Professor
New York Medical College
Department of Emergency Medicine
Valhalla, NY

Karen A. Monsen PhD, RN, FAMIA, FAAN
Professor
School of Nursing
University of Minnesota
Minneapolis, MN

Robert Muenchen MS, PSTAT
Manager, Research Computing Support
University of Tennessee
Knoxville, TN

Dallas Snider PhD
Chair, Department of Information Technology
University of West Florida
Pensacola, FL

A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.

SAS Language Clone Now Free for Commercial Use

The WPS Analytics’ version of the SAS language is now available in a Community Edition. This edition allows you to run SAS code on datasets of any size for free. Purchasing a commercial license will get you tech support and the ability to run it from the command line, instead of just interactively. The software license details are listed in this table.

While the WPS version of the SAS language doesn’t do everything the version from SAS Institute offers, it does do quite a lot. The complete list of features is available here.

Back in 2009, the SAS Institute filed a lawsuit against the creators of WPS Analytics, World Programming Limited (WPL), in the High Court of England and Wales. SAS Institute lost the case on the grounds that copyright law applies to software source code, not to its functionality. WPL never had access to SAS Institute’s source code, but they did use a SAS educational license to study how it works. SAS Institute lost another software copyright battle in North Carolina courts, but won over the use of their educational license. SAS Institute is suing a third time, hoping to do better by carefully choosing a pro-patent court in East Texas.

Although I prefer using R, I’m a big fan of the SAS language, as well as SAS Institute, which offers superb technical support. However, I agree with the first two court findings. Copyright law should not apply to a computer language, only to a particular set of source code that creates the language.

A Comparative Review of the R AnalyticFlow GUI for R

Introduction

R AnalyticFlow (RAF) is a free and open source graphical user interface (GUI) for the R language that focuses on beginners looking to point-and-click their way through analyses.  What sets it apart from the other half-dozen GUIs for R is that it uses a flowchart-like workflow diagram to control the analysis instead of only menus. In my first programming class back in the Pleistocene Era, my professor told us to never begin a program without doing a flowchart of what you were trying to accomplish. With workflow tools, you get the benefit of the diagram outlining the big picture, while the dialog box settings in each node control what happens at each step. In Figure 1 you can get a good idea of what is happening without any further information.

Another advantage you get with most workflow tools is the ability to reuse workflows very easily because the dataset is read in only once at the beginning. Unfortunately, most of that advantage is missing from RAF since you must specify which dataset is used in every node. The downside to workflow tools is that they’re slightly harder to learn than menu-based systems. This involves learning how to draw a diagram, what flows through it (e.g. datasets, models), and how to generate a single comprehensive report for the entire analysis.

This post is one of a series of comparative reviews which aim to help non-programmers choose the GUI that is best for them. The reviews all follow a standard template to make comparisons across products easier. These reviews also include a cursory description of the programming support that each GUI offers.

Figure 1. An example workflow from R AnalyticFlow.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics, jamovi, and RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The Help Desks at most universities are flooded with such calls at the beginning of each semester!

RAF is available for Windows, Mac, and Linux. Its installation takes four steps:

  1. Install Java, if you don’t already have it installed. This can be tricky as you must match the type of Java to the type of R you use. Most computers these days have 64-bit operating systems. Whether 32-bit or 64-bit, you must use the same “bitness” on all of these steps, or it will not work.
  2. Next, install R if you haven’t already (available here).
  3. Install RAF itself after downloading it from here.
  4. Start RAF. It will prompt you to install some R packages, notably rJava. This step requires Internet access. To install if you don’t have such access, see the RAF website’s About R Packages section for important details on how to proceed (from another machine that does have Internet access, of course).

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

RAF does not offer any plug-in modules, though its developers do provide instruction on how you can create your own.

Startup

Some user interfaces for R, such as BlueSky and jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and then call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start RAF directly by double-clicking its icon from your desktop or choosing it from your Start Menu (i.e. not from within R itself). On my system, I had to right-click the icon and choose, “Run as Administrator” or I would get the message, “Failed to Launch R. Confirm Settings?” If I responded “Yes”, it showed the path to my installation of R, which was already correct. I tried a second computer and it did start, but when it tried to install the JavaGD and rJava packages, it said, “Warning in install.packages (c(“JavaGD”,”rJava”)) : ‘lib = “C:/Program Files/R/R-3.6.1/library” ‘ is not writable. Would you like to use a personal library instead?”

Upon startup, it displays its startup screen, shown in Figure 2. Quick Start puts you into the software with a new Flow window open. New Project starts a new workflow, and Bookmarks give you quick access to existing workflows.

Figure 2. R AnalyticFlow’s Startup Screen.

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

To start entering data, choose “Input> Enter Data” and drag the selection onto the workflow editor window. An empty spreadsheet will appear (Figure 3). You can enter variable names on the first line if you check the “Header: Use 1st Row” box at the bottom of the window. This is the first hint you’ll see that RAF leans on R terminology that can be somewhat esoteric. RAF’s developers could have labeled this choice as “Column Names” but went with the R terminology of “Header” instead. This approach may be confusing for beginners, but if their goal is to learn R, it will help in the long run.

To enter factors (R’s categorical variables), choose the “Options” tab and check “Convert Characters to Factors”; RAF will then convert the character string variables you enter to factors. Otherwise, it will leave them as characters. Dates remain stored as characters; you have to use the “Processing> Set Data Type” node to change them, and they must be entered in the form yyyy-mm-dd.

Figure 3. R AnalyticFlow’s data entry screen.

There is no limit to the number of rows and columns you can enter initially. However, once you choose “Run”, the data frame is created and can no longer be edited!

Saving the workflow is done with the standard “File > Save As” menu. You must save each one to its own file. To save the flow and the various objects that it uses such as data frames and models, use “Project > Export”. When receiving a project from a colleague, use “Project> Import” to begin using it.

Data Import

To analyze data, you must first read it. While many R GUIs can import a wide range of data formats such as files created by other statistics programs and databases, RAF can import only text and R objects.

RAF’s text import feature is well done. Once you select an Input File, it quickly scans the file and figures out whether variable names are present, which delimiter separates the columns, and so on. It then displays a “preview” (Figure 4, bottom). It does this quickly since its preview covers only the first 100 rows of data. If the preview displays errors, you manually change the settings and check the preview until it’s correct. When the preview looks good, click “Run” and RAF will read all the data.

Figure 4. The Read Text File window.
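
The scan-then-preview idea can be illustrated outside of RAF. The sketch below is not RAF's implementation; it is a minimal Python example that assumes the standard library's csv.Sniffer is a reasonable stand-in for that kind of detection. The function name preview_text and the sample data are hypothetical. It looks only at the first 100 lines to guess the delimiter and whether a header row is present, then parses just those lines.

```python
import csv
import io

def preview_text(text, max_rows=100):
    """Guess the delimiter and header from the first max_rows lines, then parse them."""
    # Only the first max_rows lines are examined, which keeps the preview fast.
    lines = text.splitlines(keepends=True)[:max_rows]
    sample = "".join(lines)
    sniffer = csv.Sniffer()
    # Restrict the guess to a few common delimiters (an assumption of this sketch).
    dialect = sniffer.sniff(sample, delimiters=",;\t|")
    has_header = sniffer.has_header(sample)
    rows = list(csv.reader(io.StringIO(sample), dialect))
    return dialect.delimiter, has_header, rows

# Hypothetical semicolon-delimited file contents.
sample_text = (
    "workshop;gender;q1\n"
    "R;f;4\n"
    "SAS;m;3\n"
    "R;m;5\n"
)
delimiter, has_header, rows = preview_text(sample_text)
print(delimiter, has_header, rows[:2])
```

As in RAF, a guess like this can be wrong, which is why a preview that you can check and correct by hand is valuable before reading the full file.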

Data Export

The ability to export data to a wide range of file types helps when you, or other members of your research team, have to use multiple tools to complete a task. Unfortunately, this is a very weak area for R GUIs. Deducer offers no data export at all, and R Commander and rattle can export only delimited text files (an earlier version of this listed jamovi as having very limited data export; that has now been expanded). Only BlueSky offers a fairly comprehensive set of export options. Unfortunately, RAF falls into the former group, being able to export data only as text and R object files.

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, and biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard, handle only a few of these functions. Others, such as BlueSky and the R Commander, can handle many, but not all, of them.
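
To make "transforming many variables at once" concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how any of the GUIs mentioned do it; the function transform_columns, the "_log" suffix, and the measurements data are all hypothetical.

```python
import math

def transform_columns(rows, columns, func, suffix="_log"):
    """Return new rows with func applied to every named column.

    Each result is stored under the original column name plus suffix,
    so the untransformed values are kept.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            new_row[col + suffix] = func(row[col])
        out.append(new_row)
    return out

# Hypothetical biological measurements.
measurements = [
    {"id": 1, "mass": 10.0, "length": 100.0, "width": 1.0},
    {"id": 2, "mass": 1000.0, "length": 10.0, "width": 100.0},
]
# One call takes the base-10 logarithm of three variables.
logged = transform_columns(measurements, ["mass", "length", "width"], math.log10)
print(logged[0])
```

The point is that the list of variables is an argument, so adding a fourth or a fortieth variable does not add any extra steps.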

RAF handles a fairly basic set of data management tools:

  1. Add/Edit Columns
  2. Rename – Variables in a data frame
  3. Set Data Type
  4. Select Rows
  5. Select Columns
  6. Missing Values – Sets values as missing, no imputation
  7. Sort
  8. Sampling
  9. Aggregate
  10. Merge – Various joins
  11. Merge – Adds rows
  12. Manage Objects (copies, deletes, renames)

Workflows, Menus & Dialog Boxes

The goal of pointing & clicking your way through an analysis is to save time by recognizing dialog box settings rather than performing the more difficult task of recalling  programming commands. Some GUIs, such as BlueSky and jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.

RAF uses a unique interface. There are two ways to build a workflow that guides your analysis. First, you can click on a toolbar icon, which drops down a menu. Click on a selection, and – without releasing the mouse button – drag your selection onto the flow window. In that case, the dialog box with its options opens below the flow area (Figure 3, bottom right).

The second way to use it is to click on a toolbar icon, drop down its menu, click on a selection and immediately release the mouse button. This causes the dialog box to appear floating in the middle of the screen (not shown). When you finish choosing your settings, there is a “Drag to Add” button at the top of the dialog. Clicking that button causes the dialog box to collapse into an icon which you can then drag onto the workflow surface.

Regardless of which method you choose, if you drop the new icon onto the top of one that is already in the workflow, it will move the new icon to the right and draw an arrow (called an “edge”) connecting the older one to the new. If you don’t drop it onto an icon that’s already in your workflow, you can add a connecting arrow later by clicking on the first icon, then choose “Draw Edge” and an arrow will appear aimed to the right (workflows go mostly left to right). The arrow will float around as you move your mouse, until you click on the second icon. A third way to connect the nodes in a flow is to click one icon, hold the Alt key down, then drag to the second icon.

Figure 3 shows the entire RAF window. On the top right is the workflow. Here are the steps I followed to create it:

  1. I chose “Input> Read Text File” and dragged it onto the workflow. The icon’s settings appeared in the bottom right window.
  2. I filled in the dialog box’s settings, then clicked “Run”. It named the icon after the file mydata.csv and a spreadsheet appeared in the upper-right.
  3. I chose “Statistics> Cross Tabulation”, and dragged its icon onto the data icon.
  4. I clicked the downward-facing arrow in the “Group By” box, and chose the variables. The first one I chose (workshop) formed the rows and the second (gender) formed the columns. Unlike most GUIs, there’s no indication of row and column roles.
  5. I clicked “Run Node” at the top of the cross tabulation dialog box. The cross tabulation output appeared in the upper left window (right half). The code that RAF wrote to perform the task appears in the R Console window in the lower left.

You can run an entire flow by clicking “Run Flow” at the top left of the Flow window. While describing the process of building a workflow is tedious, learning to build one is quite easy.

Figure 3. The entire R AnalyticFlow window, with Cross Tabulation highlighted. In the top row are the viewer window (left) and flow window (right). In the bottom row are the R console (left) and the dialog box for the chosen icon (right). The Cross Tabulation icon is selected, so its dialog box is shown.

The goal of using a GUI is to make analysis easy, so GUI dialog boxes are usually quite simple to use and include everything that’s relevant within a single box. I looked at all the options in this dialog but could not find one to do a very common test for such a cross-tabulation table: the chi-squared test. RAF uses an aspect of R objects that ends up essentially creating two different types of dialog boxes in separate parts of its interface. R objects contain multiple bits of output. You can display them using generic R functions such as summary() and print(). The output window has radio buttons for those functions (Figure 3, right above the cross-tabulation table). Clicking the “summary” button will call R’s summary() function to display the chi-squared results where the table is currently shown. To study the pattern in the table and the chi-squared results requires clicking back and forth on Table and summary; you can’t get them to both appear on your screen at the same time.

Correlations provide another example. The statistics are shown, but their p-values are not shown until you click on the “summary” button. This approach is confusing for beginners, but good for people wishing to learn R.
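
For readers curious about what sits behind the Table and summary views, the following Python sketch builds a cross tabulation and computes the Pearson chi-squared statistic for it. This is the textbook calculation without a continuity correction, written as an illustration only; it is not the code RAF generates, and the function names and the small workshop-by-gender dataset are hypothetical.

```python
from collections import Counter

def cross_tab(rows, row_var, col_var):
    """Count observations for each combination of row_var and col_var."""
    counts = Counter((r[row_var], r[col_var]) for r in rows)
    row_levels = sorted({key[0] for key in counts})
    col_levels = sorted({key[1] for key in counts})
    table = [[counts[(rl, cl)] for cl in col_levels] for rl in row_levels]
    return row_levels, col_levels, table

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical data: the first variable forms the rows, the second the columns.
data = (
    [{"workshop": "R", "gender": "f"}] * 3
    + [{"workshop": "R", "gender": "m"}] * 1
    + [{"workshop": "SAS", "gender": "f"}] * 1
    + [{"workshop": "SAS", "gender": "m"}] * 3
)
row_levels, col_levels, table = cross_tab(data, "workshop", "gender")
stat, df = chi_squared(table)
print(row_levels, col_levels, table, stat, df)
```

Note that the table and the test come from the same object, which is why a single dialog could in principle show both at once.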

A common data analysis task is repeating the same analysis across many variables. For example, you might want to repeat the above cross tabulation (or t-tests, etc.) on many variables at once. This is usually quite easy to accomplish in most GUIs, but not in RAF. Since R’s functions may not offer that ability without using R’s “apply” family of functions (or loops), and RAF does not support such functions, such simple tasks become quite a lot of work when using RAF. You need to add a node to your flow for each and every variable!
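
The kind of repetition that loops or apply-style functions provide looks like the Python sketch below. It is a hypothetical illustration, not RAF or R code: one counting function is applied to a list of survey items, so the whole set of tables comes from a single expression instead of one node per variable. The function count_pairs and the survey data are made up for the example.

```python
from collections import Counter

def count_pairs(rows, row_var, col_var):
    """Cross tabulate two variables as a Counter keyed by (row level, column level)."""
    return Counter((r[row_var], r[col_var]) for r in rows)

# Hypothetical survey responses.
survey = [
    {"gender": "f", "q1": "agree", "q2": "agree", "q3": "disagree"},
    {"gender": "f", "q1": "agree", "q2": "disagree", "q3": "disagree"},
    {"gender": "m", "q1": "disagree", "q2": "agree", "q3": "disagree"},
    {"gender": "m", "q1": "agree", "q2": "agree", "q3": "agree"},
]

# Repeat the same analysis for every item; adding items only lengthens this list.
items = ["q1", "q2", "q3"]
tables = {item: count_pairs(survey, "gender", item) for item in items}

for item, table in tables.items():
    print(item, dict(table))
```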

Each dialog box has an “Advanced” tab which allows you to enter the name of any R argument(s) in one column, and any value(s) you would like to pass to that argument in another. That’s a nice way to offer graphical control over common tasks, while assuring that every task a function is capable of is still available.

In a complex analysis, workflows can become quite complex and hard to read. A solution to this problem is the concept of a “metanode”. Metanodes allow you to take an entire section of your workflow and collapse it into what appears to be a single node. For example, you might commonly use eight nodes to prepare a dataset for analysis. You could combine all eight into a new node you call “Data Prep”, greatly simplifying the workflow. Unfortunately, RAF does not offer metanodes, unlike other workflow-driven data science tools such as KNIME and RapidMiner.

One of the most surprising aspects of RAF’s workflow style is that every node specifies its input and output objects. That means that you can run any analysis with no connecting arrows in your diagram! Rather than being required, as they are in many workflow-based tools, the arrows in RAF offer only the convenience of re-running an entire flow at once.
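
One way to picture this design is the Python sketch below. It is a conceptual model only, under the assumption that each node simply reads and writes named objects in a shared workspace; the Node class and the object names are hypothetical and do not reflect RAF's internals.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    """A workflow step that names the object it reads and the object it writes."""
    name: str
    input_name: str
    output_name: str
    func: Callable

    def run(self, workspace):
        # The node finds its input by name, so it needs no incoming arrow.
        workspace[self.output_name] = self.func(workspace[self.input_name])

# The shared workspace plays the role of the R session's objects.
workspace = {"mydata": [3, 1, 2]}
sort_node = Node("Sort", "mydata", "sorted_data", sorted)
total_node = Node("Total", "sorted_data", "total", sum)

# Each node can be run on its own, in the right order, with no connections.
sort_node.run(workspace)
total_node.run(workspace)

# Connecting the nodes only records an order so the whole flow can be re-run at once.
flow = [sort_node, total_node]
for node in flow:
    node.run(workspace)

print(workspace["sorted_data"], workspace["total"])
```

The trade-off is visible here: because the dataset name is stored inside every node, reusing the flow on new data means editing each node rather than swapping a single input.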

During GUI-driven analysis, the fact that R is doing the work is quite obvious as the code and any resulting messages appear in the Console window.

Documentation & Training

The only written documentation for RAF is the brief, but easy to follow, R AnalyticFlow 3 Starter Guide. Kamala Valarie has also done a 15-minute video on YouTube showing how to use RAF.

Help

R GUIs provide simple task-by-task dialog boxes that generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.

The level of help that RAF offers is only the built-in R help file for the particular function you’re using. However, I had problems with the help getting stuck and showing me the help file from previous tasks rather than the one I was currently using.

Graphics

The various GUIs available for R handle graphics in several ways. Some, such as R Commander and RKWard, focus on R’s built-in graphics. Others, such as BlueSky Statistics, use the popular ggplot2 package. Still others, such as jamovi, use their own functions and integrate them into analysis steps.

GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it. That’s how BlueSky and jamovi work.

RAF uses the very flexible lattice package for all of its graphics. That makes it particularly easy to display “small multiples” of the same plot repeated by levels of another variable or two. There does not appear to be any way to control the style of the plots.

New Versions of R GUIs: BlueSky, JASP, jamovi

It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which I repeat below. It shows the total number of features in each package, including the latest versions of BlueSky Statistics, JASP, and jamovi. The reviews which initially appeared as blog posts are now regularly-updated pages.

New Features in JASP

Let’s take a look at some of the new features, starting with the version of JASP that was released three hours ago:

  • Interface adjustments
    • Data panel, analysis input panel and results panel can be manipulated much more intuitively with sliders and show/hide buttons
    • Changed the analysis input panel to have an overview of all opened analyses and added the possibility to change titles, to show documentation, and remove analyses
  • Enhanced the navigation through the file menu; it is now possible to use arrow keys or simply hover over the buttons
  • Added possibility to scale the entire application with Ctrl +, Ctrl – and Ctrl 0
  • Added MANOVA
  • Added Confirmatory Factor Analysis
  • Added Bayesian Multinomial Test
  • Included additional menu preferences to customize JASP to your needs
  • Added/updated help files for most analyses
  • R engine updated from 3.4.4 to 3.5.2
  • Added Šidák correction for post-hoc tests (AN(C)OVA)
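
For readers unfamiliar with the Šidák correction mentioned in the last item, here is a minimal Python sketch of its standard textbook form. It is not taken from JASP's source code, and the function names are hypothetical. With m independent comparisons, the per-comparison significance level is 1 − (1 − α)^(1/m), and an individual p-value is adjusted to 1 − (1 − p)^m.

```python
def sidak_alpha(alpha, m):
    """Per-comparison significance level that keeps the familywise rate at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def sidak_adjust(p, m):
    """Sidak-adjusted p-value for one of m independent comparisons."""
    return 1.0 - (1.0 - p) ** m

# Three post-hoc comparisons at a familywise level of 0.05.
m = 3
per_test = sidak_alpha(0.05, m)
bonferroni = 0.05 / m
print(per_test, bonferroni)

# Adjusting a raw p-value equal to the per-test level recovers the familywise level.
print(sidak_adjust(per_test, m))
```

The Šidák threshold is slightly larger than the Bonferroni threshold of α/m, so it is a little less conservative when the comparisons are independent.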

A complete list of fixes and features is available here. JASP is available for free from their download page.  My comparative review of JASP is here.

New Features in jamovi

Two of the usability features added to jamovi recently are templates and multi-file input. Both are described in detail here.

Templates enable you to save all the steps in your work as a template file. Opening that file in jamovi then lets you open a new dataset and the template will recreate all the previous analyses and graphs using the new data. It provides reusability without having to depend on the R code that GUI users are trying to avoid using.

The multi-file input lets you select many CSV files at once and jamovi will open and stack them all (they must contain common variable names, of course).
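
The stacking step can be sketched in a few lines of Python. This is an illustration of the general idea, not jamovi's implementation. How jamovi treats columns that are not shared is not stated here, so keeping only the columns common to every file is an assumption of this sketch, as are the function name stack_csv and the in-memory sample files.

```python
import csv
import io

def stack_csv(files):
    """Stack rows from several open CSV files, keeping only the shared columns."""
    readers = [csv.DictReader(f) for f in files]
    # Column names present in every file, in the order used by the first file.
    common = [
        name for name in readers[0].fieldnames
        if all(name in reader.fieldnames for reader in readers[1:])
    ]
    if not common:
        raise ValueError("the files share no column names")
    stacked = []
    for reader in readers:
        for row in reader:
            stacked.append({name: row[name] for name in common})
    return common, stacked

# In-memory stand-ins for two CSV files with columns in different orders.
file_a = io.StringIO("id,score\n1,10\n2,20\n")
file_b = io.StringIO("score,id,extra\n30,3,x\n")

columns, rows = stack_csv([file_a, file_b])
print(columns)
print(rows)
```

Matching on column names rather than positions is what lets files with differently ordered columns be combined safely.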

Other new analytic features have been added with a set of modeling modules. They’re described in detail here, and a list of some of their capabilities is below. You can read my full review of jamovi here, and you can download it for free here.

  • OLS Regression (GLM)
  • OLS ANOVA (GLM)
  • OLS ANCOVA (GLM)
  • Random coefficients regression (Mixed)
  • Random coefficients ANOVA-ANCOVA (Mixed)
  • Logistic regression (GZLM)
  • Logistic ANOVA-like model (GZLM)
  • Probit regression (GZLM)
  • Probit ANOVA-like model (GZLM)
  • Multinomial regression (GZLM)
  • Multinomial ANOVA-like model (GZLM)
  • Poisson regression (GZLM)
  • Poisson ANOVA-like model (GZLM)
  • Overdispersed Poisson regression (GZLM)
  • Overdispersed Poisson ANOVA-like model (GZLM)
  • Negative binomial regression (GZLM)
  • Negative binomial ANOVA-like model (GZLM)
  • Continuous and categorical independent variables
  • Omnibus tests and parameter estimates
  • Confidence intervals
  • Simple slopes analysis
  • Simple effects
  • Post-hoc tests
  • Plots for up to three-way interactions for both categorical and continuous independent variables.
  • Automatic selection of best estimation methods and degrees of freedom selection
  • Type III estimation

New Features in BlueSky Statistics

The BlueSky developers have been working on adding psychometric methods (for a book that is due out soon) and support for distributions. My full review is here and you can download BlueSky Statistics for free here.

  • Model Fitting: IRT: Simple Rasch Model
  • Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
  • Model Fitting: IRT: Partial Credit Model
  • Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
  • Model Fitting: IRT: Rating Scale Model
  • Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
  • Model Statistics: IRT: ICC Plots
  • Model Statistics: IRT: Item Fit
  • Model Statistics: IRT: Plot PI Map
  • Model Statistics: IRT: Item and Test Information
  • Model Statistics: IRT: Likelihood Ratio and Beta plots
  • Model Statistics: IRT: Personfit
  • Distributions: Continuous: Beta Probabilities
  • Distributions: Continuous: Beta Quantiles
  • Distributions: Continuous: Plot Beta Distribution
  • Distributions: Continuous: Sample from Beta Distribution
  • Distributions: Continuous: Cauchy Probabilities
  • Distributions: Continuous: Plot Cauchy Distribution
  • Distributions: Continuous: Cauchy Quantiles
  • Distributions: Continuous: Sample from Cauchy Distribution
  • Distributions: Continuous: Chi-squared Probabilities
  • Distributions: Continuous: Chi-squared Quantiles
  • Distributions: Continuous: Plot Chi-squared Distribution
  • Distributions: Continuous: Sample from Chi-squared Distribution
  • Distributions: Continuous: Exponential Probabilities
  • Distributions: Continuous: Exponential Quantiles
  • Distributions: Continuous: Plot Exponential Distribution
  • Distributions: Continuous: Sample from Exponential Distribution
  • Distributions: Continuous: F Probabilities
  • Distributions: Continuous: F Quantiles
  • Distributions: Continuous: Plot F Distribution
  • Distributions: Continuous: Sample from F Distribution
  • Distributions: Continuous: Gamma Probabilities
  • Distributions: Continuous: Gamma Quantiles
  • Distributions: Continuous: Plot Gamma Distribution
  • Distributions: Continuous: Sample from Gamma Distribution
  • Distributions: Continuous: Gumbel Probabilities
  • Distributions: Continuous: Gumbel Quantiles
  • Distributions: Continuous: Plot Gumbel Distribution
  • Distributions: Continuous: Sample from Gumbel Distribution
  • Distributions: Continuous: Logistic Probabilities
  • Distributions: Continuous: Logistic Quantiles
  • Distributions: Continuous: Plot Logistic Distribution
  • Distributions: Continuous: Sample from Logistic Distribution
  • Distributions: Continuous: Lognormal Probabilities
  • Distributions: Continuous: Lognormal Quantiles
  • Distributions: Continuous: Plot Lognormal Distribution
  • Distributions: Continuous: Sample from Lognormal Distribution
  • Distributions: Continuous: Normal Probabilities
  • Distributions: Continuous: Normal Quantiles
  • Distributions: Continuous: Plot Normal Distribution
  • Distributions: Continuous: Sample from Normal Distribution
  • Distributions: Continuous: t Probabilities
  • Distributions: Continuous: t Quantiles
  • Distributions: Continuous: Plot t Distribution
  • Distributions: Continuous: Sample from t Distribution
  • Distributions: Continuous: Uniform Probabilities
  • Distributions: Continuous: Uniform Quantiles
  • Distributions: Continuous: Plot Uniform Distribution
  • Distributions: Continuous: Sample from Uniform Distribution
  • Distributions: Continuous: Weibull Probabilities
  • Distributions: Continuous: Weibull Quantiles
  • Distributions: Continuous: Plot Weibull Distribution
  • Distributions: Continuous: Sample from Weibull Distribution
  • Distributions: Discrete: Binomial Probabilities
  • Distributions: Discrete: Binomial Quantiles
  • Distributions: Discrete: Binomial Tail Probabilities
  • Distributions: Discrete: Plot Binomial Distribution
  • Distributions: Discrete: Sample from Binomial Distribution
  • Distributions: Discrete: Geometric Probabilities
  • Distributions: Discrete: Geometric Quantiles
  • Distributions: Discrete: Geometric Tail Probabilities
  • Distributions: Discrete: Plot Geometric Distribution
  • Distributions: Discrete: Sample from Geometric Distribution
  • Distributions: Discrete: Hypergeometric Probabilities
  • Distributions: Discrete: Hypergeometric Quantiles
  • Distributions: Discrete: Hypergeometric Tail Probabilities
  • Distributions: Discrete: Plot Hypergeometric Distribution
  • Distributions: Discrete: Sample from Hypergeometric Distribution
  • Distributions: Discrete: Negative Binomial Probabilities
  • Distributions: Discrete: Negative Binomial Quantiles
  • Distributions: Discrete: Negative Binomial Tail Probabilities
  • Distributions: Discrete: Plot Negative Binomial Distribution
  • Distributions: Discrete: Sample from Negative Binomial Distribution
  • Distributions: Discrete: Poisson Probabilities
  • Distributions: Discrete: Poisson Quantiles
  • Distributions: Discrete: Poisson Tail Probabilities
  • Distributions: Discrete: Plot Poisson Distribution
  • Distributions: Discrete: Sample from Poisson Distribution

Comparing Point-and-Click Front Ends for R

For an updated version of this post, see: http://r4stats.com/articles/software-reviews/r-gui-comparison/.

Now that I’ve completed seven detailed reviews of Graphical User Interfaces (GUIs) for R, let’s compare them. It’s easy enough to count their features and plot them, so let’s start there. I’m basing the counts on the number of menu items in each category. That’s not too hard to get, but it’s far from perfect. Some software has fewer menu choices, depending instead on dialog box choices. Studying every menu and dialog box would be too time-consuming, so be aware of this limitation. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own graphs. If you decide to make your own graphs, I’d love to hear from you in the comments below.

Figure 1 shows the number of analytic methods each software supports on the x-axis and the number of graphics methods on the y-axis. The analytic methods count combines statistical features, machine learning / artificial intelligence ones (ML/AI), and the ability to create R model objects. The graphics features count totals up the number of bar charts, scatterplots, etc. each package can create.

The ideal place to be in this graph is in the upper right corner. We see that BlueSky and R Commander offer quite a lot of both analytic and graphical features. Rattle stands out as having the second greatest number of graphics features. JASP is the lowest on graphics features and 3rd from the bottom on analytic ones.

Next, let’s swap out the y-axis for general usability features. These consist of a variety of features that make your work easier, including data management capabilities (see appendix for details).

Figure 2 shows that BlueSky and R Commander are still in the top two positions overall, but now Deducer has nearly caught up with R Commander on the number of general features. That’s due to its reasonably strong set of data management tools, plus its output is in true word processing tables, saving you the trouble of formatting it yourself. Rattle is much lower in this plot since, while its graphics capabilities are strong (at least in relation to ML/AI tasks), it has minimal data management capabilities.

These plots help show us three main overall feature sets, but each package offers things that the others don’t. Let’s look at a brief overview of each. Remember that each of these has a detailed review that follows my standard template. I’ll start with the two that have come out on top, then follow in alphabetical order.

The R Commander – This is the oldest GUI, having been around since at least 2005. There are an impressive 41 plug-ins developed for it. It is currently the only R GUI that saves R Markdown files, but it does not create word processing tables by default, as some of the others do. The R code it writes is classic, rarely using the newer tidyverse functions. It works as a partner to R; you install R separately, then use it to install and start R Commander. It makes it easy to blend menu-based analysis with coding. If your goal is to learn to code in classic R, this is an excellent choice.

BlueSky Statistics – This software was created by former SPSS employees and it shares many of SPSS’ features. BlueSky is only a few years old, and it converted from commercial to open source just a few months ago. Although BlueSky and R Commander offer many of the same features, they do them in different ways. When using BlueSky, it’s not initially apparent that R is involved at all. Unless you click the “Syntax” button that every dialog box has, you’ll never see the R code or the code editor. Its output is in publication-quality tables which follow the popular style of the American Psychological Association.

Deducer – This has a very nice-looking interface, and it’s probably the first to offer true word processing tables by default. Being able to just cut and paste a table into your word processor saves a lot of time and it’s a feature that has been copied by several others. Deducer was released in 2008, and when I first saw it, I thought it would quickly gain developers. It got a few, but development seems to have halted. Deducer’s installation is quite complex, and it depends on the troublesome Java software. It also used JGR, which never became as popular as the similar RStudio. The main developer, Ian Fellows, has moved on to another very interesting GUI project called Vivid.

jamovi – The developers who form the core of the jamovi project used to be part of the JASP team. Although they started a couple of years later, they’re ahead of JASP in several ways at the moment. Its developers decided that the R code it used should be visible and any R code should be executable, something that differentiated it from JASP. jamovi has an extremely interactive interface that shows you the result of every selection in each dialog box. It also saves the settings in every dialog box, and lets you re-use every step on a new dataset by saving a “template.” That’s extremely useful since GUI users often don’t want to learn R code. jamovi’s biggest weakness is its dearth of data management tasks, though there are plans to address that.

JASP – The biggest advantage JASP offers is its emphasis on Bayesian analysis. If that’s your preference, this might be the one for you. At the moment JASP is very different from all the other GUIs reviewed here because it won’t show you the R code it’s writing, and you can’t execute your own R code from within it. Plus the software has not been open to outside developers. The development team plans to address those issues, and their deep pockets should give them an edge.

Rattle – If your work involves ML/AI (a.k.a. data mining) instead of standard statistical methods, Rattle may be the best GUI for you. It’s focused on ML/AI, and its tab-based interface makes quick work of it. However, it’s the weakest of them all when it comes to statistical analysis. It also lacks many standard data management features. The only other GUI that offers many ML/AI features is BlueSky.

RKWard – This GUI blends a nice point-and-click interface with an integrated development environment that is the most advanced of all the GUIs reviewed here. It’s easy to install and start, and it saves all your dialog box settings, allowing you to rerun them. However, that’s done step-by-step, not all at once as jamovi’s templates allow. The code RKWard creates is classic R, with no tidyverse at all.

Conclusion

I hope this brief comparison will help you choose the R GUI that is right for you. Each offers unique features that can make life easier for non-programmers. If one catches your eye, don’t forget to read the full review of it here.

Acknowledgements

Writing this set of reviews has been a monumental undertaking. It would not have been possible without the assistance of Bruno Boutin, Anil Dabral, Ian Fellows, John Fox, Thomas Friedrichsmeier, Rachel Ladd, Jonathan Love, Ruben Ortiz, Christina Peterson, Josh Price, Eric-Jan Wagenmakers, and Graham Williams.

Appendix: Guide to Scoring

In Figures 1 and 2, Analytic Features is the sum of four rows: statistics, machine learning / artificial intelligence, the ability to create R model objects, and the ability to validate models using techniques such as k-fold cross-validation. Graphics Features is the sum of two rows: the number of graph types the software can create, plus one point for small multiples (facets) if it can do them. Usability is everything else, with each row worth 1 point, except where noted.

| Feature | Definition |
|---|---|
| Simple installation | Is it done in one step? |
| Simple start-up | Does it start on its own without starting R, loading packages, etc.? |
| Import Data Files | How many file types can it import? |
| Import Database | How many databases can it read from? |
| Export Data Files | How many file formats can it write to? |
| Data Editor | Does it have a data editor? |
| Can work on >1 file | Can it work on more than one file at a time? |
| Variable View | Does it show metadata in a variable view, allowing for many fast edits to metadata? |
| Data Management | How many data management tasks can it do? |
| Transform Many | Can it transform many variables at once? |
| Graph Types | How many graph types does it have? |
| Small Multiples | Can it show small multiples (facets)? |
| Model Objects | Can it create R model objects? |
| Statistics | How many statistical methods does it have? |
| ML/AI | How many ML / AI methods does it have? |
| Model Validation | Does it offer model validation (k-fold, etc.)? |
| R Code IDE | Can you edit and execute R code? |
| GUI Reuse | Does it let you re-use work without code? |
| Code Reuse | Does it let you rerun all using code? |
| Package Management | Does it manage packages for you? |
| Table of Contents | Does output have a table of contents? |
| Re-order | Can you re-order output? |
| Publication Quality | Is output in publication quality by default? |
| R Markdown | Can it create R Markdown? |
| Add comments | Can you add comments to output? |
| Group-by | Does it do group-by repetition of any other task? |
| Output as Input | Does it save equivalent to broom’s tidy, glance, augment? (They earn 1 point for each) |
| Developer tools | Does it offer developer tools? |

Scores

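| Feature | BlueSky | Deducer | JASP | jamovi | Rattle | Rcmdr | RKWard |
|---|---|---|---|---|---|---|---|
| Simple installation | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| Simple start-up | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| Import Data Files | 7 | 13 | 4 | 5 | 9 | 7 | 5 |
| Import Database | 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| Export Data Files | 5 | 7 | 1 | 4 | 1 | 3 | 3 |
| Data Editor | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Can work on >1 file | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Variable View | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Data Management | 30 | 9 | 2 | 3 | 9 | 25 | 4 |
| Transform Many | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| Graph Types | 25 | 16 | 9 | 12 | 24 | 21 | 14 |
| Small Multiples | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| Model Objects | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| Statistics | 96 | 37 | 26 | 44 | 8 | 95 | 22 |
| ML/AI | 9 | 0 | 0 | 0 | 12 | 0 | 0 |
| Model Validation | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| R Code IDE | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| GUI Reuse | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Code Reuse | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| Package Management | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| Output: Table of Contents | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output: Re-order | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output: Publication Quality | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Output: R Markdown | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Output: Add comments | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Group-by / Split File | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Output as Input | 3 | 1 | 0 | 0 | 0 | 1 | 0 |
| Developer tools | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| Total | 196 | 94 | 48 | 77 | 67 | 160 | 56 |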
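
If you would like to rebuild the figures yourself, the scoring rules at the top of this appendix can be sketched in a few lines of Python. This is only an illustration, not the code used to draw Figures 1 and 2: the row values are copied from the Scores table above, the variable and function names are made up for the example, and Usability is taken to be the Total row minus the other two measures, which is one reading of "everything else."

```python
# Column order follows the Scores table.
packages = ["BlueSky", "Deducer", "JASP", "jamovi", "Rattle", "Rcmdr", "RKWard"]

# Only the rows that feed the Analytic and Graphics measures are needed,
# plus the Total row (values copied from the Scores table).
scores = {
    "Graph Types":      [25, 16, 9, 12, 24, 21, 14],
    "Small Multiples":  [1, 1, 0, 0, 0, 1, 0],
    "Model Objects":    [1, 1, 0, 0, 0, 1, 1],
    "Statistics":       [96, 37, 26, 44, 8, 95, 22],
    "ML/AI":            [9, 0, 0, 0, 12, 0, 0],
    "Model Validation": [1, 0, 0, 0, 1, 0, 0],
}
totals = [196, 94, 48, 77, 67, 160, 56]

# Row groupings as described in the Guide to Scoring.
ANALYTIC_ROWS = ["Statistics", "ML/AI", "Model Objects", "Model Validation"]
GRAPHICS_ROWS = ["Graph Types", "Small Multiples"]


def column_sums(rows):
    """Add the named rows together, one sum per package."""
    return [sum(scores[row][i] for row in rows) for i in range(len(packages))]


analytic = column_sums(ANALYTIC_ROWS)
graphics = column_sums(GRAPHICS_ROWS)

# Assumption: Usability is whatever remains of the total after removing
# the analytic and graphics points.
usability = [t - a - g for t, a, g in zip(totals, analytic, graphics)]

for name, a, g, u in zip(packages, analytic, graphics, usability):
    print(f"{name:8s} analytic={a:3d} graphics={g:2d} usability={u:2d}")
```

The three resulting lists can be passed to any plotting tool to draw your own versions of the two scatterplots, or edited first if you disagree with how a particular feature was counted.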