Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.
For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.
The book is available in print
and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be
possible through traditional publishers.
Below you will find a detailed table of contents and a list
of the textbook authors.
Table of Contents
OVERVIEW OF BIOMEDICAL DATA SCIENCE
Background and history
the statistician’s perspective
the machine learner’s perspective
the database administrator’s perspective
the data visualizer’s perspective
Data analytical processes
exploratory data analysis (EDA)
predictive modeling approaches
types of models
types of software
Major types of analytics
predictive analytics (modeling)
putting it all together
Biomedical data science tools
Biomedical data science education
Biomedical data science careers
Importance of soft skills in data science
Biomedical data science resources
Biomedical data science challenges
SPREADSHEET TOOLS AND TIPS
basic spreadsheet functions
download the sample spreadsheet
Navigating the worksheet
Clinical application of spreadsheets
formulas and functions
Tips and tricks
Microsoft Excel shortcuts – windows users
Google sheets tips and tricks
Measures of central tendency & dispersion
the normal and log-normal distributions
Descriptive and inferential statistics
Categorical data analysis
Types of research studies
Comparing two groups
the independent-samples t-test
the wilcoxon-mann-whitney test
Comparing more than two groups
Other types of tests
exact or permutation tests
bootstrap or resampling tests
Stats packages and online calculators
non-commercial or open source packages
historical data visualizations
Data visualization software
R programming language
other visualization programs
visualizing categorical data
visualizing continuous data
INTRODUCTION TO DATABASES
A brief history of database models
Relational database structure
Clinical data warehouses (CDWs)
Structured query language (SQL)
The seven v’s of big data related to health care data
BIOINFORMATICS and PRECISION MEDICINE
Biological data analysis – from data to discovery
Biological data types
bioinformatics data in public repositories
biomedical cancer data portals
Tools for analyzing bioinformatics data
command line tools
Genomic data analysis
Genomic data analysis workflow
variant calling pipeline for whole exome sequencing data
variant filtering and annotation
reporting and visualization
Precision medicine – from big data to patient care
Examples of precision medicine
PROGRAMMING LANGUAGES FOR DATA ANALYSIS
installing R & rstudio
an example R program
getting help in R
user interfaces for R
R’s default user interface: rgui
menu & dialog guis
some popular R guis
R graphical user interface comparison
an example Python program
getting help in Python
user interfaces for Python
R vs. Python
training vs test data
bias and variance
supervised and unsupervised learning
Common machine learning algorithms
Evaluation of predictive analytical performance
classification model evaluation
regression model evaluation
Machine learning software
Programming languages and machine learning
Machine learning challenges
Machine learning examples
example 1 classification
example 2 regression
example 3 clustering
example 4 association rules
Image analysis (computer vision)
Image libraries and packages
Natural language processing
NLP libraries and packages
Text mining and medicine
Electronic health record data and AI
deep learning platforms and programs
Artificial intelligence challenges
Socio economic and legal
Adverse unintended consequences
Need for more ML and AI education
Brenda Griffith Technical Writer Data.World Austin, TX
Robert Hoyt MD, FACP, ABPM-CI, FAMIA Associate Clinical Professor Department of Internal Medicine Virginia Commonwealth University Richmond, VA
David Hurwitz MD, FACP, ABPM-CI Associate CMIO Allscripts Healthcare Solutions Chicago, IL
Madhurima Kaushal MS Bioinformatics Washington University at St. Louis, School of Medicine St. Louis, MO
Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA Assistant Professor New York Medical College Department of Emergency Medicine Valhalla, NY
Karen A. Monsen PhD, RN, FAMIA, FAAN Professor School of Nursing University of Minnesota Minneapolis, MN
Robert Muenchen MS, PSTAT Manager, Research Computing Support University of Tennessee Knoxville, TN
Dallas Snider PhD Chair, Department of Information Technology University of West Florida Pensacola, FL
A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.
The WPS Analytics’ version of the SAS language is now available in a Community Edition. This edition allows you to run SAS code on datasets of any size for free. Purchasing a commercial license will get you tech support and the ability to run it from the command line, instead of just interactively. The software license details are listed in this table.
While the WPS version of the SAS language doesn’t do everything the version from SAS Institute offers, it does do quite a lot. The complete list of features is available here.
Back in 2009, the SAS Institute filed a lawsuit against the creators of WPS Analytics,World Programming Limited (WPL), in the High Court of England and Wales. SAS Institute lost the case on the grounds that copyright law applies to software source code, not to its functionality. WPL never had access to SAS Institute’s source code, but they did use a SAS educational license to study how it works. SAS Institute lost another software copyright battle in North Carolina courts, but won over the use of their educational license. SAS Institute is suing a third time, hoping to do better by carefully choosing a pro-patent court in East Texas.
Although I prefer using R, I’m a big fan of the SAS language, as well as SAS Institute, which offers superb technical support. However, I agree with the first two court findings. Copyright law should not apply to a computer language, only to a particular set of source code that creates the language.
R AnalyticFlow (RAF) is a free and open source graphical user interface (GUI) for the R language that focuses on beginners looking to point-and-click their way through analyses. What sets it apart from the other half-dozen GUIs for R is that it uses a flowchart-like workflow diagram to control the analysis instead of only menus. In my first programming class back in the Pleistocene Era, my professor told us to never begin a program without doing a flowchart of what you were trying to accomplish. With workflow tools, you get the benefit of the diagram outlining the big picture, while the dialog box settings in each node control what happens at each step. In Figure 1 you can get a good idea of what is happening without any further information.
Another advantage you get with most workflow tools is the ability to reuse workflows very easily because the dataset is read in only once at the beginning. Unfortunately, most of that advantage is missing from R AnalyticFlow (hereafter, “RAF”) since you must specify which dataset is used in every node. The downside to workflow tools is that they’re slightly harder to learn than menu-based systems. This involves learning how to draw a diagram, what flows through it (e.g. datasets, models), and how to generate a single comprehensive reports for the entire analysis.
This post is one of a series of comparative reviews which aim to help non-programmers choose the GUI that is best for them. The reviews all follow a standard template to make comparisons across products easier. These reviews also include a cursory description of the programming support that each GUI offers.
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.
The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics, jamovi, and RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The Help Desks at most universities are flooded with such calls at the beginning of each semester!
RAF is available for Mac, and Linux. Its installation takes four steps:
Install Java, if you don’t already have it installed. This can be tricky as you must match the type of Java to the type of R you use. Most computers these days have 64-bit operating systems. Whether 32-bit or 64-bit, you must use the same “bitness” on all of these steps, or it will not work.
Next, install R if you haven’t already (available here).
Install RAF itself after downloading it from here.
Start RAF. It will prompt you to install some R packages, notably rJava. This step requires Internet access. To install if you don’t have such access, see the RAF website’s About R Packages section for important details on how to proceed (from another machine that does have Internet access, of course).
When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).
RAF does not offer any plug-in modules, though its developers do provide instruction on how you can create your own.
Some user interfaces for R, such as BlueSky and jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and then call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.
You start RAF directly by double-clicking its icon from your desktop or choosing it from your Start Menu (i.e. not from within R itself). On my system, I had to right-click the icon and choose, “Run as Administrator” or I would get the message, “Failed to Launch R. Confirm Settings?” If I responded “Yes”, it showed the path to my installation of R, which was already correct. I tried a second computer and it did start, but when it tried to install the JavaGD and rJava packages, it said, “Warning in install.packages (c(“JavaGD”,”rJava”)) : ‘lib = “C:/Program Files/R/R-3.6.1/library” ‘ is not writable. Would you like to use a personal library instead?”
Upon startup, it displays its startup screen, shown in Figure 2. Quick Start puts you into the software with a new Flow window open. New Project starts a new workflow, and Bookmarks give you quick access to existing workflows.
A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.
To start entering data, choose “Input> Enter Data” and drag the selection onto the workflow editor window. An empty spreadsheet will appear (Figure 3). You can enter variable names on the first line if you check the “Header: Use 1st Row” box at the bottom of the window. This is the first hint you’ll see that RAF leans on R terminology that can be somewhat esoteric. RAF’s developers could have labeled this choice as “Column Names” but went with the R terminology of “Header” instead. This approach may be confusing for beginners, but if their goal is to learn R, it will help in the long run.
To enter factors (R’s categorical variables), choose the “Options” tab and check, “Convert Characters to Factors”, then RAF will convert the character string variables you enter to factors. Otherwise, it will leave them as characters. Dates remain stored as characters; you have to use “Processing> Set Data Type” node to change them, and they must be entered in the form yyyy-mm-dd.
There is no limit to the number of rows and columns you can enter initially. However, once you choose “Run”, the data frame is created and can no longer be edited!
Saving the workflow is done with the standard “File > Save As” menu. You must save each one to its own file. To save the flow and the various objects that it uses such as data frames and models, use “Project > Export”. When receiving a project from a colleague, use “Project> Import” to begin using it.
To analyze data, you must first read it. While many R GUIs can import a wide range of data formats such as files created by other statistics programs and databases, RAF can import only text and R objects.
RAF’s text import feature is well done. Once you select an Input File, it quickly scans the file and figures out if variable names are present, the delimiters it uses to separate the columns, and so on. It then displays a “preview” (Figure 4, bottom). It does this quickly since its preview is only on the first 100 rows of data. If the preview displays errors, you then manually change the settings and check the preview until it’s correct. When the preview looks good, you click, “Run”, it will then read all the data.
The ability to export data to a wide range of file types helps when you, or other members of your research team, have to use multiple tools to complete a task. Unfortunately, this is a very weak area for R GUIs. Deducer offers no data export at all, and R Commander, and rattle can export only delimited text files (an earlier version of this listed jamovi as having very limited data export; that has now been expanded). Only BlueSky offers a fairly comprehensive set of export options. Unfortunately, RAF falls into the former group, being able only to export data in text and R object files.
It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as BlueSky and the R Commander, can handle many, but not all, of them.
RAF handles a fairly basic set of data management tools:
Rename – Variables in a data frame)
Set Data Type
Missing Values – Sets values as missing, no imputation)
Merge – Various joins
Merge – Adds rows
Manage Objects (copies, deletes, renames)
Workflows, Menus & Dialog Boxes
The goal of pointing & clicking your way through an analysis is to save time by recognizing dialog box settings rather than performing the more difficult task of recalling programming commands. Some GUIs, such as BlueSky and jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.
RAF uses a unique interface. There are two ways to add build a workflow that guides your analysis. First, you can click on a toolbar icon, which drops down a menu. Click on a selection, and – without releasing the mouse button – drag your selection onto the flow window. In that case, the dialog box with its options opens below the flow area (Figure 3, bottom right).
The second way to use it is to click on a toolbar icon, drop down its menu, click on a selection and immediately release the mouse button. This causes the dialog box to appear floating in the middle of the screen (not shown). When you finish choosing your settings, there is a “Drag to Add” button at the top of the dialog. Clicking that button causes the dialog box to collapse into an icon which you can then drag onto the workflow surface.
Regardless of which method you choose, if you drop the new icon onto the top of one that is already in the workflow, it will move the new icon to the right and draw an arrow (called an “edge”) connecting the older one to the new. If you don’t drop it onto an icon that’s already in your workflow, you can add a connecting arrow later by clicking on the first icon, then choose “Draw Edge” and an arrow will appear aimed to the right (workflows go mostly left to right). The arrow will float around as you move your mouse, until you click on the second icon. A third way to connect the nodes in a flow is to click one icon, hold the Alt key down, then drag to the second icon.
Figure 3 shows the entire RAF window. On the top right is the workflow. Here are the steps I followed to create it:
I chose “Input> Read Text File” and dragged it onto the workflow. The icon’s settings appeared in the bottom right window.
I filled in the dialog box’s settings, then clicked “Run”. It named the icon after the file mydata.csv and a spreadsheet appeared in the upper-right.
I chose “Statistics> Cross Tabulation”, and dragged its icon onto the data icon.
I clicked the downward-facing arrow in the “Group By” box, and chose the variables. The first one I chose (workshop) formed the rows and the second (gender) formed the columns. Unlike most GUIs, there’s no indication of row and column roles.
I clicked “Run Node” at the top of the cross tabulation dialog box. The cross tabulation output appeared in the upper left window (right half). The code that RAF wrote to perform the task appears in the R Console window in the lower left.
You can run an entire flow by clicking “Run Flow” at the top left of the Flow window. While describing the process of building a workflow is tedious, learning to build one is quite easy to learn.
The goal of using a GUI is to make analysis easy, so GUI dialog boxes are usually quite simple to use and include everything that’s relevant within a single box. I looked at all the options in this dialog but could not find one to do a very common test for such a cross-tabulation table: the chi-squared test. RAF uses an aspect of R objects that ends up essentially creating two different types of dialog boxes in separate parts of its interface. R objects contain multiple bits of output. You can display them using generic R functions such as summary() and print(). The output window has radio buttons for those functions (Figure 3, right above the cross-tabulation table). Clicking the “summary” button will call R’s summary() function to display the chi-squared results where the table is currently shown. To study the pattern in the table and the chi-squared results requires clicking back and forth on Table and summary; you can’t get them to both appear on your screen at the same time.
Correlations provide another example. The statistics are shown, but their p-values are not shown until you click on the “summary” button. This approach is confusing for beginners, but good for people wishing to learn R.
A common data analysis task is repeating the same analysis across many variables. For example, you might want to repeat the above cross tabulation (or t-tests, etc.) on many variables at once. This is usually quite easy to accomplish in most GUIs, but not in RAF. Since R’s functions may not offer that ability without using R’s “apply” family of functions (or loops), and RAF does not support such functions, such simple tasks become quite a lot of work when using RAF. You need to add an node to your flow for each and every variable!
Each dialog box has an “Advanced” tab which allows you to enter the name of any R argument(s) in one column, and any value(s) you would like to pass to that argument in another. That’s a nice way to offer graphical control over common tasks, while assuring that every task a function is capable of is still available.
In a complex analysis, workflows can become quite complex and hard to read. A solution to this problem is the concept of a “metanode”. Metanodes allow you t take an entire section of your workflow and collapse it into what appears to be a single node. For example, you might commonly use eight nodes to prepare a dataset for analysis. You could combine all eight into a new node you call “Data Prep”, greatly simplifying the workflow. Unfortunately, RAF does not offer metanodes, as do other workflow-driven data science tools such as KNIME and RapidMiner.
One of the most surprising aspects of RAF’s workflow style is that every node specifies its input and output objects. That means that you can run any analysis with no connecting arrows in your diagram! Rather than be a required feature as with many workflow-based tools, in RAF they offer only the convenience of re-running an entire flow at once.
During GUI-driven analysis, the fact that R is doing the work is quite obvious as the code and any resulting messages appear in the Console window.
R GUIs provide simple task-by-task dialog boxes that generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.
The level of help that RAF offers is only the built-in R help file for the particular function you’re using. However, I had problems with the help getting stuck and showing me the help file from previous tasks rather than the one I was currently using.
The various GUIs available for R handle graphics in several ways. Some, such as R Commander and RKWard, focus on R’s built-in graphics. Others, such as BlueSky Statistics use the popular ggplot2 package. Still others, such as jamovi, use their own functions and integrate them into analysis steps.
GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it. That’s how BlueSky and jamovi work.
RAF uses the very flexible lattice package for all of its graphics. That makes it particularly easy to display “small multiples” of the same plot repeated by levels of another variable or two. There does not appear to be any way to control the style of the plots.
One of us (Muenchen) has been tracking The Popularity of Data Science Software using a variety of different approaches. One approach is to use Google Scholar to count the number of scholarly articles found each year for each software. He chose Google Scholar since it searches “across many disciplines and sources: articles, theses, books, abstracts, and court opinions, from academic publishers, professional societies, online repositories, universities, and other web sites.” Figure 1 shows the results from 1995 through 2016. Data collected in 2018 showed that while SPSS use dropped 39% drop from 2017 to 2018, its use was still 66% higher than R in 2018.
We see in the plot that SPSS was extremely dominant for most of that time period. Even after its precipitous decline, it still beats the rest by more than a 2 to 1 margin. Over the years, several people questioned the accuracy of Figure 1. In a time when scholarly publications are proliferating, how could SPSS use be in such decline?
One hypothesis that has often been suggested revolves around one of the most bizarre product name changes in the history of marketing. As a result of a legal battle for control of the name “SPSS”, the SPSS company changed the name of the product to “PASW”, an acronym for Predictive Analytics Software. The change made about as much sense as Coke people renaming Coke to “BSW”, for Bubbly Sugar Water. The battle was settled and in 2011 and the product name reverted back to SPSS.
Could that name change account for the apparent
decline in its use? A search on Google Scholar from 2009 to 2012 on the string:
yielded 12,000 hits. That sounds like quite a few, but when “SPSS” was substituted for “PASW” in that search, we found 701,000 references. At first glance, it seems that the scholarly use of SPSS was undercounted by 1.7%. However, when searching a vast volume of documents, each string may have problems with over-counting. For example, PASW stands for “Plant Available Soil Water” which accounts for 138 of those 12,000 articles. There may be many other such abbreviations. That’s the type of analysis Muenchen did several years ago, before concluding that PASW was more trouble than it was worth (details are here). In 2018 that search yields only 361 hits, and the title of the very first article begins with, “Projections Analysis of Surface Waves (PASW)…”
Muenchen’s hypothesis regarding the apparent decline of SPSS is that it was caused by competition. Back in 2002, SPSS shared the statistical software market with SAS and a couple of others. Its momentum carried it upward for a few more years, then the competition started chipping away at it. GraphPad Prism improved significantly with the release of its version 5 in 2007 and medical users of SPSS found an alternative that was as easy to use while focusing more on their needs. R added enough useful packages around the same time to become competitive. By now there are probably hundreds of packages that people can use to analyze data, only a few of which are shown in Figure 1.
Mackinnon remained skeptical of this hypothesis because the overall graph appears to show decreases in statistical software citation over time. This would seem to contradict evidence that the number of journal articles published has been increasing at about 3% per year over the last 3 centuries, and about 3.9% per year in the past decade (2018 STM Report, pg. 25). Thus, the total number of citations to statistical software as a collective group should be increasing concurrently with this overall increase.
Mackinnon gathered data from a different source: Scopus. According to Wikipedia, “Scopus covers nearly 36,377 titles from approximately 11,678 publishers, of which 34,346 are peer-reviewed journals in top-level subject fields: life sciences, social sciences, physical sciences, and health sciences.” Mackinnon limited the search to reference lists, reasoning that such citations are likely an indicator of using the software in the paper. Two search strings were used:
REF(“the R software” OR “the R
project” OR “r-project.org” OR “R development core”)
These searches are being a bit generous to SPSS by including Modeler and AMOS, and very conservative for R by not including citations to common packages (e.g., ggplot2). The resulting data are plotted in Figure 2.
we see that the citations of R in scholarly journals exceeded that of SPSS back
in 2012. However, the scale of Figure 2 tops out at 30,000 while Figure 1’s
scale peaks at 300,000. Google is finding a lot more documents! So, which of
these software packages is used the most in scholarly work? Good question! We would like to hear your comments below,
especially from readers who collect data from other sources.
It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which I repeat below. It shows the total number of features in each package, including the latest versions of BlueSky Statistics, JASP, and jamovi. The reviews which initially appeared as blog posts are now regularly-updated pages.
New Features in JASP
Let’s take a look at some of the new features, starting with the version of JASP that was released three hours ago:
Data panel, analysis input panel and results panel can be manipulated much more intuitively with sliders and show/hide buttons
Changed the analysis input panel to have an overview of all opened analyses and added the possibility to change titles, to show documentation, and remove analyses
Enhanced the navigation through the file menu; it is now possible to use arrow keys or simply hover over the buttons
Added possibility to scale the entire application with Ctrl +, Ctrl – and Ctrl 0
Added Confirmatory Factor Analysis
Added Bayesian Multinomial Test
Included additional menu preferences to customize JASP to your needs
Added/updated help files for most analyses
R engine updated from 3.4.4 to 3.5.2
Added Šidák correction for post-hoc tests (AN(C)OVA)
A complete list of fixes and features is available here. JASP is available for free from their download page. My comparative review of JASP is here.
New Features in jamovi
Two of the usability features added to jamovi recently are templates and multi-file input. Both are described in detail here.
Templates enable you to save all the steps in your work as a template file. Opening that file in jamovi then lets you open a new dataset and the template will recreate all the previous analyses and graphs using the new data. It provides reusability without having to depend on the R code that GUI users are trying to avoid using.
The multi-file input lets you select many CSV files at once and jamovi will open and stack them all (they must contain common variable names, of course).
Other new analytic features have been added with a set of modeling modules. They’re described in detail here, and a list of some of their capability is below. You can read my full review of jamovi here, and you can download it for free here.
OLS Regression (GLM)
OLS ANOVA (GLM)
OLS ANCOVA (GLM)
Random coefficients regression (Mixed)
Random coefficients ANOVA-ANCOVA (Mixed)
Logistic regression (GZLM)
Logistic ANOVA-like model (GZLM)
Probit regression (GZLM)
Probit ANOVA-like model (GZLM)
Multinomial regression (GZLM)
Multinomial ANOVA-like model (GZLM)
Poisson regression (GZLM)
Poisson ANOVA-like model (GZLM)
Overdispersed Poisson regression (GZLM)
Overdispersed Poisson ANOVA-like model (GZLM)
Negative binomial regression (GZLM)
Negative binomial ANOVA-like model (GZLM)
Continuous and categorical independent variables
Omnibus tests and parameter estimates
Simple slopes analysis
Plots for up to three-way interactions for both categorical and continuous independent variables.
Automatic selection of best estimation methods and degrees of freedom selection
Type III estimation
New Features in BlueSky Statistics
The BlueSky developers have been working on adding psychometric methods (for a book that is due out soon) and support for distributions. My full review is here and you can download BlueSky Statistics for free here.
Model Fitting: IRT: Simple Rasch Model
Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
Model Fitting: IRT: Partial Credit Model
Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
Model Fitting: IRT: Rating Scale Model
Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
Model Statistics: IRT: ICC Plots
Model Statistics: IRT: Item Fit
Model Statistics: IRT: Plot PI Map
Model Statistics: IRT: Item and Test Information
Model Statistics: IRT: Likelihood Ratio and Beta plots
Model Statistics: IRT: Personfit
Distributions: Continuous: BetaProbabilities
Distributions: Continuous: Beta Quantiles
Distributions: Continuous: Plot Beta Distribution
Distributions: Continuous: Sample from Beta Distribution
Distributions: Continuous: Cauchy Probabilities
Distributions: Continuous: Plot Cauchy Distribution
Distributions: Continuous: Cauchy Quantiles
Distributions: Continuous: Sample from Cauchy Distribution
Distributions: Continuous: Sample from Cauchy Distribution
In my ongoing quest to track The Popularity of Data Science Software, I’ve just updated my analysis of the job market. To save you from reading the entire tome, I’m reproducing that section here.
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what is likely to become more popular in the future.
Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As their co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities. It used to have a job trend plotter, but that tool has apparently been shut down.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g. SPSS, Apache Spark) while others are used in data science jobs and more broadly in report-writing jobs (e.g. SAS, Tableau). General-purpose languages (e.g. Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that use them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the graphs in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on May 27, 2019 and February 24, 2017. One might think that a sample of on a single day might not be very stable, but the large number of job sources makes the counts in Indeed.com’s collection of jobs quite consistent. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002.
Figure 1a shows that Python is in the lead with 27,374 jobs, followed by SQL with 25,877. Java and Amazon’s Machine Learning (ML) tools are roughly 25% further below, with jobs in the 17,000s. R and the C variants come next with around 13,000. People frequently compare R and Python, but when it comes to getting a data science job, there are only half as many for R as for Python. That doesn’t mean they’re the same sort of job, of course. I still see more statisticians using R and machine learning people preferring Python, but Python is definitely on a roll! From Hadoop on down, there is a slow decline in jobs. R is also frequently compared to SAS, which has only 8,123 compared to R’s 13,800.
The scale of Figure 1a is so wide that the bottom package, H20 appears to be zero, when in fact there are 257 jobs for it.
To let us compare the less popular software, I plotted them separately in Figure 1b. Mathematica and Julia are the leaders of this set, with around 219 jobs each. The ancient FORTRAN language is still hanging on to life with 195 jobs. The open source WEKA software and IBM’s Watson are next, with around 185 each. From XGBOOST on down, there is a fairly steady slow decline.
There are several tools that use a workflow interface: Enterprise Miner, KNIME, RapidMiner, and SPSS Modeler. They’re all around the same area between 50 and 100 jobs. In many of the other measures of popularity, RapidMiner beats the very similar KNIME tool, but here there are 50% more jobs for the latter. Alteryx is also a workflow-based tool, however, it has pulled away from the pack, appearing back on Figure 1a with 901 jobs.
When interpreting the scale on Figure 1b, what looks like zero is indeed zero. From Systat on down, none of the packages have more than 10 job listings.
It’s important to note that the values shown in Figures 1a and 1b are single points in time. The number of jobs for the more popular software do not change much from day to day. Therefore, the relative rankings of the software shown in Figure 1a is unlikely to change much over the coming year or two. The less popular packages shown in Figure 1b have such low job counts that their ranking is more likely to shift from month to month, though their position relative to the major packages should remain more stable.
Next, let’s look at the change in jobs from the 2017 data to now (2019). Figure 1c shows the percent change for those packages that had at least 100 job listings back in 2017. Without such a limitation, software that goes from 1 job in 2017 to 5 jobs in 2019 would have a 500% increase, but still would be of little interest. Software whose job market is heating up, or growing, is shown in red, while those that are cooling down are shown in blue.
Tensorflow, the deep learning software from Google, is the fastest growing at 523%. Next is Apache Flink, a tool that analyzes streaming data, at 289%. H2O is next, with 150% growth. Caffe is another deep learning framework and its 123% growth reflects the popularity of artificial intelligence algorithms.
Python shows “only” 97% growth, but its popularity was already so high that the 13,471 jobs that it added surpasses the total jobs of many of the other packages!
Tableau is showing a similar rate of growth, though it was a comparably small number of additional jobs, at 4,784.
From the Julia language on down, we see a slowing decrease in growth. I’m surprised to see that jobs for SAS and SPSS are still growing, though barely at 6% and 1%, respectively.
If you enjoyed reading this article, you might be interested in my recent series of reviews on point-and-click front-ends for the R language. I invite you to subscribe to this blog, or follow me on Twitter.
Now that I’ve completed seven detailed reviews of Graphical User Interfaces (GUIs) for R, let’s compare them. It’s easy enough to count their features and plot them, so let’s start there. I’m basing the counts on the number of menu items in each category. That’s not too hard to get, but it’s far from perfect. Some software has fewer menu choices, depending instead on dialog box choices. Studying every menu and dialog box would be too time-consuming, so be aware of this limitation. I’m putting the details of each measure in the appendixso you can adjust the figures and create your own graphs. If you decide to make your own graphs, I’d love to hear from you in the comments below.
Figure 1 shows the number of analytic methods each software supports on the x-axis and the number of graphics methods on the y-axis. The analytic methods count combines statistical features, machine learning / artificial intelligence ones (ML/AI), and the ability to create R model objects. The graphics features count totals up the number of bar charts, scatterplots, etc. each package can create.
The ideal place to be in this graph is in the upper right corner. We see that BlueSky and R Commander offer quite a lot of both analytic and graphical features. Rattle stands out as having the second greatest number of graphics features. JASP is the lowest on graphics features and 3rd from the bottom on analytic ones.
Next, let’s swap out the y-axis for general usability features. These consist of a variety of features that make your work easier, including data management capabilities (see appendix for details).
Figure 2 shows that BlueSky and R Commander still in the top two positions overall, but now Deducer has nearly caught up with R Commander on the number of general features. That’s due to its reasonably strong set of data management tools, plus its output is in true word processing tables saving you the trouble of formatting it yourself. Rattle is much lower in this plot since, while its graphics capabilities are strong (at least in relation to ML/AI tasks), it has minimal data management capabilities.
These plots help show us three main overall feature sets, but each package offers things that the others don’t. Let’s look at a brief overview of each. Remember that each of these has a detailed review that follows my standard template. I’ll start with the two that have come out on top, then follow in alphabetical order.
The R Commander – This is the oldest GUI, having been around since at least 2005. There are an impressive 41 plug-ins developed for it. It is currently the only R GUI that saves R Markdown files, but it does not create word processing tables by default, as some of the others do. The R code it writes is classic, rarely using the newer tidyverse functions. It works as a partner to R; you install R separately, then use it to install and start R Commander. It makes it easy to blend menu-based analysis with coding. If your goal is to learn to code in classic R, this is an excellent choice.
BlueSky Statistics – This software was created by former SPSS employees and it shares many of SPSS’ features. BlueSky is only a few years old, and it converted from commercial to open source just a few months ago. Although BlueSky and R Commander offer many of the same features, they do them in different ways. When using BlueSky, it’s not initially apparent that R is involved at all. Unless you click the “Syntax” button that every dialog box has, you’ll never see the R code or the code editor. Its output is in publication-quality tables which follow the popular style of the American Psychological Association.
Deducer – This has a very nice-looking interface, and it’s probably the first to offer true word processing tables by default. Being able to just cut and paste a table into your word processor saves a lot of time and it’s a feature that has been copied by several others. Deducer was released in 2008, and when I first saw it, I thought it would quickly gain developers. It got a few, but development seems to have halted. Deducer’s installation is quite complex, and it depends on the troublesome Java software. It also used JGR, which never became as popular as the similar RStudio. The main developer, Ian Fellows, has moved on to another very interesting GUI project called Vivid.
jamovi– The developers who form the core of the jamovi project used to be part of the JASP team. Despite the fact that they started a couple of years later, they’re ahead of JASP in several ways at the moment. Its developers decided that the R code it used should be visible and any R code should be executable, something that differentiated it from JASP. jamovi has an extremely interactive interface that shows you the result of every selection in each dialog box. It also saves the settings in every dialog box, and lets you re-use every step on a new dataset by saving a “template.” That’s extremely useful since GUI users often don’t want to learn R code. jamovi’s biggest weakness its dearth of data management tasks, though there are plans to address that.
JASP– The biggest advantage JASP offers is its emphasis on Bayesian analysis. If that’s your preference, this might be the one for you. At the moment JASP is very different from all the other GUIs reviewed here because it won’t show you the R code it’s writing, and you can’t execute your own R code from within it. Plus the software has not been open to outside developers. The development team plans to address those issues, and their deep pockets should give them an edge.
Rattle– If your work involves ML/AI (a.k.a. data mining) instead of standard statistical methods, Rattle may be the best GUI for you. It’s focused on ML/AI, and its tabbed-based interface makes quick work of it. However, it’s the weakest of them all when it comes to statistical analysis. It also lacks many standard data management features. The only other GUI that offers many ML/AI features is BlueSky.
RKWard– This GUI blends a nice point-and-click interface with an integrated development environment that is the most advanced of all the other GUIs reviewed here. It’s easy to install and start, and it saves all your dialog box settings, allowing you to rerun them. However, that’s done step-by-step, not all at once as jamovi’s templates allow. The code RKWard creates is classic R, with no tidyverse at all.
I hope this brief comparison will help you choose the R GUI that is right for you. Each offers unique features that can make life easier for non-programmers. If one catches your eye, don’t forget to read the full review of it here.
Writing this set of reviews has been a monumental undertaking. It would not have been possible without the assistance of Bruno Boutin, Anil Dabral, Ian Fellows, John Fox, Thomas Friedrichsmeier, Rachel Ladd, Jonathan Love, Ruben Ortiz, Christina Peterson, Josh Price, Eric-Jan Wagenmakers, and Graham Williams.
Appendix: Guide to Scoring
In figures 1 and 2, Analytic Features adds up: statistics, machine learning / artificial intelligence, the ability to create R model objects, and the ability to validate models using techniques such as k-fold cross-validation. The Graphics Features is the sum of two rows, the number of graphs the software can create plus one point for small multiples, or facets, if it can do them. Usability is everything else, with each row worth 1 point, except where noted.
Is it done in one step?
Does it start on its own without starting R, loading packages, etc.?
Import Data Files
How many files types can it import?
How many databases can it read from?
Export Data Files
How many file formats can it write to?
Does it have a data editor?
Can work on >1 file
Can it work on more than one file at a time?
Does it show metadata in a variable view, allowing for many fast edits to metadata?
How many data management tasks can it do?
Can it transform many variables at once?
How many graph types does it have?
Can it show small multiples (facets)?
Can it create R model objects?
How many statistical methods does it have?
How many ML / AI methods does it have?
Does it offer model validation (k-fold, etc.)?
R Code IDE
Can you edit and execute R code?
Does it let you re-use work without code?
Does it let you rerun all using code?
Does it manage packages for you?
Table of Contents
Does output have a table of contents?
Can you re-order output?
Is output in publication quality by default?
Can it create R Markdown?
Can you add comments to output?
Does it do group-by repetition of any other task?
Output as Input
Does it save equivalent to broom’s tidy, glance, augment? (They earn 1 point for each)
JASP is a free and open source statistics package that targets beginners looking to point-and-click their way through analyses. This article is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) for R, which best meets their needs. Most of these reviews also include cursory descriptions of the programming support that each GUI offers.
JASP stands for Jeffreys’ Amazing Statistics Program, a nod to the Bayesian statistician, Sir Harold Jeffreys. It is available for Windows, Mac, Linux, and there is even a cloud version. One of JASP’s key features is its emphasis on Bayesian analysis. Most statistics software emphasizes a more traditional frequentist approach; JASP offers both. However, while JASP uses R to do some of its calculations, it does not currently show you the R code it uses, nor does it allow you to execute your own. The developers hope to add that to a future version. Some of JASP’s calculations are done in C++, so getting that converted to R will be a necessary first step on that path.
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.
The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics, jamovi, and RKWard, install in a single step. Others install in multiple steps, such as R Commander (two steps), and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!
JASP’s single-step installation is extremely easy and includes its own copy of R. So if you already have a copy of R installed, you’ll have two after installing JASP. That’s a good idea though, as it guarantees compatibility with the version of R that it uses, plus a standard R installation by itself is harder than JASP’s.
When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) to very high (R Commander).
For JASP, plug-ins are called “modules” and they are found by clicking the “+” sign at the top of its main screen. That causes a new menu item to appear. However, unlike most other software, the menu additions are not saved when you exit JASP; you must add them every time you wish to use them.
JASP’s modules are currently included with the software’s main download. However, future versions will store them in their own repository rather than on the Comprehensive R Archive Network (CRAN) where R and most of its packages are found. This makes locating and installing JASP modules especially easy.
Currently there are only four add-on modules for JASP:
Summary Stats – provides variations on the methods included in the Common menu
SEM – Structural Equation Modeling using lavaan (this is actually more of a window in which you type R code than a GUI dialog)
Three modules are currently in development: Machine Learning, Circular analyses, and Auditing.
Some user interfaces for R, such as BlueSky, jamovi, and Rkward, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R commander and Deducer, have you start R, then load a package from your library, and then call a function to finally activate the GUI. That’s more appropriate for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.
You start JASP directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.
A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including BlueSky and jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a dataset from within it.
JASP is the only program in this set of reviews that lacks a data editor. It has only a data viewer (Figure 2, left). If you point to a cell, a message pops up to say, “double-click to edit data” and doing so will transfer the data to another program where you can edit it. You can choose which program will be used to edit your data in the “Preferences>Data Editing” tab, located under the “hamburger” menu in the upper-right corner. The default is Excel.
When JASP opens a data file, it automatically assigns metadata to the variables. As you can see in Figure 2, it has decided my variable “pretest” was a factor and provided a bar chart showing the counts of every value. For the extremely similar “posttest” variable it decided it was numeric, so it binned the values and provided a more appropriate histogram.
While JASP lacks the ability to edit data directly, it does allow you to edit some of the metadata, such as variable scale and variable (factor levels). I fixed the problem described above by clicking on the icon to the left of each variable name, and changing it from a Venn diagram representing “nominal”, to a ruler for “scale”. Note the use of terminology here, which is statistical rather than based on R’s use of “factor” and “numeric” abxyxas respectively. Teaching R is not part of JASP’s mission.
JASP cannot handle date/time variables other than to read them as character and convert them to factor. Once JASP decides a character or date/time variable is a factor, it cannot be changed.
Clicking on the name of a factor will open a small window on the top of the data viewer where you can over-write the existing labels. Variable names however, cannot be changed without going back to Excel, or whatever editor you used to enter the data.
The ability to import data from a wide variety of formats is extremely important; you can’t analyze what you can’t access. Most of the GUIs evaluated in this series can open a wide range of file types and even pull data from relational databases. JASP can’t read data from databases, but it can import the following file formats:
Comma Separated Values (.csv)
Plain text files (.txt)
SPSS (.sav, but not .zsav, .por)
Open Document Spreadsheet (.ods)
The ability to read SAS and Stata files is planned for a future release. Though based on R, JASP cannot read R data files!
The ability to export data to a wide range of file types helps when you need multiple tools to complete a task. Research is commonly a team effort, and in my experience, it’s rare to have all team members prefer to use the same tools. For these reasons, GUIs such as BlueSky, Deducer, and jamovi offer many export formats. Others, such as R Commander and RKward can create only delimited text files.
A fairly unique feature of JASP is that it doesn’t save just a dataset, but instead it saves the combination of a dataset plus its associated analyses. To save just the dataset, you go to the “File” tab and choose “Export data.” The only export format is comma separated value file (.csv).
It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be computed, transformed, scaled, recoded, or binned; strings and dates need to be manipulated; missing values need to be handled; datasets need to be sorted, stacked, merged, aggregated, transposed, or reshaped (e.g. from “wide” format to “long” and back).
A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time is tedious.
Some GUIs, such as BlueSky and R Commander can handle nearly all of these tasks. Others, such as jamovi and RKWard handle only a few of these functions.
JASP’s data management capabilities are minimal. It has a simple calculator that works by dragging and dropping variable names and math or statistical operators. Alternatively, you can type formulas using R code. Using this approach, you can only modify one variable at time, making day-to-day analysis quite tedious. It’s also unable to apply functions across rows (jamovi handles this via a set of row-specific functions). Using the calculator, I could never figure out how to later edit the formula or even delete a variable if I made an error. I tried to recreate one, but it told me the name was already in use.
You can filter cases to work on a subset of your data. However, JASP can’t sort, stack, merge, aggregate, transpose, or reshape datasets. The lack of combining datasets may be a result of the fact that JASP can only have one dataset open in a given session.
Menus & Dialog Boxes
The goal of pointing and clicking your way through an analysis is to save time by recognizing menu settings rather than performing the more difficult task of recalling programming commands. Some GUIs, such as BlueSky and jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.
JASP’s interface uses tabbed windows and toolbars in a way that’s similar to Microsoft Office. As you can see in Figure 3, the “File” tab contains what is essentially a menu, but it’s already in the dropped-down position so there’s no need to click on it. Depending on your selections there, a side menu may pop out, and it stays out without holding the mouse button down.
The built-in set of analytic methods are contained under the “Common” tab. Choosing that yields a shift from menus to toolbar icons shown in Figure 4.
Clicking on any icon on the toolbar causes a standard dialog box to pop out the right side of the data viewer (Figure 2, center). You select variables to place into their various roles. This is accomplished by either dragging the variable names or by selecting them and clicking an arrow located next to the particular role box. As soon as you fill in enough options to perform an analysis, its output appears instantly in the output window to the right. Thereafter, every option chosen adds to the output immediately; every option turned off removes output. The dialog box does have an “OK” button, but rather than cause the analysis to run, it merely hides the dialog box, making room for more space for the data viewer and output. Clicking on the output itself causes the associated dialog to reappear, allowing you to make changes.
While nearly all GUIs keep your dialog box settings during your session, JASP keeps those settings in its main file. This allows you to return to a given analysis at a future date and try some model variations. You only need to click on the output of any analysis to have the dialog box appear to the right of it, complete with all settings intact.
Output is saved by using the standard “File> Save” selection.
R GUIs provide simple task-by-task dialog boxes which generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception that is the R Commander, which lacks help on the dialog boxes themselves.
JASP’s help files are activated by choosing “Help” from the hamburger menu in the upper right corner of the screen (Figure 5). When checked, a window opens on the right of the output window, and its contents change as you scroll through the output. Given that everything appears in a single window, having a large screen is best.
The help files are very well done, explaining what each choice means, its assumptions, and even journal citations. While there is no reference to the R functions used, nor any link to their help files, the overall set of R packages JASP uses is listed here.
The various GUIs available for R handle graphics in several ways. Some, such as RKWard, focus on R’s built-in graphics. Others, such as BlueSky, focus on R’s popular ggplot graphics. GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it.
There is no “Graphics” menu in JASP; all the plots are created from within the data analysis dialogs. For example, boxplots are found in “Common> Descriptives> Plots.” To get a scatterplot I tried “Common> Regression> Plots” but only residual plots are found there. Next I tried “Common> Descriptives> Plots> Correlation plots” and was able to create the image shown in Figure 6. Apparently, there is no way to get just a single scatterplot.
The plots JASP creates are well done, with a white background and axes that don’t touch at the corners. It’s not clear which R functions are used to create them as their style is not the default from the R’s default graphics package, ggplot2, or lattice.
The most important graphical ability that JASP lacks is the ability to do “small multiples” or “facets”. Faceted plots allow you to compare groups by showing a set of the same type of plot repeated by levels of a categorical variable.
Setting the dots-per-inch is the only graphics adjustment JASP offers. It doesn’t support styles or templates. However, plot editing is planned for a future release.
Here is the selection of plots JASP can create.
Scatter – of residuals
The way statistical models (which R stores in “model objects”) are created and used, is an area on which R GUIs differ the most. The simplest and least flexible approach is taken by RKWard. It tries to do everything you might need in a single dialog box. To an R programmer, that sounds extreme, since R does a lot with model objects. However, neither SAS nor SPSS were able to save models for their first 35 years of existence, so each approach has its merits.
Other GUIs, such as BlueSky and R Commander save R model objects, allowing you to use them for scoring tasks, testing the difference between two models, etc. JASP saves a complete set of analyses, including the steps used to create models. It offers a “Sync Data” option on its File menu that allows you to re-use the entire analysis on a new dataset. However, it does not let you save R model objects.
All of the R GUIs offer a decent set of statistical analysis methods. Some also offer machine learning methods. As you can see from the table below, JASP offers the basics of statistical analysis. Included in many of these are Bayesian measures, such as credible intervals. See Plug-in Modules section above for more analysis types.
3. Binomial Test
4. Contingency Tables (incl. Chi-Squared Test)
5. Correlation: Pearson, Spearman, Kendall
6. Exploratory Factor Analysis (EFA)
7. Linear Regression
8. Logistic Regression
9. Log-Linear Regression
11. Principal Component Analysis (PCA)
12. Repeated Measures ANOVA
13. Reliability Analyses: α, λ6, and ω
14. Structural Equation Modeling (SEM)
15. Summary Stats
16. T-Tests: Independent, Paired, One-Sample
Generated R Code
One of the aspects that most differentiates the various GUIs for R is the code they generate. If you decide you want to save code, what type of code is best for you? The base R code as provided by the R Commander which can teach you “classic” R? The tidyverse code generated by BlueSky Statistics? The completely transparent (and complex) traditional code provided by RKWard, which might be the best for budding R power users?
JASP uses R code behind the scenes, but currently, it does not show it to you. There is no way to extract that code to run in R by itself. The JASP developers have that on their to-do list.
Support for Programmers
Some of the GUIs reviewed in this series of articles include extensive support for programmers. For example, RKWard offers much of the power of Integrated Development Environments (IDEs) such as RStudio or Eclipse StatET. Others, such as jamovi or the R Commander, offer just a text editor with some syntax checking and code completion suggestions.
JASP’s mission is to make statistical analysis easy through the use of menus and dialog boxes. It installs R and uses it internally, but it doesn’t allow you to access that copy (other than in its data calculator.) If you wish to code in R, you need to install a second copy.
Reproducibility & Sharing
One of the biggest challenges that GUI users face is being able to reproduce their work. Reproducibility is useful for re-running everything on the same dataset if you find a data entry error. It’s also useful for applying your work to new datasets so long as they use the same variable names (or the software can handle name changes). Some scientific journals ask researchers to submit their files (usually code and data) along with their written report so that others can check their work.
As important a topic as it is, reproducibility is a problem for GUI users, a problem that has only recently been solved by some software developers. Most GUIs (e.g. the R Commander, Rattle) save only code, but since GUI users don’t write the code, they also can’t read it or change it! Others such as jamovi, RKWard, and the newest version of SPSS, save the dialog box entries and allow GUI users to have reproducibility in the form they prefer.
JASP records the steps of all analyses, providing exact reproducibility. In addition, if you update a data value, all the analyses that used that variable are recalculated instantly. That’s a very useful feature since people coming from Excel expect this to happen. You can also use “File> Sync Data” to open a new data file and rerun all analyses on that new dataset. However, the dataset must have exactly the same variable names in the same order for this to work. Still, it’s a very feature that GUI users will find very useful. If you wish to share your work with a colleague so they too can execute it, they must be JASP users. There is no way to export an R program file for them to use. You need to send them only your JASP file; It contains both the data and the steps you used to analyze it.
A topic related to reproducibility is package management. One of the major advantages to the R language is that it’s very easy to extend its capabilities through add-on packages. However, updates in these packages may break a previously functioning analysis. Years from now you may need to run a variation of an analysis, which would require you to find the version of R you used, plus the packages you used at the time. As a GUI user, you’d also need to find the version of the GUI that was compatible with that version of R.
Some GUIs, such as the R Commander and Deducer, depend on you to find and install R. For them, the problem is left for you to solve. Others, such as BlueSky, distribute their own version of R, all R packages, and all of its add-on modules. This requires a bigger installation file, but it makes dealing with long-term stability as simple as finding the version you used when you last performed a particular analysis. Of course, this depends on all major versions being around for long-term, but for open-source software, there are usually multiple archives available to store software even if the original project is defunct.
JASP if firmly in the latter camp. It provides nearly everything you need in a single download. This includes the JASP interface, R itself, and all R packages that it uses. So for the base package, you’re all set.
Output & Report Writing
Ideally, output should be clearly labeled, well organized, and of publication quality. It might also delve into the realm of word processing through R Markdown, knitr or Sweave documents. At the moment, none of the GUIs covered in this series of reviews meets all of these requirements. See the separate reviews to see how each of the other packages is doing on this topic.
The labels for each of JASP’s analyses are provided by a single main title which is editable, and subtitles, which are not. Pointing at a title will cause a black triangle to appear, and clicking that will drop a menu down to edit the title (the single main one only) or to add a comment below (possible with all titles).
The organization of the output is in time-order only. You can remove an analysis, but you cannot move it into an order that may make more sense after you see it.
While tables of contents are commonly used in GUIs to let you jump directly to a section, or to re-order, rename, or delete bits of output, that feature is not available in JASP.
Those limitations aside, JASP’s output quality is very high, with nice fonts and true rich text tables (Figure 7). Tabular output is displayed in the popular style of the American Psychological Association. That means you can right-click on any table and choose “Copy” and the formatting is retained. That really helps speed your work as R output defaults to mono-spaced fonts that require additional steps to get into publication form (e.g. using functions from packages such as xtable or texreg). You can also export an entire set of analyses to HTML, then open the nicely-formatted tables in Word.
LaTeX users can right-click on any output table and choose “Copy special> LaTeX code” to to recreate the table in that text formatting language.
Repeating an analysis on different groups of observations is a core task in data science. Software needs to provide an ability to select a subset one group to analyze, then another subset to compare it to. All the R GUIs reviewed in this series can do this task. JASP allows you to select the observation to analyze in two ways. First, clicking the funnel icon located at the upper left corner of the data viewer opens a window that allows you to enter your selection logic, such as “gender = Female”. From an R code perspective, it does not use R’s “==” symbol for logical equivalence, nor does it allow you to put value labels in quotes. It generates a subset that you can analyze in the same way as the entire dataset. Second, you can click on the name of a factor, then check or un-check the values you wish to keep. Either way, the data viewer grays out the excluded data lines to give you a visual cue.
Software also needs the ability to automate such selections so that you might generate dozens of analyses, one group at a time. While this has been available in commercial GUIs for decades (e.g. SPSS “split-file”, SAS “by” statement), BlueSky is the only R GUI reviewed here that includes this feature. The closest JASP gets on this topic is to offer a “split” variable selection box in its Descriptives procedure.
Early in the development of statistical software, developers tried to guess what output would be important to save to a new dataset (e.g. predicted values, factor scores), and the ability to save such output was built into the analysis procedures themselves. However, researchers were far more creative than the developers anticipated. To better meet their needs, output management systems were created and tacked on to existing tools (e.g. SAS’ Output Delivery System, SPSS’ Output Management System). One of R’s greatest strengths is that every bit of output can be readily used as input. However, for the simplification that GUIs provide, that’s a challenge.
Output data can be observation-level, such as predicted values for each observation or case. When group-by analyses are run, the output data can also be observation-level, but now the (e.g.) predicted values would be created by individual models for each group, rather than one model based on the entire original data set (perhaps with group included as a set of indicator variables).
You can also use group-by analyses to create model-level data sets, such as one R-squared value for each group’s model. You can also create parameter-level data sets, such as the p-value for each regression parameter for each group’s model. (Saving and using single models is covered under “Modeling” above.)
For example, in our organization, we have 250 departments and want to see if any of them have a gender bias on salary. We write all 250 regression models to a dataset, and then search to find those whose gender parameter is significant (hoping to find none, of course!)
BlueSky is the only R GUI reviewed here that does all three levels of output management. JASP not only lacks these three levels of output management, it even lacks the fundamental observation-level saving that SAS and SPSS offered in their first versions back in the early 1970s. This entails saving predicted values or residuals from regression, or scores from principal components analysis or factor analysis. The developers plan to add that capability to a future release.
While most of the R GUI projects encourage module development by volunteers, the JASP project hasn’t done so. However, this is planned for a future release.
JASP is easy to learn and use. The tables and graphs it produces follow the guidelines of the Americal Psychological Association, making them acceptable by many scientific journals without any additional formatting. Its developers have chosen their options carefully so that each analysis includes what a researcher would want to see. Its coverage of Bayesian methods is the most extensive I’ve seen in this series of software reviews.
As nice as JASP is, it lacks important features, including: a data editor, an R code editor, the ability to see the R code it writes, the ability to handle date/time variables, the ability to read/write R, SAS, and Stata data files, the ability to perform many more fundamental data management tasks, the ability to save new variables such as predicted values or factor scores, the ability to save models so they can be tested on hold-out samples or new data sets, and the ability to reuse an analysis on new data sets using the GUI. While those are quite a few features to add, JASP is funded by several large grants from the Dutch Science Foundation and the ERC, allowing them to guarantee continuous and ongoing development.
Thanks to Eric-Jan Wagenmakers and Bruno Boutin for their help in understanding JASP’s finer points. Thanks also to Rachel Ladd, Ruben Ortiz, Christina Peterson, and Josh Price for their editorial suggestions. Edit
In my neverending quest to track The Popularity of Data Science Software, it’s time to update the section on Scholarly Articles. The rapid growth of R could not go on forever and, as you’ll see below, its use actually declined over the last year.
Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant amounts of effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even as an object of study.
Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I collect data again for the previous years (with one exception noted below).
Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 1,700 articles) in the most recent complete year, 2018. To allow ample time for publication, insertion into online databases, and indexing, the was data collected on 3/28/2019.
SPSS is by far the most dominant package, as it has been for over 20 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. It offers extreme power, though with less ease of use. SAS is in third place, with a slight lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied.
Note that the general-purpose languages: C, C++, C#, FORTRAN, Java, MATLAB, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.
The next group of packages goes from Python through C, with usage declining slowly. The next set starts at Caffe, dropping nearly 50%, and continuing to IBM Watson with a slow decline.
The last two packages in Fig 2a are Weka and Theano, which are quite a drop from IBM Watson, though it’s getting harder to see as the lines shrink.
To continue on this scale would make the remaining packages all appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 1,700 rather than the 80,000 used on Figure 2a.
I chose to begin Figure 2b with software that has fewer than 1,700 articles because it allows us to see RapidMiner and KNIME on the same scale. They are both workflow-driven tools with very similar capabilities. This plot shows RapidMiner with 49% greater usage than KNIME. RapidMiner uses more marketing, while KNIME depends more on word-of-mouth recommendations and a more open source model. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, IBM’s SPSS and SAS. Given that SPSS has roughly 50 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newer packages are growing, while the use of the older ones is shrinking quite rapidly.
Figure 2b also lets us see IBM’s SPSS Modeler, SAS Enterprise Miner, and Alteryx on the same plot. These three are also workflow-driven tools which are quite expensive. None are doing as well here as RapidMiner or KNIME, tools that much less expensive – or free – depending on how you use them (KNIME desktop is free but server is not; RapidMiner is free for analyzing fewer than 10,000 cases).
Another interesting comparison on Figure 2b is JASP and jamovi. Both are open-source tools that focus on statistics rather than machine learning or artificial intelligence. They both use graphical user interfaces (GUIs) in a style that is similar to SPSS. Both also use R behind the scenes to do their calculations. JASP emphasizes Bayesian Analysis and hides its R code; jamovi has a more frequentist orientation, it lets you see its R code, and it lets you execute your own R code directly from within it. JASP currently has nine times as many citations here, though jamovi’s use is growing much more rapidly.
Even newer on the GUI for R scene is BlueSky Statistics, which doesn’t appear on the plot at all since it has zero scholarly articles so far. It was created by a new company and only adopted an open source model a few months ago.
While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time-consuming. What I’ve done instead is collect data only for the past two complete years, 2017 and 2018. This provides the data needed to study year-over-year changes.
Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side); the declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 1,000 articles in 2015. A package that grows from 1 article to 5 may demonstrate 500% growth but is still of little interest.
The recent changes in data science software can be summarized succinctly: AI/ML up; statistics down. The software that is growing contains none of the packages that are associated more with statistical analysis. The software in decline is dominated by the classic packages of statistics: SPSS Statistics, SAS, GraphPad Prism, Stata, Statgraphics, R, Statistica, Systat, and Minitab. JMP is the only traditional statistics package whose scholarly usage is growing. Of the machine learning software that’s declining in usage, there are rough equivalents that are growing (e.g. Mahout down, Spark up).
Of course another summary is: cheap (or free) up; expensive down. Of the growing packages, 13 out of 17 are available in open source. Of those in decline, only 5 out of 13 are open source.
Statistics software has been around much longer than AI/ML software, started back in the days before open source. Stat vendors have been adding AI/ML methods to their software, making them the more comprehensive solutions. The AI/ML vendors or projects are missing an opportunity to add more comprehensive statistics capabilities. Some, such as RapidMiner and KNIME, are indeed expanding in this direction, but very slowly indeed.
At the top of Figure 2c, we see that the deep learning packages Keras and TensorFlow are the fastest growing at nearly 150%. PyTorch is not shown here because it did not have enough usage in the previous year. However, its citation rate went from 616 to 4,670, a substantial 658% growth rate! There are other packages that are not shown here, including JASP with 223% growth, and jamovi with 720% growth. Despite such high growth, the latter still only has 108 citations in 2018. The rapid growth of JASP and jamovi lend credence to the perspective that the overall pattern of change shown in Figure 2c may be more of a result of free vs. expensive software. Neither of them offers any AI/ML features.
Scikit Learn, the Python machine learning library, was a fast grower with a 60% increase.
I was surprised to see IBM Watson growing a healthy 34% as much of the news about it has not been good. It’s awesome at Jeopardy though!
In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot, we that KNIME growing slightly (5.7%) while RapidMiner is declining slightly (1.8%).
The biggest losers in Figure 2c are SPSS, down 39%, and SAS, Prism, and Mahout, all down 24%. Even R is down 13%. Recall that Figure 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use, and R and SAS are still the #2 and #3 most widely used packages in this arena.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.
SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPAD Prism followed a similar pattern, though it peaked a bit later, around 2013.
In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 1015. The result is shown in Figure 2e.
Figure 2e makes it easy to see that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 58 out of over 100 data science tools.
While Figures 2d and 2e show the historical trend that ended in 2016, Figure 2f shows a fresh set of data collected in March, 2019. Since Google’s algorithm changes, preventing the new data from matching exactly with the old, this new data starts at 2015 so the two sets overlap. SPSS is not shown on this graph because its dominance would compress the y-axis, making trends in the others harder to see. However, keep in mind that despite SPSS’ 39% drop from 2017 to 2018, its use is still 66% higher than R’s in 2018! Apparently people are willing to pay for ease of use.
In Figure 2f we can see that the downward trends of SAS, Prism, and Statistica are continuing. We also see that the long and rapid growth of R and Stata has come to an end. Growth that rapid can’t go on forever. It will be interesting to see next year to see if this is merely a flattening of usage or the beginning of a declining trend. As I pointed out in my book, R for Stata Users, there are many commonalities between R and Stata. As a result of this, and the fact that R is open source, I expect R use to stabilize at this level while use of Stata continues to slowly decline.
SPSS’ long-term rapid decline has to level out at some point. They have been chipped away at by many competitors. However, until recently these competitors have either been free and code-based such as R, or menu-based and proprietary, such as Prism. With the fairly recent arrival of JASP, jamovi, and BlueSky Statistics, SPSS now faces software that is both free and menu-based. Previous projects to add menus to R, such as the R Commander and Deducer, were also free and open source, but they required installing R separately and then using R code to activate the menus.
These results apply to scholarly articles in general. The results in specific fields or journals are very likely to be different.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the job advertisements that list science software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
Update: an earlier version of this post included figures that I’ve removed at the request of Forrester, Inc.
In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I describe Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.
Forrester Research, Inc. is a leading global research and advisory firm that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.
Historically, Forrester has conducted their analyses similarly to Gartner’s. That approach compares software that uses point-and-click style software like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to spit the two types of software into separate reports.
Forrester divides the vendors into four categories: Leaders, Strong Performers, Contenders, and Challengers.
In the Leaders category, they include IBM, while Gartner viewed them as a middle-of-the-pack Visionary. Forrester and Gartner both view SAS and RapidMiner as leaders.
The Strong Performers category includes KNIME, which Gartner considered a Leader. Datawatch and Tibco are tied in this segment while Gartner had them far apart, with Datawatch put in very last place by Gartner. Forrester has KNIME and SAP next to each other in this category, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a similar rating to Gartner.
The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. Fico is here too; Gartner did not evaluate them.
Forrester’s Challengers segment includes World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither were considered by Gartner.
Forrester rates some of the notebook-based vendors very differently than Gartner. Here Domino Data Labs is a Leader while Gartner had them at the extreme other end of their plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength is this market is minimal.
In the Strong Performers category are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this category; neither were reviewed by Gartner. Cloudera is here as well; it too was left out by Gartner.
Forrester’s Condenders category contains Google, in a similar position compared to Gartner’s analysis. Anaconda is here too, in a position quite a bit higher than in Gartner’s plot.
The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.
As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!