A Comparative Review of the JASP Statistical Software

Introduction

JASP is a free and open source statistics package that targets beginners looking to point-and-click their way through analyses. This article is one of a series of reviews aimed at helping non-programmers choose the Graphical User Interface (GUI) for R that best meets their needs. Most of these reviews also include cursory descriptions of the programming support that each GUI offers.

JASP stands for Jeffreys’ Amazing Statistics Program, a nod to the Bayesian statistician Sir Harold Jeffreys. It is available for Windows, Mac, and Linux, and there is even a cloud version. One of JASP’s key features is its emphasis on Bayesian analysis. Most statistics software emphasizes a more traditional frequentist approach; JASP offers both. However, while JASP uses R to do some of its calculations, it does not currently show you the R code it uses, nor does it allow you to execute your own. The developers hope to add that to a future version. Some of JASP’s calculations are done in C++, so getting that converted to R will be a necessary first step on that path.


Figure 1. JASP’s main screen.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics, jamovi, and RKWard, install in a single step. Others install in multiple steps, such as R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

JASP’s single-step installation is extremely easy and includes its own copy of R. So if you already have a copy of R installed, you’ll have two after installing JASP. That’s a good idea though, as it guarantees compatibility with the version of R that it uses, plus a standard R installation by itself is harder than JASP’s.

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) to very high (R Commander).

For JASP, plug-ins are called “modules” and they are found by clicking the “+” sign at the top of its main screen. That causes a new menu item to appear. However, unlike most other software, the menu additions are not saved when you exit JASP; you must add them every time you wish to use them.

JASP’s modules are currently included with the software’s main download. Future versions will instead store them in their own repository, rather than on the Comprehensive R Archive Network (CRAN) where R and most of its packages are found, which should make locating and installing JASP modules especially easy.

Currently there are only four add-on modules for JASP:

  1. Summary Stats – provides variations on the methods included in the Common menu
  2. SEM – Structural Equation Modeling using lavaan (this is actually more of a window in which you type R code than a GUI dialog)
  3. Meta Analysis
  4. Network Analysis

Three modules are currently in development: Machine Learning, Circular
analyses, and Auditing.

Startup

Some user interfaces for R, such as BlueSky, jamovi, and RKWard, start by double-clicking a single icon, which is great for people who prefer not to write code. Others, such as R Commander and Deducer, have you start R, then load a package from your library, and then call a function to activate the GUI. That’s more appropriate for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start JASP directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters most are the differences in simplicity. Some GUIs, including BlueSky and jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.
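For readers curious what that difference looks like in R itself, here is a minimal sketch using base R functions (the file names are hypothetical):

```r
# Data-set-centric approach (BlueSky, jamovi): one file holds one data set.
mydata <- read.csv("mydata.csv")                    # open a data set
write.csv(mydata, "mydata.csv", row.names = FALSE)  # save it again

# Workspace-centric approach (RKWard): one file may hold many objects.
save.image("myproject.RData")  # saves every object in the workspace
load("myproject.RData")        # restores them all; then pick your data set
```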

JASP is the only program in this set of reviews that lacks a data editor. It has only a data viewer (Figure 2, left). If you point to a cell, a message pops up to say, “double-click to edit data” and doing so will transfer the data to another program where you can edit it. You can choose which program will be used to edit your data in the “Preferences>Data Editing” tab, located under the “hamburger” menu in the upper-right corner. The default is Excel.

When JASP opens a data file, it automatically assigns metadata to the variables. As you can see in Figure 2, it has decided my variable “pretest” was a factor and provided a bar chart showing the counts of every value. For the extremely similar “posttest” variable it decided it was numeric, so it binned the values and provided a more appropriate histogram.

While JASP lacks the ability to edit data directly, it does allow you to edit some of the metadata, such as a variable’s measurement scale and its factor levels. I fixed the problem described above by clicking on the icon to the left of each variable name and changing it from a Venn diagram, representing “nominal”, to a ruler, for “scale”. Note the use of terminology here, which is statistical rather than based on R’s “factor” and “numeric” classes, respectively. Teaching R is not part of JASP’s mission.

JASP cannot handle date/time variables other than to read them as character and convert them to factor. Once JASP decides a character or date/time variable is a factor, it cannot be changed.
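For comparison, here is a hedged sketch of how the same metadata changes look in R itself (the variable names are hypothetical):

```r
str(mydata$pretest)   # check the class R assigned on import

# Convert a factor back to numeric (the icon change described above):
mydata$pretest <- as.numeric(as.character(mydata$pretest))

# Convert a character variable to a true date, which JASP cannot do:
mydata$visit <- as.Date(as.character(mydata$visit), format = "%Y-%m-%d")
```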

Clicking on the name of a factor opens a small window at the top of the data viewer where you can overwrite the existing labels. Variable names, however, cannot be changed without going back to Excel, or whatever editor you used to enter the data.


Figure 2. The JASP data viewer is shown on the left-hand side.

Data Import

The ability to import data from a wide variety of formats is extremely important; you can’t analyze what you can’t access. Most of the GUIs evaluated in this series can open a wide range of file types and even pull data from relational databases. JASP can’t read data from databases, but it can import the following file formats:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • SPSS (.sav, but not .zsav, .por)
  • Open Document Spreadsheet (.ods)


The ability to read SAS and Stata files is planned for a future release. Though based on R, JASP cannot read R data files!
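For comparison, all of these formats (and the ones JASP lacks) are readable in R itself. Here is a sketch using base R’s read functions plus the haven and readODS packages, which are my choices for illustration, not necessarily what JASP uses internally; file names are hypothetical.

```r
library(haven)    # reads SPSS, SAS, and Stata files
library(readODS)  # reads OpenDocument spreadsheets

csv_data  <- read.csv("study.csv")       # comma separated values
txt_data  <- read.delim("study.txt")     # plain text, tab-delimited
spss_data <- read_sav("study.sav")       # SPSS
ods_data  <- read_ods("study.ods")       # OpenDocument spreadsheet
sas_data  <- read_sas("study.sas7bdat")  # SAS, which JASP cannot yet read
r_data    <- readRDS("study.rds")        # R's own format, which JASP cannot read
```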

Data Export

The ability to export data to a wide range of file types helps when you need multiple tools to complete a task. Research is commonly a team effort, and in my experience, it’s rare to have all team members prefer to use the same tools. For these reasons, GUIs such as BlueSky, Deducer, and jamovi offer many export formats. Others, such as R Commander and RKWard, can create only delimited text files.

An unusual feature of JASP is that it doesn’t save just a dataset; it saves the combination of a dataset plus its associated analyses. To save just the dataset, you go to the “File” tab and choose “Export data.” The only export format is a comma separated value file (.csv).

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be computed, transformed, scaled, recoded, or binned; strings and dates need to be manipulated; missing values need to be handled; datasets need to be sorted, stacked, merged, aggregated, transposed, or reshaped (e.g. from “wide” format to “long” and back).

A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time is tedious.

Some GUIs, such as BlueSky and R Commander can handle nearly all of these tasks. Others, such as jamovi and RKWard handle only a few of these functions.

JASP’s data management capabilities are minimal. It has a simple calculator that works by dragging and dropping variable names and math or statistical operators. Alternatively, you can type formulas using R code. Using this approach, you can only modify one variable at a time, making day-to-day analysis quite tedious. It’s also unable to apply functions across rows (jamovi handles this via a set of row-specific functions). Using the calculator, I could never figure out how to later edit a formula or even delete a variable if I made an error. I tried to recreate one, but it told me the name was already in use.
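To see what JASP’s one-variable-at-a-time calculator is missing, here is a sketch of transforming many variables at once using the dplyr package (the variable names are hypothetical):

```r
library(dplyr)

mydata <- mydata %>%
  mutate(
    across(starts_with("item"), ~ 6 - .x),  # reverse-code a set of survey items
    across(c(weight, height), log)          # log-transform several measurements
  )
```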

You can filter cases to work on a subset of your data. However, JASP can’t sort, stack, merge, aggregate, transpose, or reshape datasets. The inability to combine datasets may stem from the fact that JASP can have only one dataset open in a given session.

Menus & Dialog Boxes

The goal of pointing and clicking your way through an analysis is to save time by recognizing menu settings rather than performing the more difficult task of recalling programming commands. Some GUIs, such as BlueSky and jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.

JASP’s interface uses tabbed windows and toolbars in a way that’s similar to Microsoft Office. As you can see in Figure 3, the “File” tab contains what is essentially a menu, but it’s already in the dropped-down position so there’s no need to click on it. Depending on your selections there, a side menu may pop out, and it stays out without holding the mouse button down.

Figure 3. The File tab, showing its menu and sub-menu, which are always “dropped down”.

The built-in set of analytic methods is contained under the “Common” tab. Choosing it shifts from menus to the toolbar icons shown in Figure 4.

Figure 4. Analysis icons shown on the Common tab.

Clicking on any icon on the toolbar causes a standard dialog box to pop out the right side of the data viewer (Figure 2, center). You select variables to place into their various roles. This is accomplished by either dragging the variable names or by selecting them and clicking an arrow located next to the particular role box. As soon as you fill in enough options to perform an analysis, its output appears instantly in the output window to the right. Thereafter, every option chosen adds to the output immediately; every option turned off removes output. The dialog box does have an “OK” button, but rather than cause the analysis to run, it merely hides the dialog box, making more room for the data viewer and output. Clicking on the output itself causes the associated dialog to reappear, allowing you to make changes.

While nearly all GUIs keep your dialog box settings during your session, JASP keeps those settings in its main file. This allows you to return to a given analysis at a future date and try some model variations. You only need to click on the output of any analysis to have the dialog box appear to the right of it, complete with all settings intact.

Output is saved by using the standard “File> Save” selection.

Documentation & Training

The JASP Materials web page provides links to a helpful array of information to get you started. The How to Use JASP web page offers a cornucopia of training materials, including blogs, GIFs, and videos. The free book, Statistical Analysis in JASP: A Guide for Students, covers the basics of using the software and includes a basic introduction to statistical analysis.

Help

R GUIs provide simple task-by-task dialog boxes which generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.

JASP’s help files are activated by choosing “Help” from the hamburger menu in the upper right corner of the screen (Figure 5). When checked, a window opens on the right of the output window, and its contents change as you scroll through the output. Given that everything appears in a single window, having a large screen is best.

The help files are very well done, explaining what each choice means, its assumptions, and even journal citations. While there is no reference to the R functions used, nor any link to their help files, the overall set of R packages JASP uses is listed here.

Figure 5. JASP with the help file open on the right-hand side.

Graphics

The various GUIs available for R handle graphics in several ways. Some, such as RKWard, focus on R’s built-in graphics. Others, such as BlueSky, focus on R’s popular ggplot graphics. GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it.

There is no “Graphics” menu in JASP; all the plots are created from within the data analysis dialogs. For example, boxplots are found in “Common> Descriptives> Plots.” To get a scatterplot I tried “Common> Regression> Plots” but only residual plots are found there. Next I tried “Common> Descriptives> Plots> Correlation plots” and was able to create the image shown in Figure 6. Apparently, there is no way to get just a single scatterplot.

The plots JASP creates are well done, with a white background and axes that don’t touch at the corners. It’s not clear which R functions are used to create them, as their style does not match the defaults of R’s base graphics, ggplot2, or lattice.

Figure 6. The popular scatterplot is only available as part of a scatterplot matrix.

The most important graphical ability that JASP lacks is the ability to do “small multiples” or “facets”. Faceted plots allow you to compare groups by showing a set of the same type of plot repeated by levels of a categorical variable.
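For readers who need facets, here is a minimal ggplot2 sketch of the kind of small-multiples display JASP cannot yet produce (the data and variable names are hypothetical):

```r
library(ggplot2)

ggplot(mydata, aes(x = pretest, y = posttest)) +
  geom_point() +
  facet_wrap(~ workshop)  # one panel per level of the grouping variable
```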

Setting the dots-per-inch is the only graphics adjustment JASP offers. It doesn’t support styles or templates. However, plot editing is planned for a future release.

Here is the selection of plots JASP can create:

  1. Histogram
  2. Density
  3. Box Plots
  4. Violin Plots
  5. Strip Plots
  6. Bar Plots
  7. Scatterplot matrix
  8. Scatter – of residuals
  9. Confidence intervals

Modeling

The way statistical models (which R stores in “model objects”) are created and used is an area on which R GUIs differ the most. The simplest and least flexible approach is taken by RKWard: it tries to do everything you might need in a single dialog box. To an R programmer, that sounds extreme, since R does a lot with model objects. However, neither SAS nor SPSS was able to save models for their first 35 years of existence, so each approach has its merits.

Other GUIs, such as BlueSky and R Commander save R model objects, allowing you to use them for scoring tasks, testing the difference between two models, etc. JASP saves a complete set of analyses, including the steps used to create models. It offers a “Sync Data” option on its File menu that allows you to re-use the entire analysis on a new dataset. However, it does not let you save R model objects.
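As a point of reference, here is a sketch of what saving and re-using an R model object looks like; this illustrates the capability JASP lacks, not something JASP itself does. Variable and file names are hypothetical.

```r
model <- lm(posttest ~ pretest + gender, data = mydata)  # fit and store a model
saveRDS(model, "model.rds")                              # save the model object

model2 <- readRDS("model.rds")                 # reload it in a later session
predictions <- predict(model2, newdata = newstudents)  # score a new data set
```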

Analysis Methods

All of the R GUIs offer a decent set of statistical analysis methods. Some also offer machine learning methods. As you can see from the table below, JASP offers the basics of statistical analysis. Included in many of these are Bayesian measures, such as credible intervals. See the Plug-in Modules section above for more analysis types.

Analysis (offered in frequentist and/or Bayesian form):
1. ANOVA
2. ANCOVA
3. Binomial Test
4. Contingency Tables (incl. Chi-Squared Test)
5. Correlation: Pearson, Spearman, Kendall
6. Exploratory Factor Analysis (EFA)
7. Linear Regression
8. Logistic Regression
9. Log-Linear Regression
10. Multinomial
11. Principal Component Analysis (PCA)
12. Repeated Measures ANOVA
13. Reliability Analyses: α, λ6, and ω
14. Structural Equation Modeling (SEM)
15. Summary Stats
16. T-Tests: Independent, Paired, One-Sample

Generated R Code

One of the aspects that most differentiates the various GUIs for R is the code they generate. If you decide you want to save code, what type of code is best for you? The base R code as provided by the R Commander which can teach you “classic” R? The tidyverse code generated by BlueSky Statistics? The completely transparent (and complex) traditional code provided by RKWard, which might be the best for budding R power users?

JASP uses R code behind the scenes, but currently, it does not show it to you. There is no way to extract that code to run in R by itself. The JASP developers have that on their to-do list.

Support for Programmers

Some of the GUIs reviewed in this series of articles include extensive support for programmers. For example, RKWard offers much of the power of Integrated Development Environments (IDEs) such as RStudio or Eclipse StatET. Others, such as jamovi or the R Commander, offer just a text editor with some syntax checking and code completion suggestions.

JASP’s mission is to make statistical analysis easy through the use of menus and dialog boxes. It installs R and uses it internally, but it doesn’t allow you to access that copy (other than in its data calculator). If you wish to code in R, you need to install a second copy.

Reproducibility & Sharing

One of the biggest challenges that GUI users face is being able to reproduce their work. Reproducibility is useful for re-running everything on the same dataset if you find a data entry error. It’s also useful for applying your work to new datasets so long as they use the same variable names (or the software can handle name changes). Some scientific journals ask researchers to submit their files (usually code and data) along with their written report so that others can check their work.

As important a topic as it is, reproducibility is a problem for GUI users, a problem that has only recently been solved by some software developers. Most GUIs (e.g. the R Commander, Rattle) save only code, but since GUI users don’t write the code, they also can’t read it or change it! Others such as jamovi, RKWard, and the newest version of SPSS, save the dialog box entries and allow GUI users to have reproducibility in the form they prefer.

JASP records the steps of all analyses, providing exact reproducibility. In addition, if you update a data value, all the analyses that used that variable are recalculated instantly. That’s very useful, since people coming from Excel expect this to happen. You can also use “File> Sync Data” to open a new data file and rerun all analyses on that new dataset. However, the dataset must have exactly the same variable names, in the same order, for this to work. Still, it’s a feature that GUI users will find very useful. If you wish to share your work with colleagues so they too can execute it, they must be JASP users. There is no way to export an R program file for them to use. You need to send them only your JASP file; it contains both the data and the steps you used to analyze it.

Package Management

A topic related to reproducibility is package management. One of the major advantages to the R language is that it’s very easy to extend its capabilities through add-on packages. However, updates in these packages may break a previously functioning analysis. Years from now you may need to run a variation of an analysis, which would require you to find the version of R you used, plus the packages you used at the time. As a GUI user, you’d also need to find the version of the GUI that was compatible with that version of R.

Some GUIs, such as the R Commander and Deducer, depend on you to find and install R. For them, the problem is left for you to solve. Others, such as BlueSky, distribute their own version of R, all R packages, and all of its add-on modules. This requires a bigger installation file, but it makes dealing with long-term stability as simple as finding the version you used when you last performed a particular analysis. Of course, this depends on all major versions being around for the long term, but for open-source software, there are usually multiple archives available to store software even if the original project is defunct.

JASP is firmly in the latter camp. It provides nearly everything you need in a single download: the JASP interface, R itself, and all the R packages that it uses. So for the base package, you’re all set.

Output & Report Writing

Ideally, output should be clearly labeled, well organized, and of publication quality. It might also delve into the realm of word processing through R Markdown, knitr or Sweave documents. At the moment, none of the GUIs covered in this series of reviews meets all of these requirements. See the separate reviews to see how each of the other packages is doing on this topic.

The labels for each of JASP’s analyses are provided by a single main title which is editable, and subtitles, which are not. Pointing at a title will cause a black triangle to appear, and clicking that will drop a menu down to edit the title (the single main one only) or to add a comment below (possible with all titles).

The organization of the output is in time-order only. You can remove an analysis, but you cannot move it into an order that may make more sense after you see it.

While tables of contents are commonly used in GUIs to let you jump directly to a section, or to re-order, rename, or delete bits of output, that feature is not available in JASP.

Those limitations aside, JASP’s output quality is very high, with nice fonts and true rich text tables (Figure 7). Tabular output is displayed in the popular style of the American Psychological Association. You can right-click on any table, choose “Copy,” and the formatting is retained. That really speeds your work, as R output defaults to mono-spaced fonts that require additional steps to get into publication form (e.g. using functions from packages such as xtable or texreg). You can also export an entire set of analyses to HTML, then open the nicely formatted tables in Word.

Figure 7. Output as it appears after pasting into Word. All formatting came directly from JASP.

LaTeX users can right-click on any output table and choose “Copy special> LaTeX code” to recreate the table in that text formatting language.
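For comparison, here is a sketch of the extra step that standard R output requires to reach publication form, using the xtable package mentioned above (the model and file names are hypothetical):

```r
library(xtable)

model <- lm(posttest ~ pretest, data = mydata)
print(xtable(summary(model)), type = "html", file = "model.html")  # Word-friendly
print(xtable(summary(model)), type = "latex")                      # LaTeX source
```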

Group-By Analyses

Repeating an analysis on different groups of observations is a core task in data science. Software needs to provide the ability to select one group of observations to analyze, then another group to compare it to. All the R GUIs reviewed in this series can do this task. JASP allows you to select the observations to analyze in two ways. First, clicking the funnel icon located at the upper left corner of the data viewer opens a window that allows you to enter your selection logic, such as “gender = Female”. From an R code perspective, it does not use R’s “==” symbol for logical equivalence, nor does it allow you to put value labels in quotes. It generates a subset that you can analyze in the same way as the entire dataset. Second, you can click on the name of a factor, then check or un-check the values you wish to keep. Either way, the data viewer grays out the excluded data lines to give you a visual cue.
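The equivalent selection written as R code requires both the “==” operator and quoted value labels, e.g. (variable name hypothetical):

```r
females <- subset(mydata, gender == "Female")  # keep only the Female rows
```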

Software also needs the ability to automate such selections so that you might generate dozens of analyses, one group at a time. While this has been available in commercial GUIs for decades (e.g. SPSS “split-file”, SAS “by” statement), BlueSky is the only R GUI reviewed here that includes this feature. The closest JASP gets on this topic is to offer a “split” variable selection box in its Descriptives procedure.
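For reference, here is a base R sketch of split-file-style processing, repeating an analysis once per group (the variable names are hypothetical):

```r
# Descriptive statistics repeated for each level of a grouping variable:
by(mydata, mydata$gender, summary)

# Any analysis, run once per group, with the results kept in a list:
tests <- lapply(split(mydata, mydata$region),
                function(d) t.test(posttest ~ treatment, data = d))
```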

Output Management

Early in the development of statistical software, developers tried to guess what output would be important to save to a new dataset (e.g. predicted values, factor scores), and the ability to save such output was built into the analysis procedures themselves. However, researchers were far more creative than the developers anticipated. To better meet their needs, output management systems were created and tacked on to existing tools (e.g. SAS’ Output Delivery System, SPSS’ Output Management System). One of R’s greatest strengths is that every bit of output can be readily used as input. However, for the simplification that GUIs provide, that’s a challenge.

Output data can be observation-level, such as predicted values for each observation or case. When group-by analyses are run, the output data can also be observation-level, but now the predicted values (for example) would be created by individual models for each group, rather than by one model based on the entire original data set (perhaps with group included as a set of indicator variables).

You can also use group-by analyses to create model-level data sets, such as one R-squared value for each group’s model. You can also create parameter-level data sets, such as the p-value for each regression parameter for each group’s model. (Saving and using single models is covered under “Modeling” above.)

For example, in our organization, we have 250 departments and want to see if any of them have a gender bias on salary. We write all 250 regression models to a dataset, and then search to find those whose gender parameter is significant (hoping to find none, of course!)
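Here is a hedged sketch of that departmental example in R, using the broom package to collect one row per parameter per model. This is one reasonable way to do it in code, not a feature of JASP or any particular GUI; the variable names are hypothetical, and the “genderFemale” term depends on how the factor is coded.

```r
library(dplyr)
library(broom)

# Fit one regression per department and stack the parameter estimates:
results <- mydata %>%
  group_by(department) %>%
  group_modify(~ tidy(lm(salary ~ gender + experience, data = .x)))

# Flag departments whose gender parameter is significant:
flagged <- filter(results, term == "genderFemale", p.value < 0.05)
```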

BlueSky is the only R GUI reviewed here that does all three levels of output management. JASP not only lacks these three levels of output management, it even lacks the fundamental observation-level saving that SAS and SPSS offered in their first versions back in the early 1970s. This entails saving predicted values or residuals from regression, or scores from principal components analysis or factor analysis. The developers plan to add that capability to a future release.

Developer Issues

While most of the R GUI projects encourage module development by volunteers, the JASP project hasn’t done so. However, this is planned for a future release.

Conclusion

JASP is easy to learn and use. The tables and graphs it produces follow the guidelines of the American Psychological Association, making them acceptable by many scientific journals without any additional formatting. Its developers have chosen their options carefully so that each analysis includes what a researcher would want to see. Its coverage of Bayesian methods is the most extensive I’ve seen in this series of software reviews.

As nice as JASP is, it lacks important features, including: a data editor, an R code editor, the ability to see the R code it writes, the ability to handle date/time variables, the ability to read/write R, SAS, and Stata data files, the ability to perform many more fundamental data management tasks, the ability to save new variables such as predicted values or factor scores, the ability to save models so they can be tested on hold-out samples or new data sets, and the ability to reuse an analysis on new data sets using the GUI. While those are quite a few features to add, JASP is funded by several large grants from the Dutch Science Foundation and the ERC, allowing them to guarantee ongoing development.

Acknowledgements

Thanks to Eric-Jan Wagenmakers and Bruno Boutin for their help in understanding JASP’s finer points. Thanks also to Rachel Ladd, Ruben Ortiz, Christina Peterson, and Josh Price for their editorial suggestions.

Data Science Software Used in Journals: Stat Packages Declining (including R), AI/ML Software Growing

In my neverending quest to track The Popularity of Data Science Software, it’s time to update the section on Scholarly Articles. The rapid growth of R could not go on forever and, as you’ll see below, its use actually declined over the last year.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant amounts of effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even as an object of study.

Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I collect data again for the previous years (with one exception noted below).

Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 1,700 articles) in the most recent complete year, 2018. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 3/28/2019.

Figure 2a. The number of scholarly articles found on Google Scholar, for data science software. Only those with more than 1,700 citations are shown.

SPSS is by far the most dominant package, as it has been for over 20 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. It offers extreme power, though with less ease of use. SAS is in third place, with a slight lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied.

Note that the general-purpose languages: C, C++, C#, FORTRAN, Java, MATLAB, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

The next group of packages goes from Python through C, with usage declining slowly. The next set starts at Caffe, dropping nearly 50%, and continuing to IBM Watson with a slow decline.

The last two packages in Fig 2a are Weka and Theano, which are quite a drop from IBM Watson, though it’s getting harder to see as the lines shrink.

To continue on this scale would make the remaining packages all appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 1,700 rather than the 80,000 used on Figure 2a.

Figure 2b. Number of scholarly articles using each data science software found using Google Scholar. Only those with fewer than 1,700 citations are shown.

I chose to begin Figure 2b with software that has fewer than 1,700 articles because it allows us to see RapidMiner and KNIME on the same scale. They are both workflow-driven tools with very similar capabilities. This plot shows RapidMiner with 49% greater usage than KNIME. RapidMiner uses more marketing, while KNIME depends more on word-of-mouth recommendations and a more open source model. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, IBM’s SPSS and SAS. Given that SPSS has roughly 50 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newer packages is growing, while the use of the older ones is shrinking quite rapidly.

Figure 2b also lets us see IBM’s SPSS Modeler, SAS Enterprise Miner, and Alteryx on the same plot. These three are also workflow-driven tools which are quite expensive. None are doing as well here as RapidMiner or KNIME, tools that are much less expensive – or free – depending on how you use them (KNIME desktop is free but server is not; RapidMiner is free for analyzing fewer than 10,000 cases).

Another interesting comparison on Figure 2b is JASP and jamovi. Both are open-source tools that focus on statistics rather than machine learning or artificial intelligence. They both use graphical user interfaces (GUIs) in a style that is similar to SPSS. Both also use R behind the scenes to do their calculations. JASP emphasizes Bayesian Analysis and hides its R code; jamovi has a more frequentist orientation, it lets you see its R code, and it lets you execute your own R code directly from within it. JASP currently has nine times as many citations here, though jamovi’s use is growing much more rapidly.

Even newer on the GUI for R scene is BlueSky Statistics, which doesn’t appear on the plot at all since it has zero scholarly articles so far. It was created by a new company and only adopted an open source model a few months ago.

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time-consuming. What I’ve done instead is collect data only for the past two complete years, 2017 and 2018. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side); the declining or “cooling” ones are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 1,000 articles in 2015. A package that grows from one article to five may demonstrate 400% growth but is still of little interest.

Figure 2c. Change in Google Scholar citation rate in the most recent complete two years, 2017 and 2018.

The recent changes in data science software can be summarized succinctly: AI/ML up; statistics down. The software that is growing contains none of the packages that are associated more with statistical analysis. The software in decline is dominated by the classic packages of statistics: SPSS Statistics, SAS, GraphPad Prism, Stata, Statgraphics, R, Statistica, Systat, and Minitab. JMP is the only traditional statistics package whose scholarly usage is growing. Of the machine learning software that’s declining in usage, there are rough equivalents that are growing (e.g. Mahout down, Spark up).

Of course another summary is: cheap (or free) up; expensive down. Of the growing packages, 13 out of 17 are available in open source. Of those in decline, only 5 out of 13 are open source.

Statistics software has been around much longer than AI/ML software, having started back in the days before open source. Stat vendors have been adding AI/ML methods to their software, making them the more comprehensive solutions. The AI/ML vendors or projects are missing an opportunity to add more comprehensive statistics capabilities. Some, such as RapidMiner and KNIME, are indeed expanding in this direction, but very slowly.

At the top of Figure 2c, we see that the deep learning packages Keras and TensorFlow are the fastest growing at nearly 150%. PyTorch is not shown here because it did not have enough usage in the previous year. However, its citation rate went from 616 to 4,670, a substantial 658% growth rate! There are other packages that are not shown here, including JASP with 223% growth, and jamovi with 720% growth. Despite such high growth, the latter still only has 108 citations in 2018. The rapid growth of JASP and jamovi lend credence to the perspective that the overall pattern of change shown in Figure 2c may be more of a result of free vs. expensive software. Neither of them offers any AI/ML features.

Scikit Learn, the Python machine learning library, was a fast grower with a 60% increase.

I was surprised to see IBM Watson growing a healthy 34% as much of the news about it has not been good. It’s awesome at Jeopardy though!

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot, we see that KNIME is growing slightly (5.7%) while RapidMiner is declining slightly (1.8%).

The biggest losers in Figure 2c are SPSS, down 39%, and SAS, Prism, and Mahout, all down 24%. Even R is down 13%. Recall that Figure 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use, and R and SAS are still the #2 and #3 most widely used packages in this arena.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of Google Scholar citations for each classic statistics package per year from 1995 through 2016.

SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.

Figure 2e. The number of Google Scholar citations for each classic statistics package from 1995 through 2016, this time with SPSS removed and SAS included only in 2014 and 2015. The expanded scale makes it easier to see the rapid growth of the less popular packages.

Figure 2e makes it easy to see that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 58 out of over 100 data science tools.

While Figures 2d and 2e show the historical trend that ended in 2016, Figure 2f shows a fresh set of data collected in March, 2019. Since Google’s algorithm changes prevent the new data from matching exactly with the old, this new data starts at 2015 so the two sets overlap. SPSS is not shown on this graph because its dominance would compress the y-axis, making trends in the others harder to see. However, keep in mind that despite SPSS’ 39% drop from 2017 to 2018, its use is still 66% higher than R’s in 2018! Apparently people are willing to pay for ease of use.


Figure 2f. The number of Google Scholar citations for each classic statistics package per year from 2015 through 2018.

In Figure 2f we can see that the downward trends of SAS, Prism, and Statistica are continuing. We also see that the long and rapid growth of R and Stata has come to an end. Growth that rapid can’t go on forever. It will be interesting to see next year whether this is merely a flattening of usage or the beginning of a declining trend. As I pointed out in my book, R for Stata Users, there are many commonalities between R and Stata. As a result of this, and the fact that R is open source, I expect R use to stabilize at this level while use of Stata continues to slowly decline.

SPSS’ long-term rapid decline has to level out at some point. They have been chipped away at by many competitors. However, until recently these competitors have either been free and code-based such as R, or menu-based and proprietary, such as Prism. With the fairly recent arrival of JASP, jamovi, and BlueSky Statistics, SPSS now faces software that is both free and menu-based. Previous projects to add menus to R, such as the R Commander and Deducer, were also free and open source, but they required installing R separately and then using R code to activate the menus.

These results apply to scholarly articles in general. The results in specific fields or journals are very likely to be different.

To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the job advertisements that list data science software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!

Data Science Software Reviews: Forrester vs. Gartner

Update: an earlier version of this post included figures that I’ve removed at the request of Forrester, Inc.

In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I describe Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.

Forrester Research, Inc. is a leading global research and advisory firm that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.

Historically, Forrester has conducted their analyses similarly to Gartner’s. That approach compares point-and-click style software, like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports.

The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018 covers software that is controllable by various means such as menus, workflows, wizards, or code (as of 3/22/2019 available free here). Forrester plans to cover tools for automated modeling in a separate report, due out in 2019. Given that automation is now a widely adopted feature of several of the companies covered in this report, that seems like an odd approach.

Forrester divides the vendors into four categories: Leaders, Strong Performers, Contenders, and Challengers.

In the Leaders category, they include IBM, while Gartner viewed them as a middle-of-the-pack Visionary. Forrester and Gartner both view SAS and RapidMiner as leaders.

The Strong Performers category includes KNIME, which Gartner considered a Leader. Datawatch and Tibco are tied in this segment, while Gartner had them far apart, with Datawatch in very last place. Forrester has KNIME and SAP next to each other in this category, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a rating similar to Gartner’s.

The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. FICO is here too; Gartner did not evaluate them.

Forrester’s Challengers segment includes World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither was considered by Gartner.

The Forrester Wave: Notebook-Based Solutions, Q3, 2018 reviews software controlled by notebooks, which blend programming code and output in the same window (as of 3/22/2019 available here).

Forrester rates some of the notebook-based vendors very differently than Gartner. Here Domino Data Labs is a Leader while Gartner had them at the extreme other end of their plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength in this market is minimal.

In the Strong Performers category are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this category; neither were reviewed by Gartner. Cloudera is here as well; it too was left out by Gartner.

Forrester’s Contenders category contains Google, in a similar position compared to Gartner’s analysis. Anaconda is here too, in a position quite a bit higher than in Gartner’s plot.

The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.

As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.

To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!

Gartner’s 2019 Take on Data Science Software

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2019 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the updated section:

IT Research Firms

IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services and provide their analysis in reports that they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. The reports exclude open source software that has no specific company backing, such as R, Python, or jamovi. Even open source projects that do have company backing, such as BlueSky Statistics, are excluded if they have yet to achieve sufficient market adoption. However, they do cover how company products integrate open source software into their proprietary ones.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal companies that are distributing them. On the date of this post, DataRobot is offering free copies.

Gartner, Inc. is one of the research firms that write such reports. Out of the roughly 100 companies selling data science software, Gartner selected 17 that offered “cohesive software.” That software performs a wide range of tasks including data importation, preparation, exploration, visualization, modeling, and deployment.

Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3a shows the resulting “Magic Quadrant” plot for 2019, and 3b shows the plot for the previous year. Here I provide some commentary on their choices, briefly summarize their take, and compare this year’s report to last year’s. The main reports from both years contain far more detail than I cover here.


Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms from their 2019 report (plot done in November 2018, report released in 2019).

The Leaders quadrant is the place for companies whose vision is aligned with their customers’ needs and who have the resources to execute that vision. The further toward the upper-right corner of the plot, the better the combined score.

  • RapidMiner and KNIME reside in the best part of the Leaders quadrant this year and last. This year RapidMiner has the edge in ability to execute, while KNIME offers more vision. Both offer free and open source versions, but the companies differ quite a lot on how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner’s free version is limited by a cap on the amount of data that it can analyze (10,000 cases), and as they add new features, they usually come only via a commercial license with “difficult-to-navigate pricing conditions.” These two offer very similar workflow-style user interfaces and have the ability to integrate many open source tools into their workflows, including R, Python, Spark, and H2O.
  • Tibco moved from the Challengers quadrant last year to the Leaders this year. This is due to a number of factors, including the successful integration of all the tools they’ve purchased over the years, including Jaspersoft, Spotfire, Alpine Data, Streambase Systems, and Statistica.
  • SAS declined from being solidly in the Leaders quadrant last year to barely being in it this year. This is due to a substantial decline in its ability to execute. Given SAS Institute’s billions in revenue, that certainly can’t be a financial limitation. It may be due to SAS’ more limited ability to integrate as wide a range of tools as other vendors have. The SAS language itself continues to be an important research tool among those doing complex mixed-effects linear models. Those models are among the very few that R often fails to solve.

The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision.

  • Mathworks moved forward substantially in this quadrant due to MATLAB’s ability to handle unconventional data sources such as images, video, and the Internet of Things (IoT). It has also opened up more to open source deep learning projects.
  • H2O.ai is also in the Visionaries quadrant. This is the company behind the open source  H2O software, which is callable from many other packages or languages including R, Python, KNIME, and RapidMiner. While its own menu-based interface is primitive, its integration into KNIME and RapidMiner makes it easy to use for non-coders. H2O’s strength is in modeling but it is lacking in data access and preparation, as well as model management.
  • IBM dropped from the top of the Visionaries quadrant last year to the middle. The company has yet to fully integrate SPSS Statistics and SPSS Modeler into its Watson Studio. IBM has also had trouble getting Watson to deliver on its promises.
  • Databricks improved both its vision and its ability to execute, but not enough to move out of the Visionaries quadrant. It has done well with its integration of open-source tools into its Apache Spark-based system. However, it scored poorly in the predictability of costs.
  • DataRobot is new to the Gartner report this year. As its name indicates, its strength is in the automation of machine learning, which broadens its potential user base. The company’s policy of assigning a data scientist to each new client gets them up and running quickly.
  • Google’s position could be clarified by adding more dimensions to the plot. Its complex collection of a dozen products that work together is clearly aimed at software developers rather than data scientists or casual users. Simply figuring out what they all do and how they work together is a non-trivial task. In addition, the complete set runs only on Google’s cloud platform. Performance on big data is its forte, especially problems involving image or speech analysis/translation.
  • Microsoft offers several products, but only its cloud-only Azure Machine Learning (AML) was comprehensive enough to meet Gartner’s inclusion criteria. Gartner gives it high marks for ease-of-use, scalability, and strong partnerships. However, it is weak in automated modeling and AML’s relation to various other Microsoft components is overwhelming (same problem as Google’s toolset).

Figure 3b. Last year’s Gartner Magic Quadrant for Data Science and Machine Learning Platforms (January, 2018)

Those in the Challengers quadrant have ample resources but less customer confidence in their future plans, or vision.

  • Alteryx dropped slightly in vision from last year, just enough to drop it out of the Leaders quadrant. Its workflow-based user interface is very similar to that of KNIME and RapidMiner, and it too gets top marks in ease-of-use. It also offers very strong data management capabilities, especially those that involve geographic data, spatial modeling, and mapping. It comes with geo-coded datasets, saving its customers from having to buy such data elsewhere and figure out how to import it. However, it has fallen behind in cutting-edge modeling methods such as deep learning, auto-modeling, and the Internet of Things.
  • Dataiku strengthened its ability to execute significantly from last year. It added better scalability to its ease-of-use and teamwork collaboration. However, it is also perceived as expensive with a “cumbersome pricing structure.”

Members of the Niche Players quadrant offer tools that are not as broadly applicable. These include Anaconda, Datawatch (includes the former Angoss), Domino, and SAP.

  • Anaconda provides a useful distribution of Python and various data science libraries. They provide support and model management tools. The vast army of Python developers is its strength, but lack of stability in such a rapidly improving world can be frustrating to production-oriented organizations. This is a tool exclusively for experts in both programming and data science.
  • Datawatch offers the tools it acquired recently by purchasing Angoss, and its set of “Knowledge” tools continues to get high marks on ease-of-use and customer support. However, it’s weak in advanced methods and has yet to integrate the data management tools that Datawatch had before buying Angoss.
  • Domino Data Labs offers tools aimed only at expert programmers and data scientists. It gets high marks for openness and ability to integrate open source and proprietary tools, but low marks for data access and prep, integrating models into day-to-day operations, and customer support.
  • SAP’s machine learning tools integrate into its main SAP Enterprise Resource Planning system, but its fragmented toolset is weak, and its customer satisfaction ratings are low.

To see many other ways to rate this type of software, see my ongoing article, The Popularity of Data Science Software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!

Updated Review: jamovi User Interface to R

Last February I reviewed the jamovi menu-based front end to R.  I’ve reviewed five more user interfaces since then, and have developed a more comprehensive template to make it easier to compare them all. Now I’m cycling back to jamovi, using that template to write a far more comprehensive review. I’ve added this review to the previous set, and I’m releasing it as a blog post so that it will be syndicated on R-Bloggers, StatsBlogs, et al.

Introduction

jamovi (spelled with a lower-case “j”) is a free and open source graphical user interface for the R software that targets beginners looking to point-and-click their way through analyses. It is available for Windows, Mac, Linux, and even ChromeOS. Versions are also planned for servers and tablets.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) for R that is best for them. Additionally, these reviews include cursory descriptions of the programming support that each GUI offers.

Figure 1. jamovi’s main screen.

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as BlueSky Statistics or RKWard, install in a single step. Others install in multiple steps, such as R Commander (two steps), and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

jamovi’s single-step installation is extremely easy and includes its own copy of R. So if you already have a copy of R installed, you’ll have two after installing jamovi. That’s a good idea though, as it guarantees compatibility with the version of R that it uses, plus a standard R installation by itself is harder than jamovi’s. Python is also installed with jamovi, but it is used only for internal purposes. You can directly control only R through jamovi.

 

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) to very high (R Commander).

For jamovi, plug-ins are called “modules” and they are found in the “jamovi library” rather than on the Comprehensive R Archive Network (CRAN) where R and most of its packages are found. This makes locating and installing jamovi modules especially easy.

Although jamovi is one of the most recent GUIs to appear on the R scene, it has already attracted a respectable number of developers. The modules available at publication time are listed below. You can check for the latest ones on this web page.

  1. Base R – converts jamovi analyses into standard R functions
  2. blandr – Bland-Altman method comparison analysis, and is also available as an R package from CRAN
  3. Death Watch – survival analysis
  4. Distraction – quantiles and probabilities of continuous and discrete distributions
  5. GAMLj –  general linear model, linear mixed model, generalized linear models, etc.
  6. jpower – power analysis for common research designs
  7. Learning Statistics with jamovi – example data sets to accompany the book learning statistics with jamovi
  8. MAJOR – meta-analysis based on R’s metafor package
  9. medmod – basic mediation and moderation analysis
  10. jAMM – advanced mediation analysis (similar to the popular Process Macro for SAS and SPSS)
  11. R Data Sets
  12. RJ – editor to run R code inside jamovi
  13. scatr – scatter plots with marginal density or box plots
  14. Statkat – helps you choose a statistical test
  15. TOSTER – tests of equivalence for t-tests and correlation
  16. Walrus – robust descriptive stats & tests
  17. jamovi Arcade – hangman & blackjack games
 

Startup

Some user interfaces for R, such as BlueSky and RKWard, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and then call a function to finally activate the GUI. That’s more appropriate for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start jamovi directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

 

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters most are the differences in simplicity. Some GUIs, including BlueSky, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

jamovi’s data editor appears at start-up (Figure 1, left) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named A, B, C….

To change metadata, such as variable names, you double-click on a name, and a window (Figure 2) slides open from the top with settings for variable name, description, measurement level (continuous, ordinal, nominal, or ID), data type (integer, decimal, text), variable levels (labels), and a “retain unused levels” switch. Currently, jamovi has no date format, which is a serious limitation if you work with that popular data type.

Figure 2. The jamovi data editor with the variable attributes window open, allowing you to make changes.

When choosing variable terminology, R GUI designers have two choices: follow what most statistics books use, or instead use R jargon. The jamovi designers have opted for the statistics book terminology. For example, what jamovi calls categorical, decimal, or text are called factor, numeric, or character in R. Both sets of terms are fairly easy to learn, but given that some jamovi users may wish to learn R code, I find that choice puzzling. Changing variable settings can be done to many variables at once, which is an important time saver.

You can enter integer, decimal, or character data in the editor right after starting jamovi. It will recognize those types and set their metadata accordingly.

To enter nominal/factor data, you are free to enter numbers, such as 1/2 and later set levels to see Male/Female appear. Or you can set it up in advance and enter the numbers which will instantly turn into labels. That is a feature that saves time and helps assure accuracy. All data editors should offer that choice!
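In base R terms, the set-it-up-in-advance approach corresponds roughly to declaring a factor’s levels and labels before the codes arrive; a minimal sketch with hypothetical codes:

gender <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("Male", "Female"))  # 1/2 display as labels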

Adding variables or observations is as simple as scrolling beyond the set’s current limits and entering additional data. jamovi does not require “add more” buttons as some of its competitors (e.g. BlueSky) do. Adding variables or observations in between existing ones is also easy. Under the “Data” tab, there are two sets of “Add” and “Delete” buttons. The first set deals with variables and the second with cases: you can use the first to insert, compute, or transform variables, or to delete them; the second inserts, appends, or deletes cases. These two sets of buttons are labeled “Variables” and “Rows”, but the font used is so small that I used jamovi for quite a while before noticing the labels.

 

Data Import

The ability to import data from a wide variety of formats is extremely important; you can’t analyze what you can’t access. Most of the GUIs evaluated in this series can open a wide range of file types and even pull data from relational databases. jamovi can’t read data from databases, but it can import the following file formats:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • SPSS (.sav, .zsav, .por)
  • SAS binary files (.sas7bdat, .xpt)
  • JASP (.jasp)

While jamovi doesn’t support true date/time variables, when you import a dataset that contains them, it will convert them to an integer value representing the number of days since 1970-01-01 and assign them labels in the YYYY-MM-DD format.
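In base R terms, that conversion amounts to something like the following sketch (jamovi’s actual internals are not shown here):

days <- as.integer(as.Date("2018-02-15"))  # days since 1970-01-01
format(as.Date(days, origin = "1970-01-01"), "%Y-%m-%d")  # the label displayed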

 

Data Export

The ability to export data to a wide range of file types helps when you have to use multiple tools to complete a task. Research is commonly a team effort, and in my experience, it’s rare to have all team members prefer to use the same tool. For these reasons, GUIs such as BlueSky and Deducer offer many export formats. Others, such as R Commander and RKWard, can create only delimited text files.

A fairly unique feature of jamovi is that it doesn’t save just a dataset; instead it saves the combination of a dataset plus its associated analyses. To save just the dataset, you use the main (a.k.a. hamburger) menu to select “Export” then “Data.” The export formats supported are the same as those provided for import, except for the more rarely used ones such as SAS xpt and SPSS por and zsav:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • SPSS (.sav)
  • SAS binary files (.sas7bdat)

 

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be sorted, stacked, merged, aggregated, transposed, or reshaped (e.g. from “wide” format to “long” and back).

A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time is tedious.

Some GUIs, such as BlueSky and R Commander can handle nearly all of these tasks. Others, such as RKWard handle only a few of these functions.

jamovi’s data management capabilities are minimal. You can transform or recode variables, and doing so across many variables is easy. The transformations are stored in the variable itself, making it easy to see what was done by double-clicking its name. However, the R code for the transformation is not available, even with Syntax Mode turned on.
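For comparison, a base R equivalent of transforming many variables at once might look like this sketch (the variable names are hypothetical):

vars <- c("pretest", "posttest")  # variables to transform
mydata[paste0("log_", vars)] <- lapply(mydata[vars], log)  # add logged copies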

You can also filter cases to work on a subset of your data. However, jamovi can’t sort, stack, merge, aggregate, transpose, or reshape datasets. The inability to combine datasets may stem from the fact that jamovi can have only one dataset open in a given session.

 

Menus & Dialog Boxes

The goal of pointing and clicking your way through an analysis is to save time by recognizing menu settings rather than performing the more difficult task of recalling programming commands. Some GUIs, such as BlueSky, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.

jamovi uses standard menu choices for running steps listed on the Data and Analyses tabs. Dialog boxes appear and you select variables to place into their various roles. This is accomplished by either dragging the variable names or by selecting them and clicking an arrow located next to the particular role box. A unique feature of jamovi is that as soon as you fill in enough options to perform an analysis, its output appears instantly. There is no “OK” or “Run” button as the other GUIs reviewed here have. Thereafter, every option chosen adds to the output immediately; every option turned off is removed.

While nearly all GUIs keep your dialog box settings during your session, jamovi keeps those settings in its main “workspace” file. This allows you to return to a given analysis at a future date and try some model variations. You only need to click on the output of any analysis to have the dialog box appear to the right of it, complete with all settings intact.

Under the triple-dot menu on the upper right side of the screen, you can choose to run “Syntax Mode.” When you turn that on, the R syntax appears immediately, and when you turn it off, it vanishes just as quickly. Turning on syntax mode is the only way a jamovi user would be aware that R is doing the work in the background.
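The syntax it shows consists of calls to jamovi’s jmv package; a hypothetical example of the kind of call that appears (the exact arguments depend on your dialog settings):

jmv::descriptives(data = mydata, vars = "pretest")  # descriptive statistics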

Output is saved by using the standard “Menu> Save” selection.

 

Documentation & Training

The jamovi User Guide covers the basics of using the software. The Resources by the Community web page provides links to a helpful array of documentation and tutorials in written and video form.

 

Help

R GUIs provide simple task-by-task dialog boxes which generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is R Commander, which lacks help on the dialog boxes themselves.

jamovi doesn’t offer any integrated help files, only the documentation described in the Documentation & Training section above. The search for help can become very confusing. For example, after doing the scatterplot shown in the next section, I wondered if the scat() function offered a facet argument; normally this would be an easy question to answer. My initial attempt was to go to RStudio and load jamovi’s jmv package, from which I routinely get help. However, the scat() function is not built into jamovi (or jmv); it comes in the scatr add-on module. So I had to return to jamovi and install the Rj Editor module, which lets you execute R code from within jamovi. However, running “help(scat)” yielded no result. After so much confusion, I never was able to find any help on that function. Hopefully, this situation will improve as jamovi matures.

 

Graphics

The various GUIs available for R handle graphics in several ways. Some, such as RKWard, focus on R’s built-in graphics. Others, such as BlueSky, focus on R’s popular ggplot graphics. GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it.

jamovi uses its own graphics functions to create plots. By default, they have the look of the popular ggplot2 package. jamovi is the only R GUI reviewed that lets you set the plot style in advance, and all future plots will use that style. It does this using four popular themes. jamovi also lets you choose color palettes in advance, from a set of eight.

[Continued…]

BlueSky Statistics 5.40 GUI for R Update

It has been just a few months since I reviewed five free and open-source point-and-click graphical user interfaces (GUIs) to the R language. I plan to keep those reviews up to date as new features are added. BlueSky’s interface would be immediately familiar to anyone who had used SPSS, as its developers model it on that popular software. The BlueSky developers’ goal is to help people use R without having to learn to write computer code.

While the previous version of BlueSky offered dozens of fairly advanced modeling methods such as generalized linear models, random forests, and support vector machines, it lacked some simpler features. Version 5.40 (correction: this was previously listed as version 6.04) adds a dialog for logistic regression, which is essentially its glm dialog simplified to do only logistic regression.

The Multi-Way ANOVA output has also been greatly enhanced, with the addition of a wide range of contrasts from the emmeans package, support for all three types of sums of squares, and plots for both post-hoc main-effects comparisons and interaction plots like this one:

Figure: a BlueSky interaction plot.

Users of RStudio will be pleased that BlueSky’s program editor now submits lines of code the way RStudio does, so you can step your way through a program line by line by clicking the Run button repeatedly.

The new version also has functions to convert strings to dates and vice versa, which led me to realize I had totally missed the string and date functions that it already had. In the “Data> Compute” dialog, the functions for Arithmetic, Logical, Math, and String(1) are visible. But if you click on the “>>” arrow on the right, you’ll also see String (2), Conversion, Statistical, Random Numbers, and four different menus of Date functions.

The complete set of new features and bug fixes is below. You can read my full review of BlueSky here, and you can download the software for free here. I plan to write about new features in other R GUIs, so stay tuned to my blog or follow me on Twitter where I announce new posts. Happy computing!

NEW FEATURES:
=============

1) Added support for weighted datasets. This option is available in Data -> Set Weights. Once you specify the weighting variable, we create a new dataset with rows replicated as defined in the weights. This is similar to what SPSS does internally. When you run frequencies, independent-sample t-tests, graphics commands, statistical tests, etc. on the new dataset (in BlueSky Statistics) with the rows replicated, you will see results identical to those in SPSS.
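In base R terms, that row replication amounts to the following sketch, assuming a data frame df with an integer weight column w:

expanded <- df[rep(seq_len(nrow(df)), times = df$w), ]  # repeat each row w times
rownames(expanded) <- NULL  # reset the duplicated row names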

2) Added the option to specify a weighting variable in Linear and Logistic Regression. This allows an optional vector of weights to be used during the fitting process (i.e. the weighted least-squares solution).
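This corresponds to the weights argument of R’s standard fitting functions; a minimal sketch, assuming a data frame df with a weight column w:

fit <- lm(posttest ~ pretest, data = df, weights = w)  # weighted least-squares fit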

3) Added support for logistic regression under ‘Model Fitting’. Once the model is built, you can score the dataset, optionally obtain a confusion matrix, model statistics, and a ROC curve by selecting the model and clicking score. This is available on the top right-hand corner of the main application window.

4) We have updated the “Multi Way Anova” dialog with the following capabilities:

  • display contrasts.
  • interaction plots.
  • support for type I, type II and type III tests.
  • pairwise comparison.

5) New reshape dialog with simplified R syntax has been added, using the tidyr package.
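As a rough illustration (this is a sketch, not BlueSky’s generated code), a wide-to-long reshape in tidyr looks like this:

library(tidyr)
wide <- data.frame(id = 1:3, t1 = c(10, 12, 9), t2 = c(14, 15, 13))
long <- gather(wide, key = "time", value = "score", t1, t2)  # wide to long
back <- spread(long, key = "time", value = "score")  # and back to wide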

6) The “Multi variable one sample T-Test” and “Multi variable independent sample T-Test(with factor)” have been updated to allow you to specify the alternative hypothesis.

7) Added capabilities to support date manipulations (see the base R sketch after the list).

  • The “String to date” dialog allows you to convert string to POSIXct date class.
  • The “Date to string” dialog allows you to convert the date (POSIXct and Date class) to string.
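These two dialogs correspond roughly to base R’s own conversions; a minimal sketch with hypothetical formats:

d <- as.POSIXct("2018-02-15 09:30", format = "%Y-%m-%d %H:%M")  # string to date
s <- format(d, "%Y-%m-%d")  # date (POSIXct) to string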

8) Simplified the syntax for frequencies and factor analysis.

9) If you launch a second instance of BlueSky Statistics, the message that gets displayed has been improved. NOTE: You can only have one BlueSky Statistics instance running at a time.

10) We have improved the ability to browse the contents of the output window in BlueSky Statistics. This can be accessed from the menu (Layout > Show Navigation Tree).

11) Items in the output window can now be deleted. To delete an item from the output, just right-click on the item (table/text/graphics) and choose the “Delete” option.

12) To make the output visually more appealing, we have introduced an option to hide the R syntax that gets displayed in the output window. This is controlled by an option in the configuration window (Tools > Configuration Settings > Output tab). By default, we hide the R syntax in the output window.

13) When the option to show R syntax in the output is turned on (Tools > Configuration Settings > Output tab) and you resize the output window, we wrap the R syntax so that it is always visible.

14) To run any line of R syntax, just place the cursor on that line and hit the RUN button; you don’t have to select the entire line.

15) If you want to run your R script line by line, place your cursor on the first line and hit RUN; the cursor will automatically move to the next line, and you can hit the RUN button again.

This feature works only for simple R statements that do not span multiple lines.

Example 1: these 3 lines work

a <- 10
b <- 20
c <- a + b

Example 2: these 4 lines will not work

if (TRUE)
{
  print("Great")
}

16) Added helpful hints to indicate you have reached the beginning or end of a dataset when scrolling wide datasets. This is available on the paging controls on the bottom right-hand corner of the screen.

17) Application launch is now faster.

18) In the BlueSky Statistics syntax editor, just like a comma, the pipe (%>%) can be used to break a long R code statement.
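For example, a long dplyr chain can be broken after each pipe (a toy illustration, assuming the dplyr package is loaded):

library(dplyr)
mtcars %>%
  filter(mpg > 20) %>%  # the statement continues after the pipe
  summarise(mean_hp = mean(hp))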

19) When the application launches, we open a new blank dataset that has been populated with zeros. You can right-click on a row to delete it, or go into ‘Data -> Delete Variables’ to delete variables.

20) Clicking on a variable name in the data grid sorts in ascending order by that variable name. Clicking again sorts in descending order.

BUG FIXES:

=============

1) Fixed an issue with factor analysis when saving scores using the regression method.

2) Fixed an issue when re-editing factor levels that were previously changed in the variable grid. This would result in the dialog not functioning correctly and incorrect levels being added.

3) Fixed an issue where closing the empty dataset that gets created when BlueSky Statistics is launched and then attempting to save open datasets would cause those datasets not to be saved correctly.

4) Fixed an issue that was limiting the number of variables that Factor Analysis can be run across.

5) Within block commands that use 'local()', cat("\n") can now be printed in the output to leave some extra spaces.

6) When you added a new factor variable and renamed the variable using the user interface and then tried to add new levels – this did not work and has been fixed.

7) Add a new factor variable, then click in the cell where the new variable name is shown; the cell goes into edit mode. Now add new factor levels to this new variable, switch to the data grid and select a different level, then switch back to the ‘Variables’ tab, and the application crashes. This has been fixed.

8) When any existing factor level name is modified using the user interface, a blank level automatically gets added. If you try to modify the level, it does not take effect. This has been addressed.

9) Disabled the data grid navigation buttons (on the lower right-hand side of the data grid) if there are fewer than 16 columns in the dataset.

10) Fixed an issue where changing factor levels was not working because one or more levels had a single quote in the level name.

11) Aggregate control fixed: Text above the drop-down (that contains mean, median etc.) was getting chopped off. Similar issue with the label text above a textbox (which was almost at the bottom of the dialog).

12) Fixed significance codes for:

  • One Sample T Test and Independent Sample T Test
  • Multivariable one sample t-test and Multivariable Independent one sample t-test (with factor)

13) Fixed a defect: Select some syntax and hit RUN. After execution, the cursor goes to the top of the script. Now it moves to the next line.

14) Data grid navigation buttons are disabled if either end of the data grid is reached. If there are no more columns on the right, the right navigation button is disabled. If there are no more columns on the left, the left navigation button is disabled.

15) Left navigation tree in the output window is fixed for look and feel. It now has a cleaner look. To access left navigation, go to Layout -> Show navigation tree

16) In the R syntax editor, we now ignore square, curly and round brackets that appear inside single or double quotes. See example below:

grepl("[", "a[b", fixed = TRUE)

 

A Review of Qualtrics, QuestionPro, REDCap, SurveyGizmo, & SurveyMonkey

Introduction

Web-based surveys offer a quick and effective way to collect data. Several companies sell software-as-a-service which makes the construction of surveys quite easy using only a web browser. At the University of Tennessee, we currently have a system-wide site license for Qualtrics.  Initial discussions suggested an intent from Qualtrics to more than double its price from the previous year. This article describes the process we followed to evaluate and select an alternative, in the interest of providing similar functionality at a reasonable cost.

In choosing the options to review, our first step was to search for online software reviews of tools that we knew were already in use by various groups across all UT campuses and institutes. If one of those provided similar features at a lower price, we could minimize our training and migration expenses. A summary of the reviews’ overall ratings is shown in Table 1.

Source            Qualtrics   QuestionPro   REDCap   SurveyGizmo   SurveyMonkey
Capterra          5           4.5           –        5             4.5
Comparisons       4.4         4.45          –        4.55          4.6
G2Crowd           4.4         –             4        4.5           4.5
GetApp            5           4.56          –        5             4.5
SMB Guru          4.25        –             –        4.5           3.75
TopTen Reviews    –           –             –        4.965         4.735
Mean Score        4.61        4.5           4        4.75          4.43

(A dash indicates that the site did not review that product.)

Table 1. Summary of web survey ratings from online review sites. Some scores are rescaled to range from 1 through 5. While most scores are available directly from the Source link, to get a score on the Comparisons site you must first choose a comparison of two products, then click on one of them to get its full review.

 

The tools were all highly rated, but we wondered if the raters needed as many features as we use for academic research. A search using Google Scholar confirmed that these were the five survey tools most widely used in scholarly research.

We read all we could find regarding the financial state of each company, read about what  employees said about them on Glassdoor.com, and searched for complaints of any sort regarding the companies or their products. We found no significant problems in any of those areas. The companies all seem to be growing quite rapidly.

We also considered using open source software. However, the most popular open source web survey tool is LimeSurvey, and our previous experience with it was not positive. We investigated reviews of LimeSurvey to see if it had improved since we used it last, but there were very few of them compared to the others.

Selection Process

We formed a committee of ten university faculty and staff, all with substantial survey design expertise. The committee compiled a list of 141 web survey features that we considered important. We then surveyed the university research community, including our current survey tool users, asking them to rate the importance of each feature on a 5-point Likert scale. The detailed table of features and importance ratings appears in the appendix.

We used the features list to write the specifications for a Request for Qualified Suppliers (RFQ-S), a type of Request for Proposal. We specified a 5-year contract including separate pricing for: individuals, groups (e.g. departments, colleges, or institutes), single campus, and multi-campuses; as well as internal use, external use with not-for-profit clients, and external use with for-profit clients.

External use is important because the University of Tennessee is a Carnegie Engaged University, which means one of its prime goals is to perform collaborative research with external organizations. The for-profit category was included because students like to solve the types of problems that companies provide. Our Institute for Public Service also occasionally does surveys for companies in Tennessee. Such use is often prohibited from academic software licenses.

The RFQ-S was sent to the five vendors shown in Table 1. To keep things comparable, REDCap Cloud was chosen as the vendor for REDCap. They offer the same type of software-as-a-service as the other vendors, though REDCap is also available for free for on-premises installations.

The bids we received covered an extremely wide range, with the highest price more than ten times larger than the lowest. Each of the responding vendors specified which of the 141 features they offered. Only one vendor, Qualtrics, stated that it was not acceptable to use their software for the benefit of any type of external organizations. Qualtrics requires an expensive commercial license when third parties are involved.

We then tested each feature, verifying the companies’ claims. We considered rating each for ease-of-use and effectiveness, but found that if the feature was implemented at all, it was generally both easy to use and effective.

We then created a composite score which weighted each feature according to the ratings from the feature importance survey. Our purchasing department prefers 1,000-point scales, so we adjusted the scores accordingly. A perfect score of 1,000 would indicate software that offered every feature that a demanding person would describe as Very Important. The resulting scores are shown in Table 2. Readers can use the data in the appendix to develop their own scores. Since REDCap Cloud did not respond to our bid, we did not evaluate or score it. However, it does offer an extensive feature set, including some very advanced features for database use and clinical trials. We are already using the free on-site version for projects that require those features, but it’s not as easy to use as the other products.

                           Qualtrics   QuestionPro   SurveyGizmo   SurveyMonkey
Total features (of 141)    134         132           130           111
Raw score                  541         529           524           456
Scaled score (x 1.418)     767         750           744           647

Table 2. Feature counts and scores weighted by feature importance. Each feature is listed in the appendix. See vendors for pricing.

 

QuestionPro and SurveyGizmo offered nearly identical feature sets to Qualtrics, with equivalent ease-of-use, at significantly lower prices. SurveyMonkey offered a smaller set of features, at a price that was much lower than Qualtrics, but still much higher than the others.

We called the three references that QuestionPro and SurveyGizmo had each provided. They all reported that the software worked well, had close to zero downtime, had technical support that was quick and competent, and they all reported planning to continue using their chosen vendor well into the future.

QuestionPro not only scored the highest on the 141 attributes we initially focused on, but it also currently offers advanced features that we had not even considered. The company’s future features roadmap is substantially more advanced than that of any other product currently on the market, and than anything we heard regarding the other vendors’ future plans. As a result, we decided to go with QuestionPro.

Migration Issues

We have changed web survey tools twice before, so we are well aware of the challenges involved. When moving to a new software platform, the software cost is important, but so are migration costs. For example, if we were to attempt a migration from SAS to R in one year’s time, the cost to train people and convert tens of thousands of programs would far outweigh the savings (at least at educational prices). However, survey software is relatively easy to learn, taking an hour to learn how to set up a typical survey. We estimate that a typical survey could be migrated in well under an hour. Complex ones might take much longer, especially those that must migrate longitudinal data along with the survey.

The Qualtrics administrative dashboard allows us to download extensive details about how our people have used the software over the last five years. The total number of accounts is intimidating, at just over 11,000. However, thousands of accounts contain only a single survey with a single response, indicating that the person was simply trying the software out. Thousands more accounts have not been logged into in years. We learned that 80% of surveys each year are created by new users who are starting from scratch and thus have no work to migrate.

We have developed preliminary estimates of migration effort from current usage, and they range between a total of 1,000 and 4,000 hours. Our support staff will be available to help users migrate their projects. We are developing a survey to get more details from current users to better assess migration needs. This will allow us to determine the number of projects that entail additional complexity such as multi-institutional collaboration or longitudinal studies that must maintain long-term data compatibility. We will also be recording the time it takes to move surveys and data to the new system. As we collect such data, we will be able to forecast how long the migration will take.

Conclusion

Our work resulted in the selection of a software package, QuestionPro, which is comparable to Qualtrics at a much lower price. Comparing our final expenditure to the initial price increase that our Qualtrics sales rep requested, the savings over the 5-year life of the contract is over $900,000. In addition, we now have a product that members of the university community can use to solve problems for companies, helping to engage UT more fully into the economy of Tennessee.

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter.

 

Appendix: Web survey tool features and their importance ratings

The importance of each feature (1 = very low importance … 5 = very important) was determined by current users of web survey tools at the University of Tennessee. A rating of zero indicates that the software lacks that feature. This information was current as of 12/1/2017; the vendors all regularly add new features so check their web sites for their latest information.

Feature Qualtrics QuestionPro SurveyGizmo SurveyMonkey
Display Text (e.g. instructions) 4.46 4.46 4.46 4.46
Skip Logic 4.65 4.65 4.65 4.65
Display Logic 4.65 4.65 4.65 4.65
Required Questions / Required Answers 4.64 4.64 4.64 4.64
Redirect Browser at end of survey 3.83 3.83 3.83 3.83
Anonymous Responses (separate survey data from distribution/contact information) 4.51 4.51 4.51 4.51
Embedded Data/Hidden Values/Custom Values 3.85 3.85 3.85 3.85
Survey Collaboration 4.51 4.51 4.51 4.51
Single Response 4.68 4.68 4.68 4.68
Multiple Response 4.70 4.70 4.70 4.70
Likert Grid/Matrix 4.60 4.60 4.60 4.60
Open-ended text 4.60 4.60 4.60 4.60
Email distribution 4.72 4.72 4.72 4.72
Contact Management 4.50 4.50 4.50 4.50
Send Reminders 4.57 4.57 4.57 4.57
Export to CSV 4.69 4.69 4.69 4.69
Export to at least one statistics package (such as SAS, SPSS…) 4.51 4.51 4.51 4.51
SSO login 4.51 4.51 4.51 4.51
Phone Support (for users and admins) 4.51 4.51 4.51 4.51
Integrated Question Design Methodology Advice 0.00 0.00 0.00 3.96
Machine Learning to Improve Response Rate 0.00 3.84 0.00 3.84
Randomization of Questions 3.71 3.71 3.71 3.71
Randomization of Responses 3.57 3.57 3.57 3.57
A/B Test Questions (Split respondents between scenarios) 3.82 3.82 3.82 3.82
Multi-Lingual Surveys 3.68 3.68 3.68 3.68
Quiz Development 3.74 3.74 3.74 3.74
Score Survey 3.89 3.89 3.89 3.89
Custom Survey Templates 4.23 4.23 4.23 4.23
Branch Logic 4.62 4.62 4.62 4.62
Preview Survey for Standard Screen Size 4.58 4.58 4.58 4.58
Preview Survey for Phone Screen Size 4.58 4.58 4.58 4.58
Easy to Create Interface 3.95 3.95 3.95 3.95
Recode Values/Set Reporting Values (i.e. set Yes to 1 and No to 0, or recode on an unusual scale such as 0,1,3,5) 4.16 4.16 4.16 4.16
Custom Javascript Support 3.59 3.59 3.59 0.00
Loop&Merge / Page Piping (loop through the same set of questions a given number of times) 3.94 3.94 3.94 3.94
Insert Piped Text from question or custom value into question text 4.02 4.02 4.02 4.02
Insert Piped Text from question or custom value into question response 3.95 3.95 3.95 3.95
Carry Forward selected answers to populate future questions 4.11 4.11 4.11 4.11
Carry Forward response choices to populate responses on future questions 4.10 4.10 4.10 0.00
Soft-Require Question (request response) 4.41 4.41 4.41 0.00
Optional Progress Bar 4.21 4.21 4.21 4.21
API Integration 3.89 3.89 3.89 3.89
GoogleForm Integration 0.00 3.82 0.00 3.82
Email Triggers/Action Alerts (trigger emails automatically to be sent based on survey responses) 4.17 4.17 4.17 4.17
Contact List Triggers (add people to contact list automatically based on how they answer a survey) 3.98 3.98 0.00 0.00
Edit Next, Submit and Close Button Text 4.23 4.23 4.23 4.23
Hide Next, Submit or Close buttons 4.03 4.03 4.03 0.00
File Library 4.08 4.08 4.08 4.08
Import Surveys Questions from Word 0.00 4.28 4.28 4.28
Export Survey Questions to Word 4.39 4.39 4.39 0.00
Import/Export Survey file (to create a copy that can be exported/imported) 4.61 4.61 4.61 4.61
Copy Survey within survey tool 4.64 4.64 4.64 4.64
Revert to previous versions 4.16 4.16 4.16 0.00
Password/contact list  authentication within a survey (respondent must authenticate with credentials saved within the contact list to continue) 3.90 3.90 3.90 3.90
SSO authentication within a survey (respondent must authenticate with SSO credentials to continue) 3.78 3.78 3.78 3.78
Connect to Web Services (such as random number generator) 3.79 3.79 3.79 0.00
Survey Collaboration within the university/license/brand 4.30 4.30 4.30 4.30
Survey Collaboration outside of the university/license/brand 3.81 0.00 3.81 3.81
Numeric Entry 4.41 4.41 4.41 4.41
Constant Sum 3.56 3.56 3.56 0.00
Slider 3.63 3.63 3.63 3.63
Ranking 4.19 4.19 4.19 4.19
Descriptive Text/Graphic 4.46 4.46 4.46 4.46
Side by Side 3.99 3.99 3.99 3.99
Captcha 3.01 3.01 3.01 0.00
Signature 3.25 3.25 3.25 0.00
Closed Card Sort Questions 3.08 3.08 3.08 0.00
Image Heatmap Questions 2.88 2.88 2.88 0.00
Open Card Sort Questions 0.00 3.08 3.08 0.00
Semantic Differential Questions 3.28 3.28 3.28 3.28
Text Highlighter Questions 3.53 3.53 3.53 0.00
Choice Based Conjoint 3.35 3.35 3.35 0.00
Max Differential (Max Diff) Questions 3.39 3.39 3.39 0.00
Track Time Respondent Stays on a Question or  Page 3.55 3.55 3.55 0.00
Audio/Video Sentiment Questions (slider recording response while video/audio plays) 0.00 3.45 3.45 0.00
File Upload 4.07 4.07 4.07 4.07
Geo-Targeting/Tracking 3.20 3.20 3.20 3.20
Embed External Audio and Video Files 3.81 3.81 3.81 3.81
Real-time Responses 4.13 4.13 4.13 4.13
Filter Responses 4.39 4.39 4.39 4.39
Printed Reports 4.37 4.37 4.37 4.37
Cross-Tabulation Reporting 4.31 4.31 4.31 4.31
Variable Creation 4.20 4.20 0.00 0.00
Response Editing 4.05 4.05 4.05 4.05
View Reports Online 4.67 4.67 4.67 4.67
Export report from within offline report (pdf or other format) 4.51 4.51 4.51 4.51
Import response data from Excel or other format 4.26 4.26 4.26 0.00
Manually enter response data collected externally 4.16 4.16 4.16 4.16
Export reports (Please attach list formats) 4.67 4.67 4.67 4.67
Export Individual Charts 4.39 4.39 4.39 4.39
Create reports from multiple data sources 4.35 4.35 4.35 4.35
Share Contact Lists 4.19 4.19 4.19 4.19
Group organization 4.33 4.33 4.33 4.33
Social Media Distribution 3.88 3.88 3.88 3.88
SMS Distribution 3.60 3.60 0.00 0.00
Template Library of Messages 4.06 4.06 0.00 4.06
Embed First Question in Email Invite 3.29 3.29 3.29 3.29
Mobile First UX 3.95 3.95 3.95 3.95
Response Panels 2.86 2.86 2.86 2.86
URL Shortener 3.91 3.91 3.91 3.91
Survey Quotas 3.43 3.43 3.43 3.43
Schedule email invites and reminders 4.41 4.41 4.41 4.41
Schedule survey to close 4.53 4.53 4.53 4.53
Resend Link after completion 3.31 3.31 3.31 3.31
Send email to Contact List without survey link 3.32 0.00 3.32 3.32
Optional opt out link in emails 4.07 4.07 4.07 4.07
Stand alone Mobile App 4.17 4.17 0.00 4.17
Offline Mode 3.65 3.65 3.65 0.00
SPSS 4.32 4.32 4.32 4.32
PDF 4.31 4.31 4.31 4.31
PowerPoint 3.93 3.93 3.93 3.93
Change variable labels and/or response values before export 4.19 0.00 4.19 4.19
Select if data are exported as Response Text or Response Value 4.37 4.37 4.37 4.37
Export subset of data 4.43 4.43 4.43 4.43
Word Cloud 3.48 3.48 3.48 3.48
Tag Text Themes Manually 3.60 3.60 3.60 3.60
Sentiment Analysis 3.61 3.61 0.00 0.00
Nvivo Integration 3.77 0.00 3.77 3.77
Automatic Tagging 3.56 0.00 3.56 0.00
Admin Dashboard 3.95 3.95 3.95 3.95
Ability to log into user account through Admin account 3.95 0.00 3.95 0.00
Encryption at Rest 3.95 3.95 3.95 3.95
Email Support (for users and admins) 3.95 3.95 3.95 3.95
Chat Support (for users and admins) 3.95 3.95 3.95 0.00
Set up User Groups (for shared access) 3.95 3.95 3.95 3.95
Dedicated Account Manager 3.95 3.95 3.95 3.95
Migrate Survey and Respondent Data 4.25 4.25 4.25 4.25
Account Management 3.95 3.95 3.95 3.95
Ability to set email limits per account 3.95 0.00 3.95 0.00
Ability to set access to different question types 3.95 0.00 0.00 0.00
Multiple Admin Accounts 3.95 3.95 3.95 3.95
Auto creation of accounts (via SSO) 3.95 3.95 3.95 3.95
HIPAA Compliant 3.95 3.95 3.95 3.95
FERPA Compliant 3.95 3.95 3.95 3.95
ADA Compliance (Fully Accessible/508 Compliant) 3.95 3.95 3.95 3.95
SAS no. 70
PCI DSS 0.00 3.95 3.95 3.95
ISO 27001 3.95 3.95 0.00 3.95
OWASP 3.95 0.00 3.95 0.00
Data Encryption in transit 3.95 3.95 3.95 3.95
Data Encryption at rest 3.95 3.95 3.95 3.95
Data Encryption on all backups 3.95 3.95 3.95 3.95

Gartner’s 2018 Take on Data Science Tools

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2018 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the new section:

IT Research Firms

IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services, and then provide their analysis of each in reports they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. While these reports focus on companies, they often also describe how their commercial tools integrate open source tools such as R, Python, H2O, TensorFlow, and others.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.

Gartner, Inc. is one of the companies that provides such reports. Out of the roughly 100 companies selling data science software, Gartner selected 16 that had either high revenue, or lower revenue combined with high growth (see the full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Hereafter, I refer to these as simply vision and ability. Figure 3a shows the resulting “Magic Quadrant” plot for 2018, and 3b shows the plot for the previous year.

The Leader’s Quadrant is the place for companies whose future direction is in line with their customers’ needs and who have the resources to execute that vision. The further toward the upper-right corner, the better the combined score. KNIME is in the prime position, with H2O.ai showing greater vision but lower ability to execute. This year KNIME gained the ability to run H2O.ai algorithms, so these two may be viewed as complementary tools rather than outright competitors.

Alteryx and SAS have nearly the same combined scores, but note that Gartner studied only SAS Enterprise Miner and SAS Visual Analytics. The latter includes Visual Statistics, and Visual Data Mining and Machine Learning. Excluded was the SAS System itself since Gartner focuses on tools that are integrated. This lack of integration may explain SAS’ decline in vision from last year.

KNIME and RapidMiner are quite similar tools, as both are driven by an easy-to-use and reproducible workflow interface. Both offer free and open source versions, but the companies differ quite a lot in how committed they are to the open source concept. KNIME’s desktop version is free and open source, and the company says it will always be so. On the other hand, RapidMiner’s free version is limited by a cap on the amount of data it can analyze (10,000 cases), and new features usually come only via a commercial license. In the previous year’s Magic Quadrant, RapidMiner was slightly ahead, but now KNIME is in the lead.

Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Figure 3b. Gartner Magic Quadrant for Data Science Platforms 2017.

The companies in the Visionaries Quadrant are those that have good future plans but may not have the resources to execute that vision. Of these, IBM took a big hit by landing here after being in the Leader’s Quadrant for several years. Now they’re in a near-tie with Microsoft and Domino. Domino shot up from the bottom of that quadrant toward the top. They integrate many different open source and commercial software packages (e.g. SAS, MATLAB) into their Domino Data Science Platform. Databricks and Dataiku offer cloud-based analytics similar to Domino, though lacking its access to commercial tools.

Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision. MathWorks, the maker of MATLAB, continues to “stay the course” with its proprietary tools while most of the competition offers much better integration into the ever-expanding universe of open source tools. Tibco replaces Quest in this quadrant due to its purchase of Statistica. Whatever will become of the red-headed stepchild of data science? Statistica has been owned by four companies in four years (StatSoft, Dell, Quest, Tibco)! Users of the software have got to be considering other options. Tibco also purchased Alpine Data in 2017, accounting for its disappearance from Figure 3b to 3a.

Members of the Niche Players quadrant offer tools that are not as broadly applicable. Anaconda is new to Gartner coverage this year. It offers in-depth support for Python. SAP has a toolchain that Gartner calls “fragmented and ambiguous.”  Angoss was recently purchased by Datawatch. Gartner points out that after 20 years in business, Angoss has only 300 loyal customers. With competition fierce in the data science arena, one can’t help but wonder how long they’ll be around. Speaking of deathwatches, once the king of Big Data, Teradata has been hammered by competition from open source tools such as Hadoop and Spark. Teradata’s net income was higher in 2008 than it is today.

As of 2/26/2018, RapidMiner is giving away copies of the Gartner report here.

jamovi for R: Easy but Controversial

[An updated version of this post is located here.]

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert in their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally, since they require you only to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and BlueSky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternatively, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts the column automatically to Nominal Text. If the string was a typo, deleting it converts the column immediately back to numeric. I missed some features, such as searching for data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it automatically offsets the confidence intervals for each workshop to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple style with a plain background; minimal, which (oddly enough) adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections, but their dialog-box settings are not saved from session to session, and since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions looks like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

While the same comparison using the jamovi GUI, or its jmv package, would look like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

library("jmv")
ttestIS(
  data = mydata100,
  vars = c("pretest", "posttest"),  # run the tests on both variables
  group = "gender",                 # the grouping variable
  mann = TRUE,                      # also request the Mann-Whitney test
  meanDiff = TRUE)                  # also request the mean difference

In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
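For contrast, here is a sketch of R’s piecemeal style for that regression workflow, using the mydata100 data frame from above (the variable choices are illustrative):

fit <- lm(posttest ~ pretest, data = mydata100)  # fit the model
summary(fit)  # parameter estimates and t-tests
anova(fit)    # ANOVA table
predict(fit)  # predicted values
plot(fit)     # diagnostic plots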

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus and its R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), it lets you install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given the project’s overall goal.
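
In practice, that single step is just a standard CRAN installation. A minimal sketch:

install.packages("jmv")  # one step installs jmv plus the packages it combines
library("jmv")           # every analysis is then available as a function call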

Extending jamovi’s menus is done through add-on modules stored in an online repository called the jamovi Library. To see what’s available, you simply click the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018), but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy; creating publication-quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting the measurement level, taking logarithms, recoding, and the like cannot yet be done on a whole set of variables at once. This is on the developers’ to-do list.
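
By contrast, R can apply a transformation to a whole set of variables in one step. A minimal sketch, using the pretest and posttest variables from the earlier examples:

# Take logarithms of several variables at once
vars <- c("pretest", "posttest")
mydata100[vars] <- lapply(mydata100[vars], log)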

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-tests, contingency tables). R Commander offers this feature for correlation matrices (a feature I contributed), and it helps people understand that the problem of multiple testing is not limited to post-hoc comparisons (for which jamovi does offer corrected p-values).
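
For readers curious what such a correction looks like, here is a minimal base R sketch (my own illustration, not jamovi’s implementation) that tests several variables and then adjusts the resulting p-values:

# Run a t-test on each outcome, then correct for multiple testing
outcomes <- c("pretest", "posttest")
pvals <- sapply(outcomes, function(v)
 t.test(mydata100[[v]] ~ mydata100$gender)$p.value)
p.adjust(pvals, method = "holm")  # "bonferroni", "BH", etc. also available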

Though jamovi is only at version 0.8.1.2.0, I found just two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready,” and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combines the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!

Acknowledgements

Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.

Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, GraphPad Prism, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however, only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data were collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages (C, C++, C#, FORTRAN, MATLAB, Java, and Python) are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages runs from Apache Hadoop through Python, Statistica, Java, and Minitab, with counts slowly declining along the way.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s TensorFlow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig. 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold over the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times their usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science software (those used in fewer than 750 scholarly articles in 2016).

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from one article to five may demonstrate 400% growth, but is still of little interest.
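
For reference, the growth figures in the plot are standard year-over-year percent changes. A quick sketch, using counts that approximate R’s figures discussed below:

# Percent change from 2015 to 2016
pct_change <- function(n2015, n2016) 100 * (n2016 - n2015) / n2015
pct_change(36000, 41300)  # about 14.7, in line with R's growth reported below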

Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use has been in sharp decline since. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.

Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note, however, that due to changes in Google’s search algorithm, the trend lines have shifted before, as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down. Minitab’s growth also appears to have stalled in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.