
A Comparative Review of the BlueSky Statistics GUI for R

Introduction

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available, which includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as the R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

 

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky’s capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

 

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

 

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

Figure 1. The main BlueSky Application screen.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=strongly disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

To add rows, the Data tab is clearly labeled, “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end.”

To enter factor data, it’s best to leave it numeric such as 1 or 2, for male and female, then set the labels (which are called values using SPSS terminology) afterwards. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developers’ future plans include automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside, while displaying labels that may or may not be the same as the numbers they represent.
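Here is a minimal base R sketch of that oddity. The values are made up for illustration, and this is not code that BlueSky itself generates; it simply shows a factor storing integer codes inside while displaying labels.

```r
# Hypothetical example data: 1 = male, 2 = female
gender_codes <- c(1, 2, 2, 1)

# Attach labels to the numeric codes, turning the variable into a factor
gender <- factor(gender_codes, levels = c(1, 2), labels = c("male", "female"))

print(gender)              # displays the labels
print(as.integer(gender))  # displays the integer codes stored inside the factor
print(levels(gender))      # the labels, in code order
```

What you see in a data editor are the labels, while the integer codes are what R keeps internally.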

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to a date. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.
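For readers curious about what such a conversion looks like in R itself, here is a small base R sketch. It assumes the dates were typed in month/day/year order, and the variable names are hypothetical; the code that BlueSky’s Compute dialog writes may well differ.

```r
# Hypothetical dates entered as character data (month/day/year)
visit_chr <- c("8/31/2018", "9/15/2018")

# Convert the character strings to R's Date class
visit_date <- as.Date(visit_chr, format = "%m/%d/%Y")

print(class(visit_date))  # "Date"
print(diff(visit_date))   # date arithmetic now works on the converted values
```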

If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can switch data sets simply by clicking on a tab, and that data set’s window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.
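To make the distinction concrete, here is a rough base R sketch of the two styles of file. The data frames and file names are invented for illustration; this is how R itself behaves, not a description of BlueSky’s internals.

```r
# Two small, made-up data sets
survey <- data.frame(id = 1:3, q1 = c(4, 5, 3))
clinic <- data.frame(id = 1:2, visits = c(2, 7))

# A "compound" file: R allows several objects in one .RData file
compound_file <- tempfile(fileext = ".RData")
save(survey, clinic, file = compound_file)

# Loading into a fresh environment shows which objects the file contained
restored <- new.env()
loaded_names <- load(compound_file, envir = restored)
print(loaded_names)

# The one-data-set-per-file style simply saves a single object
single_file <- tempfile(fileext = ".RData")
save(survey, file = single_file)
```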

Figure 3. Output window showing standard journal-style tables. Syntax editor has been opened and is shown on right side.

 

Data Import

The open source version of BlueSky supports the following file formats, all located under “File> Open”:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • Excel (old and new xls file types)
  • Dbase’s DBF
  • SPSS (.sav)
  • SAS binary files (sas7bdat)
  • Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

  • Microsoft Access
  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • SQLite

 

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.
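As a rough illustration of what “transforming many variables at once” means in R code, here is a minimal base R sketch. The data and variable names are hypothetical, and a GUI would typically write something like this for you.

```r
# Hypothetical measurements
measurements <- data.frame(
  id     = 1:3,
  weight = c(10, 100, 1000),
  length = c(1, 10, 100)
)

# Name the variables once, then transform them all in a single step
vars_to_log <- c("weight", "length")
measurements[paste0("log10_", vars_to_log)] <- lapply(measurements[vars_to_log], log10)

print(measurements)
```

Adding a third or thirtieth variable only requires extending vars_to_log, which is the convenience the many-at-once dialogs aim to provide.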

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

  1. Missing Values
  2. Compute
  3. Bin Numeric Variables
  4. Recode (able to recode many at once)
  5. Make Factor Variable (able to convert many at once)
  6. Transpose
  7. Transform (able to transform many at once)
  8. Sample Dataset
  9. Delete Variables
  10. Standardize Variables (able to standardize many at once)
  11. Aggregate (outputs results to a new dataset)
  12. Aggregate (outputs results to a printed table)
  13. Subset (outputs to a new dataset)
  14. Subset (outputs results to a printed table)
  15. Merge Datasets
  16. Sort (outputs results to a new dataset)
  17. Sort (outputs results to a printed table)
  18. Reload Dataset from File
  19. Refresh Grid
  20. Concatenate Multiple Variables (handling missing values)
  21. Legacy (does same things but using base R code)
  22. Reshape (long to wide)
  23. Reshape (wide to long)
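
If you have not met reshaping before, here is a small sketch of the last two items using base R’s reshape function. The survey data and column names are made up, and I am not claiming this is the code BlueSky generates; it only shows what “wide to long” and back again mean.

```r
# Hypothetical wide data: one row per person, one column per survey item
wide <- data.frame(id = 1:2, q1 = c(4, 5), q2 = c(2, 3))

# Wide to long: one row per person per item
long <- reshape(
  wide,
  direction = "long",
  varying   = c("q1", "q2"),  # columns to stack
  v.names   = "score",        # name of the stacked value column
  timevar   = "item",         # name of the column recording which item it was
  times     = c("q1", "q2"),
  idvar     = "id"
)
print(long)

# Long back to wide: one score column per item
wide_again <- reshape(
  long,
  direction = "wide",
  idvar     = "id",
  timevar   = "item",
  v.names   = "score"
)
print(wide_again)
```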

Continued here…

A Comparative Review of the Deducer GUI for R

Introduction

Deducer is a free and open source Graphical User Interface for the R software, one that provides beginners a way to point-and-click their way through analyses. It also integrates into an environment designed to help programmers be more productive. Deducer is available on Windows, Mac, and Linux; there is no server version.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. However, the reviews will include a cursory description of the programming support that each GUI offers.

Figure 1. JGR console with Deducer menus (left) and Deducer data viewer (right).

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface specifically using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi, BlueSky, or RKWard, install in a single step. Others, such as the R Commander and Rattle, install in multiple steps. Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

Deducer’s installation is quite complex:

  1. If you haven’t already done so, install the Java JRE. If you’re on Windows, I recommend the Windows x64 64-bit version.
  2. Download and install R. Here too, you should need only the 64-bit version.
  3. Start R as an administrator, and from within it install Deducer and its companion IDE, the Java GUI for R (JGR, pronounced “jaguar”) using:
    install.packages(c("JGR", "Deducer", "DeducerExtras"))
  4. Start JGR by submitting the commands:
    library("JGR")
    JGR()
  5. Within the JGR Console, start Deducer by choosing “Packages & Data> Package Manager” and clicking the checkboxes labeled “loaded” and “default” in front of both “Deducer” and “Deducer Extras”, then close the box.
  6. If you wish to get publication-quality output, download and install DeducerRichOutput from here.
  7. Finally, if you wish to start Deducer by clicking an icon (instead of typing two R commands) download the JGR launcher from here. If you have problems getting this to work, start over while paying particular attention to where the instructions say, “as administrator.”

If your goal is to point-and-click your way through analyses, you probably won’t care for that much complexity. However, if your goal is to learn how to program in R, following those steps will help you on your way. Some of those steps are tasks you must learn when programming R.

 

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (e.g. RKWard) through moderate (e.g. jamovi) to very active (e.g. R Commander).

Deducer has been in existence since 2009, and during that time nine plug-ins have been developed. Unfortunately there is no single place to go to find them. On the GUI’s “Packages & Data> GUI Add-ons” menu you’ll find four of them. Others are available here. The complete list of plug-ins that I could find is here:

  1. DeducerExtras: An add-on package containing a variety of additional analysis dialogs. These include: Distribution quantiles, single/multiple sample proportion tests, paired t-test, Wilcoxon signed rank test, Levene’s test, Bartlett’s test, k-means clustering, Hierarchical clustering, factor analysis, and multi-dimensional scaling
  2. DeducerPlugInScaling: Reliability and factor analysis
  3. DeducerMMR: Moderated multiple regression and simple slopes analysis
  4. DeducerRichOutput: writes results into true word processing tables with fonts and formatting
  5. DeducerSpatial: A GUI for Spatial Data Analysis and Visualization
  6. RDSAnalyst: Respondent Driven Sampling
  7. gMCP: (Experimental) A graphical approach to sequentially rejective multiple test procedures
  8. RGG: (Experimental) A GUI Generator
  9. DeducerText: (Experimental) Text Mining
  10. DeducerHansel: (Experimental) An add-on package which covers many methods common in econometrics, including binary logit, binary probit, and tobit estimates, and various time-series, panel, and spatial data methods. The time-series methods include cointegration analysis.

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and Rattle, have you start R, then load a package from your library, then call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

On Deducer’s main web site, it recommends the following steps:

  1. Start R.
  2. Load the JGR package from your library by executing the command: “library("JGR")”.
  3. Start JGR by executing the command: “JGR()” and, if you followed the installation instructions above, JGR will start Deducer automatically. Both of the screens shown in Figure 1 will appear.

However, if you make it successfully through all seven installation steps described above, you can also start Deducer by double-clicking on the JGR Launcher icon.

 

Data Editor / Viewer

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

Deducer’s data editor is named Data Viewer. That can be confusing since many well-known software packages – including RStudio, the R Commander, and SAS Studio – use the term “viewer” for tools that let you see but not edit the data. The first time I used Deducer, I spent an embarrassing amount of time trying to find the “data editor” when it was right under my nose!

Figure 2. Deducer’s Data Viewer with the “Data View” tab selected (upper left). I have right-clicked on the variable name of “q2” and it displayed a menu of tasks to perform.

You can start Deducer’s Data Viewer by choosing “File> New Data”. You then provide a name, and click OK. You’ll see it execute a command like, “mydata <- data.frame()” but the Data Viewer may not show you an empty spreadsheet. It tends to lock onto your last data set, but you can choose the drop-down menu labeled “Data Set” to get to the name of the one you just started to create. An empty version of the screen shown in Figure 2 will appear.

You can start entering data immediately, though the variables will be named V1, V2,… at first. Numeric and character data will be fine, but don’t enter any other type of variables yet, such as dates. Before you go very far, it’s important to click on the “Variable View” tab and fill in your metadata, such as variable names, Type and Factor Level (see Figure 3). When the metadata are filled in, the data editor may wipe out any existing data! For example, if you enter some dates like “8/31/2018” it will be stored as character. If you then switch to the Variable View, and click on Type for that variable, and choose “Date” from the drop-down menu, the editor will delete the existing dates.

This combination of Data View/Variable View is a common one which was made popular by SPSS. In that software it offers great power by letting you copy metadata from one variable to dozens of others. So you might have survey data where 1=“Strongly Disagree”, 2=“Disagree”, … 5=“Strongly Agree”. SPSS would allow you to define this for one variable, then copy it and paste it into many others. Deducer’s Variable View does not allow that. You must work one variable at a time, which gets quite tedious.

To open an existing data set, choose “File> Open Data”. If it doesn’t appear in the Data Viewer window, choose it from the Data Set drop-down menu.

Figure 3. Deducer’s Data Viewer with the “Variable View” tab selected (upper left). This displays and lets us edit the metadata for the same data as shown in Figure 2.

Saving the data is done with the standard “File> Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, Deducer does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. However, you can also save or load multiple data sets by using JGR’s workspace save and open menu items. This strikes a good balance as beginners will relate to the simplicity of one-data-set-per-file, while advanced users will like the option to deal with more complex multi-object workspaces.

[Continued here…]

The Popularity of Point-and-Click GUIs for R

 

Point-and-click graphical user interfaces (GUIs) for R allow people to analyze data using the R software, without having to learn how to program in the R language. This is a brief look at how popular each one is. Knowing that a GUI is popular doesn’t mean it will meet your needs, but it does mean that it’s meeting the needs of many others. This may be helpful information when selecting the appropriate GUI for you, if programming is not your primary interest. For detailed information regarding what each GUI can do for you, and how it works, see my series of comparative reviews, which is currently in progress.

There are many ways to estimate the popularity of data science software, but one of the most accurate is by counting the number of downloads (see appendix for details). Figure 1 shows the monthly downloads of four of the six R GUIs that I’m reviewing (i.e. all that exist as far as I know). We can see that the R Commander (Rcmdr) is the most popular GUI, and it has had steady growth since its introduction. Next comes Rattle, which is more oriented towards machine learning tasks. It, too, has shown high popularity and steady growth.

The three lines at the bottom could use more “breathing room” so let’s look at them in their own plot.

Figure 1. Number of times each software was downloaded by month.

 

Figure 2 shows the same data as Figure 1, but with the two most popular GUIs removed to make room to study the remaining data. From it we can see that Deducer has been around for many more years than the other two. Downloads for Deducer grew steadily for a couple of years, then they leveled off. Its downloads appear to be declining slightly in recent years. jamovi (its name is not capitalized) has only been around for a brief period, and its growth has been very rapid. As you can see from my recent review, jamovi has many useful features.

Figure 2. Number of times the less popular GUIs were downloaded. (Same as Fig. 1, with the R Commander and Rattle removed).

The lowest (blue) line shows downloads for the jmv package, which contains all the functions used by the jamovi GUI. It allows programmers to write code instead of using the jamovi GUI. People who point-and-click their way through an analysis in jamovi can send their code to any R user, who would then use the jmv package to run it. Since most jamovi users would prefer to point-and-click their way through analyses, it makes sense that the jmv package has been downloaded many fewer times than jamovi itself.

Two GUIs are missing from this plot: RKWard and BlueSky Statistics. Neither of those is downloaded from CRAN, and I was unable to obtain data from the developers of those GUIs. However, knowing that RKWard has a similar number of point-and-click features as Deducer, one can deduce (heh!) that it might have a similar level of popularity. The BlueSky software has only recently appeared on the scene, especially with its current level of features, so I expect it too will be towards the bottom, but growing rapidly.

I’m nearly done with all my reviews, so stay tuned to see what the other GUIs offer.

Acknowledgements

Thanks to Guangchuang Yu for making the dlstats package which allowed me to collect data so easily. Thanks also to Jonathon Love, who provided the download data for jamovi, and to Josh Price for his helpful editorial advice.

Appendix: Where the Data Came From

I used R’s dlstats package, which makes quick work of gathering counts of monthly downloads of R packages from the Comprehensive R Archive Network (CRAN). CRAN consists of sites around the world called “mirrors” from which people can download R packages. When starting the download process, R asks you to choose a mirror that is close to your location. In the popular RStudio development environment for R, the default mirror is set to their own server, which is actually a worldwide network of mirrors. Since it’s the default download location in a very popular tool for R, its download data will give us a good idea of the relative popularity of each GUI. The absolute popularity will be greater, but to get that data I would have to gather data from all the other servers around the world. If you have time to do that, please send me the results!
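
For anyone who wants to try gathering similar counts, here is a minimal sketch rather than my exact script. It assumes the dlstats package is installed and that its cran_stats function accepts a vector of package names, as I understand its documentation; the helper name fetch_downloads and the choice of packages are my own for this example.

```r
# Packages named in this post that are distributed on CRAN
gui_packages <- c("Rcmdr", "rattle", "Deducer", "jmv")

# Hypothetical helper: returns monthly download counts, or NULL if
# dlstats is missing or the download server cannot be reached
fetch_downloads <- function(packages) {
  if (!requireNamespace("dlstats", quietly = TRUE)) {
    message("Install dlstats first with install.packages(\"dlstats\")")
    return(NULL)
  }
  # cran_stats() needs Internet access to query the download logs
  tryCatch(dlstats::cran_stats(packages), error = function(e) NULL)
}

# Only contact the server when running interactively
if (interactive()) {
  downloads <- fetch_downloads(gui_packages)
  if (!is.null(downloads)) print(head(downloads))
}
```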

A Comparative Review of the RKWard GUI for R

Introduction

RKWard is a free and open source Graphical User Interface for the R software, one that supports beginners looking to point-and-click their way through analyses, as well as advanced programmers. You can think of it as a blend of the menus and dialog boxes that R Commander offers combined with the programming support that RStudio provides. RKWard is available on Windows, Mac, and Linux.

This review is one of a series which aims to help non-programmers choose the Graphical User Interface (GUI) that is best for them. However, I do include a cursory overview of how RKWard helps you work with code. In most sections, I’ll begin with a brief description of the topic’s functionality and how GUIs differ in implementing it. Then I’ll cover how RKWard does it.

Figure 1. RKWard’s main control screen containing an open data editor window (big one), an open dialog box (right) and its output window (lower left).

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface specifically using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So GUI users are people who prefer using a GUI to perform their analyses. They often don’t have the time required to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or BlueSky Statistics, install in a single step. Others install in multiple steps, such as R Commander and Deducer. Advanced computer users often don’t appreciate how lost beginners can become while attempting even a single-step installation. I work at the University of Tennessee, and our HelpDesk is flooded with such calls at the beginning of each semester!

Installing RKWard on Windows is done in a single step since its installation file contains both R and RKWard. However, Mac and Linux users have a two-step process: installing R first, then downloading RKWard, which links up to the most recent version of R that it finds. Regardless of their operating system, RKWard users never need to learn how to start R, then execute the install.packages function, and then load a library. Installers for all three operating systems are available here.

The RKWard installer obtains the appropriate version of R, simplifying the installation and ensuring complete compatibility. However, if you already had a copy of R installed, depending on its version, you could end up with a second copy.

RKWard minimizes the size of its download by waiting to install some R packages until you actually try to use them for the first time. Then it prompts you, offering default settings that will get the package you need.

On Windows, the installation file is 136 megabytes in size.

 

Plug-ins

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling section of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, BlueSky, Deducer) through moderate (jamovi) to very active (R Commander).

Currently all plug-ins are included with the initial installation.  You can see them using the menu selection Settings> Configure Packages> Manage RKWard Plugins. There are only brief descriptions of what they do, but once installed, you can access the help files with a single click.

RKWard add-on modules are part of standard R packages and are distributed on CRAN. Their package descriptions include a field labeled, “enhances: rkward”. You can sort packages by that field in RKWard’s package installation dialog where they are displayed with the RKWard icon.

Continued here…

A Review of Qualtrics, QuestionPro, REDCap, SurveyGizmo, & SurveyMonkey

 Introduction

Web-based surveys offer a quick and effective way to collect data. Several companies sell software-as-a-service which makes the construction of surveys quite easy using only a web browser. At the University of Tennessee, we currently have a system-wide site license for Qualtrics.  Initial discussions suggested an intent from Qualtrics to more than double its price from the previous year. This article describes the process we followed to evaluate and select an alternative, in the interest of providing similar functionality at a reasonable cost.

In choosing the options to review, our first step was to search for online software reviews of tools that we knew were already in use by various groups across all UT campuses and institutes. If one of those provided similar features at a lower price, we could minimize our training and migration expenses. A summary of the reviews’ overall ratings is shown in Table 1.

Source          Qualtrics  QuestionPro  REDCap  SurveyGizmo  SurveyMonkey
Capterra        5          4.5          n/a     5            4.5
Comparisons     4.4        4.45         n/a     4.55         4.6
G2Crowd         4.4        n/a          4       4.5          4.5
GetApp          5          4.56         n/a     5            4.5
SMB Guru        4.25       n/a          n/a     4.5          3.75
TopTen Reviews  n/a        n/a          n/a     4.965        4.735
Mean Score      4.61       4.5          4       4.75         4.43

Table 1. Summary of web survey ratings from online review sites. Some scores are rescaled to range from 1 through 5. While most scores are available directly from the Source link, to get a score on the Comparisons site you must first choose a comparison of two products, then click on one of them to get its full review.
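
The Mean Score row can be checked with a few lines of R. This is only a sketch: it assumes the blank cells in Table 1 fall where the reported means imply they do, and it treats each blank as a missing value that is left out of that product’s average.

```r
# Ratings from Table 1; NA marks a site that did not rate that product
# (the placement of the NAs is an assumption inferred from the reported means)
ratings <- data.frame(
  source       = c("Capterra", "Comparisons", "G2Crowd", "GetApp",
                   "SMB Guru", "TopTen Reviews"),
  Qualtrics    = c(5,   4.4,  4.4, 5,    4.25, NA),
  QuestionPro  = c(4.5, 4.45, NA,  4.56, NA,   NA),
  REDCap       = c(NA,  NA,   4,   NA,   NA,   NA),
  SurveyGizmo  = c(5,   4.55, 4.5, 5,    4.5,  4.965),
  SurveyMonkey = c(4.5, 4.6,  4.5, 4.5,  3.75, 4.735)
)

# Average each product's ratings, skipping the missing ones
mean_scores <- colMeans(ratings[-1], na.rm = TRUE)
print(round(mean_scores, 2))
```

Note that the REDCap mean rests on a single rating, so it deserves less weight than the others.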

 

The tools were all highly rated, but we wondered if the raters needed as many features as we use for academic research. A search using Google Scholar confirmed that these were the five survey tools most widely used in scholarly research.

We read all we could find regarding the financial state of each company, read about what  employees said about them on Glassdoor.com, and searched for complaints of any sort regarding the companies or their products. We found no significant problems in any of those areas. The companies all seem to be growing quite rapidly.

We also considered using open source software. However, the most popular open source web survey tool is LimeSurvey, and our previous experience with it was not positive. We investigated reviews of LimeSurvey to see if it had improved since we used it last, but there were very few of them compared to the others.

Selection Process

We formed a committee of ten university faculty and staff, all with substantial survey design expertise. The committee compiled a list of 141 web survey features that we considered important. We then surveyed the university research community, including our current survey tool users, asking them to rate the importance of each feature on a 5-point Likert scale. The detailed table of features and importance ratings appears in the appendix.

We used the features list to write the specifications for a Request for Qualified Suppliers (RFQ-S), a type of Request for Proposal. We specified a 5-year contract including separate pricing for: individuals, groups (e.g. departments, colleges, or institutes), single campus, and multi-campuses; as well as internal use, external use with not-for-profit clients, and external use with for-profit clients.

External use is important because the University of Tennessee is a Carnegie Engaged University, which means one of its prime goals is to perform collaborative research with external organizations. The for-profit category was included because students like to solve the types of problems that companies provide. Our Institute for Public Service also occasionally does surveys for companies in Tennessee. Such use is often prohibited by academic software licenses.

The RFQ-S was sent to the five vendors shown in Table 1. To keep things comparable, REDCap Cloud was chosen as the vendor for REDCap. They offer the same type of software-as-a-service as the other vendors, though REDCap is also available for on-premises installations for free.

The bids we received covered an extremely wide range, with the highest price more than ten times larger than the lowest. Each of the responding vendors specified which of the 141 features they offered. Only one vendor, Qualtrics, stated that it was not acceptable to use their software for the benefit of any type of external organization. Qualtrics requires an expensive commercial license when third parties are involved.

We then tested each feature, verifying the companies’ claims. We considered rating each for ease-of-use and effectiveness, but found that if the feature was implemented at all, it was generally both easy to use and effective.

We then created a composite score which weighted each feature according to the ratings from the feature importance survey. Our purchasing department prefers 1,000-point scales, so we adjusted the scores accordingly. A perfect score of 1,000 would indicate software that offered every feature that a demanding person would describe as Very Important. The resulting scores are shown in Table 2. Readers can use the data in the appendix to develop their own scores. Since REDCap Cloud did not respond to our bid, we did not evaluate or score it. However, it does offer an extensive feature set including some very advanced features for database use and clinical trials. We are already using the free on-site version for projects that require those features, but it’s not as easy to use as the other products.

                                        Qualtrics  QuestionPro  SurveyGizmo  SurveyMonkey
Total number of features (out of 141)   134        132          130          111
Raw score                               541        529          524          456
Scaled score (x 1.418)                  767        750          744          647

Table 2. Feature counts and scores weighted by feature importance. Each feature is listed in the appendix. See vendors for pricing.
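
The weighting and scaling behind Table 2 can be sketched in R. The 1.418 multiplier is consistent with dividing 1,000 by the best possible raw score (141 features times a top rating of 5), but that reading is an assumption rather than something stated in the text, and the object names are hypothetical. Only three appendix features are included, to keep the example short.

```r
# Importance ratings for three appendix features; 0 means the tool lacks the feature.
features <- data.frame(
  Qualtrics    = c(4.65, 3.60, 3.59),
  QuestionPro  = c(4.65, 3.60, 3.59),
  SurveyGizmo  = c(4.65, 0.00, 3.59),
  SurveyMonkey = c(4.65, 0.00, 0.00),
  row.names = c("Skip Logic", "SMS Distribution", "Custom Javascript Support")
)

# Raw score: sum of the importance ratings of the features each tool offers.
subset_raw <- colSums(features)

# Assumed scale factor: 1,000 points divided by the best possible raw score.
scale_factor <- 1000 / (141 * 5)
round(scale_factor, 3)

# Apply the factor to the raw scores published in Table 2.
raw_score <- c(Qualtrics = 541, QuestionPro = 529,
               SurveyGizmo = 524, SurveyMonkey = 456)
scaled <- round(raw_score * scale_factor)
scaled
```

The scaled values agree with Table 2 to within one point; small differences are to be expected because the published raw scores are themselves rounded.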

 

QuestionPro and SurveyGizmo offered nearly identical feature sets to Qualtrics, with equivalent ease-of-use, at significantly lower prices. SurveyMonkey offered a smaller set of features, at a price that was much lower than Qualtrics, but still much higher than the others.

We called the three references that QuestionPro and SurveyGizmo had each provided. They all reported that the software worked well, had close to zero downtime, had technical support that was quick and competent, and they all reported planning to continue using their chosen vendor well into the future.

QuestionPro not only scored the highest of the alternatives to Qualtrics on the 141 attributes we initially focused on (Table 2), but it also currently offers advanced features that we had not even considered. The company’s future features roadmap is substantially more advanced than any other product currently on the market, and than anything we heard regarding the other vendors’ future plans. As a result, we decided to go with QuestionPro.

Migration Issues

We have changed web survey tools twice before, so we are well aware of the challenges involved. When moving to a new software platform, the software cost is important, but so are migration costs. For example, if we were to attempt a migration from SAS to R in one year’s time, the cost to train people and convert tens of thousands of programs would far outweigh the savings (at least at educational prices). However, survey software is relatively easy to learn, taking an hour to learn how to set up a typical survey. We estimate that a typical survey could be migrated in well under an hour. Complex ones might take much longer, especially those that must migrate longitudinal data along with the survey.

The Qualtrics administrative dashboard allows us to download extensive details about how our people have used the software over the last five years. The total number of accounts is intimidating, at just over  11,000. However, thousands of accounts contain only a single survey with a single response, indicating that the person was simply trying the software out. Thousands more accounts have not been logged into in years. We learned that 80% of surveys each year are created by new users who are starting from scratch and thus have no work to migrate.

We have developed preliminary estimates of migration effort from current usage, and they range between a total of 1,000 and 4,000 hours. Our support staff will be available to help users migrate their projects. We are developing a survey to get more details from current users to better assess migration needs. This will allow us to determine the number of projects that entail additional complexity such as multi-institutional collaboration or longitudinal studies that must maintain long-term data compatibility. We will also be recording the time it takes to move surveys and data to the new system. As we collect such data, we will be able to forecast how long the migration will take.

Conclusion

Our work resulted in the selection of a software package, QuestionPro, which is comparable to Qualtrics at a much lower price. Comparing our final expenditure to the initial price increase that our Qualtrics sales rep requested, the savings over the 5-year life of the contract is over $900,000. In addition, we now have a product that members of the university community can use to solve problems for companies, helping to engage UT more fully in the economy of Tennessee.

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter.

 

Appendix: Web survey tool features and their importance ratings

The importance of each feature (1=very low importance…5= very important) was determined by current users of web survey tools at the University of Tennessee. A rating of zero indicates that the software lacks that feature. This information was current as of 12/1/2017; the vendors all regularly add new features so check their web sites for their latest information.

Feature Qualtrics QuestionPro SurveyGizmo SurveyMonkey
Display Text (e.g. instructions) 4.46 4.46 4.46 4.46
Skip Logic 4.65 4.65 4.65 4.65
Display Logic 4.65 4.65 4.65 4.65
Required Questions / Required Answers 4.64 4.64 4.64 4.64
Redirect Browser at end of survey 3.83 3.83 3.83 3.83
Anonymous Responses (separate survey data from distribution/contact information) 4.51 4.51 4.51 4.51
Embedded Data/Hidden Values/Custom Values 3.85 3.85 3.85 3.85
Survey Collaboration 4.51 4.51 4.51 4.51
Single Response 4.68 4.68 4.68 4.68
Multiple Response 4.70 4.70 4.70 4.70
Likert Grid/Matrix 4.60 4.60 4.60 4.60
Open-ended text 4.60 4.60 4.60 4.60
Email distribution 4.72 4.72 4.72 4.72
Contact Management 4.50 4.50 4.50 4.50
Send Reminders 4.57 4.57 4.57 4.57
Export to CSV 4.69 4.69 4.69 4.69
Export to at least one statistics package (such as SAS, SPSS…) 4.51 4.51 4.51 4.51
SSO login 4.51 4.51 4.51 4.51
Phone Support (for users and admins) 4.51 4.51 4.51 4.51
Integrated Question Design Methodology Advice 0.00 0.00 0.00 3.96
Machine Learning to Improve Response Rate 0.00 3.84 0.00 3.84
Randomization of Questions 3.71 3.71 3.71 3.71
Randomization of Responses 3.57 3.57 3.57 3.57
A/B Test Questions (Split respondents between scenarios) 3.82 3.82 3.82 3.82
Multi-Lingual Surveys 3.68 3.68 3.68 3.68
Quiz Development 3.74 3.74 3.74 3.74
Score Survey 3.89 3.89 3.89 3.89
Custom Survey Templates 4.23 4.23 4.23 4.23
Branch Logic 4.62 4.62 4.62 4.62
Preview Survey for Standard Screen Size 4.58 4.58 4.58 4.58
Preview Survey for Phone Screen Size 4.58 4.58 4.58 4.58
Easy to Create Interface 3.95 3.95 3.95 3.95
Recode Values/Set Reporting Values (i.e. set Yes to 1 and No to 0 or recode on an unusual scale such as 0,1,3,5) 4.16 4.16 4.16 4.16
Custom Javascript Support 3.59 3.59 3.59 0.00
Loop&Merge / Page Piping (loop through the same set of questions a given number of times) 3.94 3.94 3.94 3.94
Insert Piped Text from question or custom value into question text 4.02 4.02 4.02 4.02
Insert Piped Text from question or custom value into question response 3.95 3.95 3.95 3.95
Carry Forward selected answers to populate future questions 4.11 4.11 4.11 4.11
Carry Forward response choices to populate responses on future questions 4.10 4.10 4.10 0.00
Soft-Require Question (request response) 4.41 4.41 4.41 0.00
Optional Progress Bar 4.21 4.21 4.21 4.21
API Integration 3.89 3.89 3.89 3.89
GoogleForm Integration 0.00 3.82 0.00 3.82
Email Triggers/Action Alerts (trigger emails automatically to be sent based on survey responses) 4.17 4.17 4.17 4.17
Contact List Triggers (add people to contact list automatically based on how they answer a survey) 3.98 3.98 0.00 0.00
Edit Next, Submit and Close Button Text 4.23 4.23 4.23 4.23
Hide Next, Submit or Close buttons 4.03 4.03 4.03 0.00
File Library 4.08 4.08 4.08 4.08
Import Surveys Questions from Word 0.00 4.28 4.28 4.28
Export Survey Questions to Word 4.39 4.39 4.39 0.00
Import/Export Survey file (to create a copy that can be exported/imported) 4.61 4.61 4.61 4.61
Copy Survey within survey tool 4.64 4.64 4.64 4.64
Revert to previous versions 4.16 4.16 4.16 0.00
Password/contact list  authentication within a survey (respondent must authenticate with credentials saved within the contact list to continue) 3.90 3.90 3.90 3.90
SSO authentication within a survey (respondent must authenticate with SSO credentials to continue) 3.78 3.78 3.78 3.78
Connect to Web Services (such as random number generator) 3.79 3.79 3.79 0.00
Survey Collaboration within the university/license/brand 4.30 4.30 4.30 4.30
Survey Collaboration outside of the university/license/brand 3.81 0.00 3.81 3.81
Numeric Entry 4.41 4.41 4.41 4.41
Constant Sum 3.56 3.56 3.56 0.00
Slider 3.63 3.63 3.63 3.63
Ranking 4.19 4.19 4.19 4.19
Descriptive Text/Graphic 4.46 4.46 4.46 4.46
Side by Side 3.99 3.99 3.99 3.99
Captcha 3.01 3.01 3.01 0.00
Signature 3.25 3.25 3.25 0.00
Closed Card Sort Questions 3.08 3.08 3.08 0.00
Image Heatmap Questions 2.88 2.88 2.88 0.00
Open Card Sort Questions 0.00 3.08 3.08 0.00
Semantic Differential Questions 3.28 3.28 3.28 3.28
Text Highlighter Questions 3.53 3.53 3.53 0.00
Choice Based Conjoint 3.35 3.35 3.35 0.00
Max Differential (Max Diff) Questions 3.39 3.39 3.39 0.00
Track Time Respondent Stays on a Question or  Page 3.55 3.55 3.55 0.00
Audio/Video Sentiment Questions (slider recording response while video/audio plays) 0.00 3.45 3.45 0.00
File Upload 4.07 4.07 4.07 4.07
Geo-Targeting/Tracking 3.20 3.20 3.20 3.20
Embed External Audio and Video Files 3.81 3.81 3.81 3.81
Real-time Responses 4.13 4.13 4.13 4.13
Filter Responses 4.39 4.39 4.39 4.39
Printed Reports 4.37 4.37 4.37 4.37
Cross-Tabulation Reporting 4.31 4.31 4.31 4.31
Variable Creation 4.20 4.20 0.00 0.00
Response Editing 4.05 4.05 4.05 4.05
View Reports Online 4.67 4.67 4.67 4.67
Export report from within offline report (pdf or other format) 4.51 4.51 4.51 4.51
Import response data from Excel or other format 4.26 4.26 4.26 0.00
Manually enter response data collected externally 4.16 4.16 4.16 4.16
Export reports (Please attach list formats) 4.67 4.67 4.67 4.67
Export Individual Charts 4.39 4.39 4.39 4.39
Create reports from multiple data sources 4.35 4.35 4.35 4.35
Share Contact Lists 4.19 4.19 4.19 4.19
Group organization 4.33 4.33 4.33 4.33
Social Media Distribution 3.88 3.88 3.88 3.88
SMS Distribution 3.60 3.60 0.00 0.00
Template Library of Messages 4.06 4.06 0.00 4.06
Embed First Question in Email Invite 3.29 3.29 3.29 3.29
Mobile First UX 3.95 3.95 3.95 3.95
Response Panels 2.86 2.86 2.86 2.86
URL Shortener 3.91 3.91 3.91 3.91
Survey Quotas 3.43 3.43 3.43 3.43
Schedule email invites and reminders 4.41 4.41 4.41 4.41
Schedule survey to close 4.53 4.53 4.53 4.53
Resend Link after completion 3.31 3.31 3.31 3.31
Send email to Contact List without survey link 3.32 0.00 3.32 3.32
Optional opt out link in emails 4.07 4.07 4.07 4.07
Stand alone Mobile App 4.17 4.17 0.00 4.17
Offline Mode 3.65 3.65 3.65 0.00
SPSS 4.32 4.32 4.32 4.32
PDF 4.31 4.31 4.31 4.31
PowerPoint 3.93 3.93 3.93 3.93
Change variable labels and/or response values before export 4.19 0.00 4.19 4.19
Select if data are exported as Response Text or Response Value 4.37 4.37 4.37 4.37
Export subset of data 4.43 4.43 4.43 4.43
Word Cloud 3.48 3.48 3.48 3.48
Tag Text Themes Manually 3.60 3.60 3.60 3.60
Sentiment Analysis 3.61 3.61 0.00 0.00
Nvivo Integration 3.77 0.00 3.77 3.77
Automatic Tagging 3.56 0.00 3.56 0.00
Admin Dashboard 3.95 3.95 3.95 3.95
Ability to log into user account through Admin account 3.95 0.00 3.95 0.00
Encryption at Rest 3.95 3.95 3.95 3.95
Email Support (for users and admins) 3.95 3.95 3.95 3.95
Chat Support (for users and admins) 3.95 3.95 3.95 0.00
Set up User Groups (for shared access) 3.95 3.95 3.95 3.95
Dedicated Account Manager 3.95 3.95 3.95 3.95
Migrate Survey and Respondent Data 4.25 4.25 4.25 4.25
Account Management 3.95 3.95 3.95 3.95
Ability to set email limits per account 3.95 0.00 3.95 0.00
Ability to set access to different question types 3.95 0.00 0.00 0.00
Multiple Admin Accounts 3.95 3.95 3.95 3.95
Auto creation of accounts (via SSO) 3.95 3.95 3.95 3.95
HIPAA Compliant 3.95 3.95 3.95 3.95
FERPA Compliant 3.95 3.95 3.95 3.95
ADA Compliance (Fully Accessible/508 Compliant) 3.95 3.95 3.95 3.95
SAS no. 70
PCI DSS 0.00 3.95 3.95 3.95
ISO 27001 3.95 3.95 0.00 3.95
OWASP 3.95 0.00 3.95 0.00
Data Encryption in transit 3.95 3.95 3.95 3.95
Data Encryption at rest 3.95 3.95 3.95 3.95
Data Encryption on all backups 3.95 3.95 3.95 3.95

Using Excel for Data Entry

This article shows you how to enter data so that you can easily open it in statistics packages such as R, SAS, SPSS, or jamovi (code or GUI steps below). Excel has some statistical analysis capabilities, but they often provide incorrect answers. For a comprehensive list of these limitations, see http://www.forecastingprinciples.com/paperpdf/McCullough.pdf and http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction.

Simple Data Sets

Most data sets are easy to enter using the following rules.

  • All your data should be in a single spreadsheet of a single file (for an exception to this rule, see Relational Data Sets below.)
  • Enter variable names in the first row of the spreadsheet.
  • Consider the length of your variable names. If you know for sure what software you will use, follow its rules for how many characters names can contain. When in doubt, use variable names that are no longer than 8 characters, beginning with a letter. Those short names can be used by any software.
  • Variable names should not contain spaces, but may use the underscore character.
  • No other text rows such as titles should be in the spreadsheet.
  • No blank rows should appear in the data.
  • Always include an ID variable on your original data collection form and in the spreadsheet to help you find the case again if you need to correct errors. You may need to sort the data later, after which the row number in Excel would then apply to a different subject or sampling unit, making it hard to find.
  • Position the ID variable in the left-most column for easy reference. 
  • If you have multiple groups, put them in the same spreadsheet along with a variable that indicates group membership (see Gender example below).
  • Many statistics packages don’t work well with alphabetic characters representing categorical values. For example, to enter political party, you might enter 1 instead of Democrat, 2 instead of Republican, and 3 instead of Other.
  • Avoid the use of special characters in numeric columns. Currency signs ($, €, etc.) can cause trouble in some programs.
  • If your group has only two levels, coding them 0 and 1 makes some analyses (e.g. linear regression) much easier to do. If the data are logical, use 0 for false, and 1 for true.
    If the data represent gender, it’s common to use 0 for female, 1 for male.
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, some software (like R) will read the column as character data so you will not be able to, for example, calculate the mean of a column without taking action to address the situation.
  • You can enter dates with slashes (8/31/2018) and times with colons (12:15 AM). Note that dates are recorded differently across countries, so make sure you are using a format that matches your locale.
  • For text analysis, you can enter up to 32K of text, or about 8 pages, in a single cell. However, if you cut & paste it from elsewhere, remove carriage returns first as they will cause it to jump to a new cell.
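
The rule about missing values can be illustrated with a small R sketch. It uses a few lines of comma separated text typed inline rather than a real Excel file, so the values and object names are illustrative assumptions; the point is only what happens to a numeric column when a period is typed into it.

```r
# Three rows of data in which one Income value was typed as a period.
csv_text <- "ID,Gender,Income
1,0,32000
2,1,.
3,0,137000"

# Read as typed: the period forces the whole Income column to character.
as_typed <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_typed$Income)

# Workaround: tell R that a period (or a blank) means missing.
fixed <- read.csv(text = csv_text, na.strings = c("", ".", "NA"),
                  stringsAsFactors = FALSE)
class(fixed$Income)

# The mean can now be calculated, skipping the missing value.
mean(fixed$Income, na.rm = TRUE)
```

Leaving the cell blank in the first place avoids the need for the na.strings workaround.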

Relational Data Sets

Some data sets contain observations that are related in some way. They may be people who all live in the same home, or samples that all came from the same site. There may be higher levels of relations, such as students within classrooms, then classrooms within schools. Data that contains such relations (a.k.a. nesting) may be stored in a “relational” database, but those are harder to learn than spreadsheet software. Relational data can easily be entered as two or more spreadsheets and combined later during data analysis. This saves quite a lot of data entry as the higher level data (e.g. family house value, socio-economic status, etc.) only needs to be entered once, instead of on several lines (e.g. for each family member).

If you have such data, make sure that each data set contains a “key” variable that acts as a  common ID number for family, site, school, etc. You can later read two files at a time and combine them matching on that key variable. R calls this combination a join or merge; SAS calls it a merge; and SPSS calls it Add Variables.
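
Combining two such files on a key variable might look like the following in R. This is a minimal sketch using base R's merge() function; the data frames, variable names, and values are all made up for illustration.

```r
# Lower-level data: one row per family member.
people <- data.frame(
  family_id = c(1, 1, 2),
  person_id = c(101, 102, 201),
  age       = c(34, 36, 52)
)

# Higher-level data: one row per family, entered only once.
families <- data.frame(
  family_id   = c(1, 2),
  house_value = c(250000, 180000)
)

# Match the two on the key variable; each person receives
# the values recorded for his or her family.
combined <- merge(people, families, by = "family_id")
combined
```

The house value was typed once per family, yet it appears on every family member's row in the combined data.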

Example of a Good Data Structure

This data set follows all the rules for simple data sets above. Any statistics software can read it easily.

ID  Gender  Income
1   0       32000
2   1       23000
3   0       137000
4   1       54000
5   1       48500

Example of a Bad Data Structure

This is the same data shown above, but it violates the rules for simple data sets in several ways: there is no column for gender, the income values contain dollar signs and commas, variable names appear on more than one line, variable names are not even consistent (income vs. salary), and there is a blank line in the middle. This would not be easy to read!

Data for Female Subjects
ID  Income
1   $32,000
3   $137,000

Data for Male Subjects
ID  Salary
2   $23,000
4   $54,000
5   $48,500
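
If you inherit data laid out like this, the repair can be sketched in R. The two blocks are typed in directly rather than read from a file, and the data frame names are hypothetical; the steps simply undo each rule violation in turn.

```r
# The two blocks as they appear in the bad layout.
female <- data.frame(ID = c(1, 3), Income = c("$32,000", "$137,000"),
                     stringsAsFactors = FALSE)
male   <- data.frame(ID = c(2, 4, 5), Salary = c("$23,000", "$54,000", "$48,500"),
                     stringsAsFactors = FALSE)

# Make the variable names consistent before stacking the blocks.
names(male)[names(male) == "Salary"] <- "Income"

# Add the missing group variable (0 = female, 1 = male).
female$Gender <- 0
male$Gender   <- 1

# Stack the blocks, strip dollar signs and commas, and convert to numbers.
tidy <- rbind(female, male)
tidy$Income <- as.numeric(gsub("[$,]", "", tidy$Income))

# Restore ID order and put the ID variable in the left-most column.
tidy <- tidy[order(tidy$ID), c("ID", "Gender", "Income")]
tidy
```

Every one of these steps is avoidable by entering the data in the good structure to begin with.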

Excel Tips for Data Entry

  • You can make sure your variable names are always visible at the top of your Excel spreadsheet by choosing View> Freeze Panes> Freeze Top Row. This helps you enter data in the proper columns.
  • Avoid using Excel to sort your data. It’s too easy to sort one column independent of the others, which destroys your data! Statistics packages can sort data and they understand the importance of keeping all the values in each row locked together.
  • If you need to enter a pattern of consecutive values such as an ID number with values such as 1,2,3 or 1001,1002,1003, enter the first two, select those cells, then drag the tiny square in the lower right corner as far downward as you wish. Excel will see the pattern of the first two entries and extend it as far as you drag your selection. This works for days of the week and dates too. You can create your own lists in Options>Lists, if you use a certain pattern often.
  • To help prevent typos, you can set minimum and maximum values, or create a list of valid values. Select a column or set of similar columns, then go to the Data tab, then the Data Tools group, and choose Validation. To set minimum and maximum values, choose Allow: Whole Number or Decimals and then fill in the values in the Minimum and Maximum boxes. To create a list of valid values, choose Allow: List and then fill in the numeric or character values separated by commas in the Source box. Note that these rules only operate as you enter data, they will not help you find improper values that you have already entered.
  • The gold standard for data accuracy is the dual entry method. With this method you actually enter all the data twice. Only this method can catch errors that are within the normal range of values, but still wrong. Excel can show you where the values differ. Enter the data first in Sheet1. Then enter it again using the exact same layout in Sheet2. Finally, in Sheet1 select all cells using CTRL-A. Then choose Conditional Formatting> New Rule. Choose “Use a formula to determine which cells to format,” enter this formula:
    =A1<>Sheet2!A1
    then click the Format button, make sure the Fill tab is selected, and choose a color. Then click OK twice. The inconsistencies between the two sheets will then be highlighted in Sheet1. You then check to see which entry was wrong and fix it. When you read the data into a statistics package, you will only need to read the data in Sheet1.
  • When looking for data errors, it can be very helpful to display only a subset of values. To do this, select all the columns you wish to scan for errors, then click the Filter icon on the Data tab. A downward-pointing triangle will appear at the top of each column selected. Clicking it displays a list of the values contained in that column. If you have entered values that are supposed to be, for example, between 1 and 5 and you see 6 on this list, choosing it will show you only those rows in which you made that error. Then you can fix them. You can also click on Number Filters to use simple logic to find, for example, all rows with values greater than 5. When you are finished, click on the filter icon again to turn it off.
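
The dual entry comparison can also be done after the fact in a statistics package. The sketch below shows one way in R, as an alternative to the conditional formatting rule rather than a replacement for it. The two data frames stand in for Sheet1 and Sheet2; their names and values are made up for illustration.

```r
# The same three records, keyed in twice. The second entry contains a typo.
entry1 <- data.frame(ID = 1:3, Gender = c(0, 1, 0),
                     Income = c(32000, 23000, 137000))
entry2 <- data.frame(ID = 1:3, Gender = c(0, 1, 0),
                     Income = c(32000, 28000, 137000))

# Compare cell by cell and list the row and column of each disagreement.
mismatch <- which(entry1 != entry2, arr.ind = TRUE)
mismatch

# Show the conflicting values side by side so the paper form can be checked.
data.frame(row    = mismatch[, "row"],
           column = names(entry1)[mismatch[, "col"]],
           first  = entry1[mismatch],
           second = entry2[mismatch])
```

As with the Excel approach, the comparison only tells you where the two entries disagree. You still need the original data collection form to decide which one is right.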

Backups

Save your data frequently and make backup copies often. Don’t leave all your backup copies connected to a computer which would leave them vulnerable to attack by viruses. Don’t store them all in the same building or you risk losing all your hard work in a fire or theft. Get a free account at http://drive.google.com, http://dropbox.com, or http://onedrive.live.com and save copies there.

 Steps for Reading Excel Data Into R

There are several ways to read an Excel file into R. Perhaps the easiest method uses the following commands. They read an Excel file named mydata.xlsx into an R data frame called mydata. For examples on how to read many other file formats into R, see:
http://r4stats.com/examples/data-import/.

# Do this once to install:
install.packages("readxl")

# Each time you read a file, follow these steps:
library("readxl")
mydata <- read_excel("mydata.xlsx")
# Print the data to check that it was read correctly
mydata

Steps for Reading Excel Data Into SPSS

  1. In SPSS, choose File> Open> Data.
  2. Change the “Files of type” box to “Excel (*.xlsx)”.
  3. When the Read Excel File box appears, select the Worksheet name and check the box for Read variable names from the first row of data, then click OK.
  4. When the data appears in the SPSS data editor spreadsheet, choose File> Save As and leave the Save as type box set to SPSS (*.sav).
  5. Enter the name of the file without the .sav extension and then click Save to save the file in SPSS format.
  6. Next time, open the .sav version, so you won’t need to convert the file again.
  7. If you create variable or value labels in the SPSS file and then need to read your data from Excel again you can copy them into the new file. First, make sure you use the same variable names. Next, after opening the file in SPSS, use Copy Data Properties from the Data menu. Simply name the SPSS file that has properties (such as labels) that you want to copy, check off the things you want to copy and click OK. 

Steps for Reading Excel Data Into SAS

The code below will read an Excel file called mydata.xlsx and store it as a permanent SAS dataset called sasuser.mydata. If your organization is considering migrating from SAS to R, I offer some tips here: http://r4stats.com/articles/migrate-to-r/

proc import datafile="mydata.xlsx"
dbms=xlsx out=sasuser.mydata replace;
getnames=yes;
run;

Steps for Reading Excel Data into jamovi

At the moment, jamovi can open CSV, JASP, SAS, SPSS, and Stata files, but not Excel. So you must open the data in Excel and Save As a comma separated value (CSV) file. The ability to read Excel files should be added to a release in the near future. For more information about the free and open source jamovi software, see my review here:
http://r4stats.com/2018/02/13/jamovi-for-r-easy-but-controversial/.

More to Come

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter where I announce my blog posts.

Gartner’s 2018 Take on Data Science Tools

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2018 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the new section:

IT Research Firms

IT research firms study software products and corporate strategies, survey customers regarding their satisfaction with the products and services, and then provide their analysis on each in reports they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. While these reports focus on companies, they often also describe how their commercial tools integrate open source tools such as R, Python, H2O, TensorFlow, and others.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.

Gartner, Inc. is one of the companies that provides such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue, or lower revenue combined with high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Hereafter, I refer to these as simply vision and ability. Figure 3a shows the resulting “Magic Quadrant” plot for 2018, and 3b shows the plot for the previous year.

The Leader’s Quadrant is the place for companies who have a future direction in line with their customers’ needs and the resources to execute that vision. The further to the upper-right corner, the better the combined score. KNIME is in the prime position, with H2O.ai showing greater vision but lower ability to execute. This year KNIME gained the ability to run H2O.ai algorithms, so these two may be viewed as complementary tools rather than outright competitors.

Alteryx and SAS have nearly the same combined scores, but note that Gartner studied only SAS Enterprise Miner and SAS Visual Analytics. The latter includes Visual Statistics, and Visual Data Mining and Machine Learning. Excluded was the SAS System itself since Gartner focuses on tools that are integrated. This lack of integration may explain SAS’ decline in vision from last year.

KNIME and RapidMiner are quite similar tools as they are both driven by an easy to use and reproducible workflow interface. Both offer free and open source versions, but the companies differ quite a lot on how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner is limited by a cap on the amount of data that it can analyze (10,000 cases) and as they add new features, they usually come only via a commercial license. In the previous year’s Magic Quadrant, RapidMiner was slightly ahead, but now KNIME is in the lead.

Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms
Figure 3b. Gartner Magic Quadrant for Data Science Platforms 2017.

The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision. Of these, IBM took a big hit by landing here after being in the Leader’s Quadrant for several years. Now they’re in a near-tie with Microsoft and Domino. Domino shot up from the bottom of that quadrant to near the top. They integrate many different open source and commercial software packages (e.g. SAS, MATLAB) into their Domino Data Science Platform. Databricks and Dataiku offer cloud-based analytics similar to Domino, though lacking in access to commercial tools.

Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision. MathWorks, the maker of MATLAB, continues to “stay the course” with its proprietary tools while most of the competition offers much better integration into the ever-expanding universe of open source tools. Tibco replaces Quest in this quadrant due to its purchase of Statistica. Whatever will become of the red-headed stepchild of data science? Statistica has been owned by four companies in four years (Statsoft, Dell, Quest, Tibco)! Users of the software have got to be considering other options. Tibco also purchased Alpine Data in 2017, accounting for Alpine’s disappearance between Figure 3b and Figure 3a.

Members of the Niche Players quadrant offer tools that are not as broadly applicable. Anaconda is new to Gartner coverage this year. It offers in-depth support for Python. SAP has a toolchain that Gartner calls “fragmented and ambiguous.”  Angoss was recently purchased by Datawatch. Gartner points out that after 20 years in business, Angoss has only 300 loyal customers. With competition fierce in the data science arena, one can’t help but wonder how long they’ll be around. Speaking of deathwatches, once the king of Big Data, Teradata has been hammered by competition from open source tools such as Hadoop and Spark. Teradata’s net income was higher in 2008 than it is today.

As of 2/26/2018, RapidMiner is giving away copies of the Gartner report here.

jamovi for R: Easy but Controversial

[An updated version of this post is located here.]

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally,  since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternately, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts it immediately back to numeric. I missed some features such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it offsets the confidence intervals for each workshop automatically to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which (oddly enough) adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections. However, the settings in the dialog boxes are not currently saved from session to session. Since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains: the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions would look like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

The same comparison using the jamovi GUI, or its jmv package, would look like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

library("jmv")
ttestIS(
 data = mydata100,
 vars = c("pretest", "posttest"),
 group = "gender",
 mann = TRUE,
 meanDiff = TRUE)

In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
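
To make the contrast concrete, here is a minimal sketch of the piecemeal base R workflow described above. It uses R’s built-in mtcars data set and the model mpg ~ wt purely as stand-ins (neither appears in this review); any data frame and formula would do.

```r
# Piecemeal base R regression: one function per step.
# mtcars and the mpg ~ wt model are illustrative assumptions only.
fit <- lm(mpg ~ wt, data = mtcars)   # 1. estimate the equation parameters

fit_summary <- summary(fit)          # 2. t-tests on the parameters
print(fit_summary)

fit_anova <- anova(fit)              # 3. ANOVA table
print(fit_anova)

fitted_values <- predict(fit)        # 4. predicted values
print(head(fitted_values))

# 5. diagnostic plots; when run non-interactively, draw to a null
#    device so that no file is written
if (!interactive()) pdf(NULL)
plot(fit, which = 1)                 # residuals vs. fitted values
if (!interactive()) invisible(dev.off())
```

Each step returns its own object, which is what makes the piecemeal style flexible for further analysis, and also what makes it harder to recall for occasional users.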

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus and R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), you can install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given their overall goal.

Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).
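
As a rough illustration of what such a correction involves, base R’s p.adjust function adjusts a whole set of p-values at once. The four p-values below are hypothetical, chosen only to show the mechanics; they are not results from jamovi or from the tests shown earlier.

```r
# Hypothetical p-values from four tests requested in one dialog box.
p_raw <- c(0.010, 0.020, 0.030, 0.040)

# Holm's step-down method controls the family-wise error rate and never
# gives larger adjusted p-values than the plain Bonferroni correction.
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Show raw and adjusted values side by side.
print(data.frame(p_raw, p_holm, p_bonf))
```

With corrections like these, a result that looks significant on its own may no longer be so once the number of tests is taken into account.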

Though jamovi is only at version 0.8.1.2.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready” and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy to use graphical user interface plus a set of functions that combines the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!

Acknowledgements

Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.

Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, GraphPad Prism, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s TensorFlow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig. 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science software (those used in fewer than 750 scholarly articles in 2016).

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth, but is still of little interest.
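
The year-over-year figure is the ordinary percent change, (new - old) / old * 100. A minimal sketch, using made-up package names and article counts rather than the real Google Scholar data, shows the calculation together with the 500-article cutoff:

```r
# Hypothetical article counts; not the actual data behind Figure 2c.
counts <- data.frame(
  software = c("PackageA", "PackageB", "PackageC"),
  y2015    = c(36000, 800, 40),
  y2016    = c(41400, 600, 100)
)

# Drop software with fewer than 500 articles in the base year, since
# large percentages on tiny counts are of little interest.
kept <- counts[counts$y2015 >= 500, ]

# Percent change from 2015 to 2016.
kept$pct_change <- (kept$y2016 - kept$y2015) / kept$y2015 * 100

print(kept)
```

Here the hypothetical PackageC would have shown 150% growth, but it is filtered out because its 2015 base of 40 articles is too small to be meaningful.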

 

Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.

 

Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down. Minitab’s growth also seems to have stalled in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.

Dueling Data Science Surveys: KDnuggets & Rexer Go Live

What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.

For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll.  Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.

Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use.  It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors.  You can take the Rexer Analytics survey here (use Access Code M7UY4).  Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.

As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).