Updated Comparison of R Graphical User Interfaces

I have just updated my detailed reviews of Graphical User Interfaces (GUIs) for R, so let’s compare them again. It’s not too difficult to rank them based on the number of features they offer, so let’s start there. I’m basing the counts on the number of dialog boxes in each of four categories:

  • Ease of Use
  • General Usability
  • Graphics
  • Analytics

This is trickier data to collect than you might think. Some software has fewer menu choices, depending instead on more detailed dialog boxes. Studying every menu and dialog box is very time-consuming, but that is what I’ve tried to do. I’m putting the details of each measure in the appendix so you can adjust the figures and create your own categories. If you decide to make your own graphs, I’d love to hear from you in the comments below.
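If you do build your own summary, a minimal sketch in R might look like the following. The file name and column names are assumptions about how you save the appendix data, not part of the appendix itself.

# A minimal sketch, assuming the appendix counts are saved in a CSV
# with (hypothetical) columns GUI, category, and count.
library(dplyr)
library(ggplot2)

features <- read.csv("gui_features.csv")

mean_ranks <- features %>%
  group_by(category) %>%
  mutate(rank = rank(count)) %>%            # larger rank = more features
  group_by(GUI) %>%
  summarize(mean_rank = mean(rank))

ggplot(mean_ranks, aes(x = reorder(GUI, mean_rank), y = mean_rank)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean rank across the four categories")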

Figure 1 shows how the various GUIs compare on the average rank of the four categories. R Commander is abbreviated Rcmdr, and R AnalyticFlow is abbreviated RAF. We see that BlueSky is in the lead with R-Instat close behind. As my detailed reviews of those two point out, they are extremely different pieces of software! Rather than spend more time on this summary plot, let’s examine the four categories separately.

Figure 1. Mean of each R GUI’s ranking of the four categories. To make this plot consistent with the others below, the larger the rank, the better.

I’ve defined the ease-of-use category mostly by how well each GUI does what GUI users are looking for: avoiding code. Each GUI gets one point for being able to install, start, and use it to its maximum effect, including publication-quality output, without knowing anything about the R language itself. Figure 2 shows the result. JASP comes out on top here, with jamovi and BlueSky right behind.

Figure 2. The number of ease-of-use features that each GUI has.

Figure 3 shows the general usability features each GUI offers. This category is dominated by data-wrangling capabilities, where data scientists and statisticians spend most of their time. This category also includes various types of data input and output. BlueSky and R-Instat come out on top not just due to their excellent selection of data wrangling features but also due to their use of the rio package for importing and exporting files. The rio package combines the import/export capabilities of many other packages, and it is easy to use. I expect the other GUIs will eventually adopt it, raising their scores by around 40 points. JASP shows up at the bottom of this plot due to its philosophy of encouraging users to prepare the data elsewhere before importing it into JASP.

Figure 3. Number of general usability features for each GUI.
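As mentioned above, BlueSky and R-Instat get a boost from the rio package. Here is a minimal sketch of what rio does; the file names are just examples.

# A minimal sketch of the rio package; the file names are hypothetical.
library(rio)

mydata <- import("survey.sav")        # rio picks the reader from the extension
export(mydata, "survey.xlsx")         # ...and the writer here
convert("survey.sav", "survey.csv")   # one-step file-format conversion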

Figure 4 shows the number of graphics features offered by each GUI. R-Instat has a solid lead in this category. In fact, this underestimates R-Instat’s ability if you…

Continued…

A Comparative Review of the R-Instat GUI for R

by Robert A. Muenchen

Introduction

R-Instat is a free and open source graphical user interface for the R software that focuses on people who want to point-and-click their way through data science analyses. Written in Visual Basic, it is currently only available for Microsoft Windows. However, a Linux version is in development using the cross-platform Mono implementation of the .NET framework.

This post is one of a series of reviews that aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Although I wrote the BlueSky User’s Guide, I hope to remain objective in these reviews. There is no one perfect user interface for everyone; each GUI for R has features that appeal to a different set of people.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

R-Instat is easy to install, requiring only a single step. It provides its own embedded copy of R. This simplifies the installation and ensures complete compatibility between R-Instat and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have R-Instat control any version of R you choose, but if the version differs too much, you may run into occasional problems.

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” that add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Rattle, Deducer) through medium (JASP 15) to high (jamovi 43, R Commander 43).

While the R-Instat project welcomes contributions from anyone, there are not any modules to add at this time. All of its capabilities are included in its initial installation.

Startup

Some user interfaces for R, such as jamovi or JASP, start by double-clicking on a single icon, which is great for people who prefer not to write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and then finally call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start R-Instat directly by double-clicking its icon from your desktop or choosing it from your Start Menu (i.e., not from within R).

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.
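In base R terms, the difference looks roughly like this; the object and file names are just examples.

# A rough sketch of the two approaches; object and file names are examples.
mydata <- data.frame(x = 1:3, y = c("a", "b", "c"))

# The "data set" model: one data frame per file.
saveRDS(mydata, "mydata.rds")
mydata <- readRDS("mydata.rds")

# The "workspace" model: many objects per file; you pick one after loading.
save.image("myworkspace.RData")
load("myworkspace.RData")   # restores everything that was in the workspace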

R-Instat starts up by showing its screen (Fig. 1). Under Start, I chose “New Data Frame” and it showed me the rather perplexing dialog shown in Fig. 2.

Figure 1. The R-Instat startup screen.

As an R user, I know what expressions are, but what did the R-Instat designers mean by the term?

Figure 2. The New Dataframe dialog.

Clicking the “Construct Examples” button brought up the suggestions shown in Fig. 3. These are standard R expressions, which came as quite a surprise! It seems that the R-Instat designers want to get people using R programming code immediately.

Figure 3. Examples R-Instat provides for expressions you can use to create a dataset.

Clicking the Help button brings up the advice, “the simplest option is Empty” (the developers say this will become the default in a future version). Clicking that button brings up a simple prompt for the number of rows and columns you would like to create. After that, you’re looking at a basic spreadsheet (Fig. 4) that easily lets you enter data. As you enter data, it determines if it is numeric or character. Scientific notation is accepted, but dates are saved as character variables. Logical values (TRUE, FALSE) are recognized as such and are stored appropriately.

Right-clicking on any column allows you to convert variables to be a factor, ordered factor, numeric, logical, or character. These changes are recorded as function calls to a custom “convert_column_to_type” function for reproducibility. Such interactive changes are not usually recorded by other R GUIs. Date/time conversion is not available on that menu, as that process is trickier. Those conversions are on the “Prepare> Column Date” menu item. Other things you can do from the right-click menu are: rename, duplicate, reorder, set levels/labels, sort, and filter/remove filter.

The class of each variable is indicated by a character code that follows each variable name in parentheses: (C) for character, (F) for factor, (O.F) for ordered factor, (D) for date, (L) for logical. When no code follows a variable name, it is numeric.

Figure 4. The R-Instat Data View (left) and Output Window (right).

The name of the dataset appears on a tab at the bottom of the Data View window. This lets you easily manage multiple datasets, an ability that is popular among professionals, but which is rarely offered in R GUIs (BlueSky and R Commander are the only others that offer it).

Once the dataset is saved, you can add new rows or columns at any position in the data frame by choosing “Prepare > Data Frame > Insert rows/columns”. New columns can be added with a specified default value, which can be a big time-saver when entering blocks of related data.

There is a quicker method for inserting new rows: right-click the row numbers, and a pop-up menu will allow you to insert rows above or below; the number of rows selected is the number of rows added – like in Excel.

When editing data, R-Instat lets you type new values on top of the old. As soon as you press the Enter key, it generates R code to execute the change. For example, in a language variable, when changing the value “English” to “Spanish,” it wrote,

Replace Value in Data
data_book$replace_value_in_data(data_name="wakefield", col_name="Language", rows="78", new_value="Spanish")

This is important for reproducibility, yet R-Instat is the only GUI reviewed here that records such manual changes. In fact, even among expensive proprietary software, Stata is the only one that I’m aware of that keeps track of such changes using code.
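For comparison, the hand-written base R equivalent of that recorded change would look something like this; the data frame below is a stand-in for illustration, not the real dataset.

# A rough base-R equivalent of the change R-Instat recorded above.
# "wakefield" is a stand-in data frame, not the real dataset.
wakefield <- data.frame(Language = rep("English", 100),
                        stringsAsFactors = FALSE)

wakefield$Language[78] <- "Spanish"   # the same edit: row 78 of Language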

If you have another data set to enter, you can restart the process by choosing “File> New Data…” again. You can switch data sets simply by clicking on a tab, and that window will pop to the front for you to see. When doing analyses, or saving data, the data set that is displayed in the editor does not influence what appears in dialog boxes. That means that you can be looking at one dataset while analyzing another! Since each dialog allows you to choose the dataset to use, that is technically not a problem, but if you have several datasets that contain the same variable names, remember that what you see may not be what you get! That’s the opposite of BlueSky Statistics, which automatically analyzes the dataset you see. R-Instat’s ability to work with multiple datasets in a single instance of the software is not a feature found in all R GUIs. For example, jamovi and JASP can only work with a single dataset at a time.

Saving the data is done with a fairly standard “File> Save As> Save Dataset As” menu. By default it will save all open datasets, filters, graphs, and models to a single file called a “data book.” That makes complex projects much easier to open and close.

Data Import

R-Instat supports the following file formats, most of which are automatically opened using “File> Import from File”. The ODK and NetCDF file formats have their own Import menus. R-Instat’s ability to open many formats related to climate science hints at what the software excels at. For details, see the Analysis Methods section below.

  1. Comma Separated Values (.csv)
  2. Plain text files (.txt)
  3. Excel (old and new xls file types)
  4. xBASE database files (dBase, etc.)
  5. SPSS (.sav)
  6. SAS binary files (sas7bdat and *.xpt)
  7. Standard R workspace files (RData, but it just opens one dataframe of its choosing)
  8. Open Data Kit (ODK)
  9. OpenRefine
  10. Network Common Data Form (NetCDF)
  11. SST Sea Surface Temperature formatted files
  12. IRI Data Library (API download)
  13. Climate Data Store (CDS) (API download)
  14. Shapefile
  15. Climsoft (Climatic database)
  16. .dly (ASCII files)
  17. .dat (ASCII files)
  18. Tab Separated Values (.tsv)
  19. Stata (.dta)
  20. JSON (.json)
  21. epiinfo (.rec)
  22. Minitab (.mtb)
  23. Systat (.syd). 
  24. CSV with a YAML metadata header (.csvy)
  25. Feather R/Python interchange format (.feather)
  26. Pipe separated files (.psv)
  27. YAML (.yml)
  28. Weka Attribute-Relation File Format (.arff)
  29. Data Interchange Format (.dif)
  30. OpenDocument Spreadsheet (*.ods)
  31. Shallow XML documents (*.xml)
  32. Single-table HTML documents (*.html)

Continued…

R GUI Update: BlueSky User’s Guide, New Features

The BlueSky Statistics graphical user interface (GUI) for the R language has added quite a few new features (described below). I’m also working on a BlueSky User Guide, a draft of which you can read about and download here. [Update: don’t download that, get the full Intro Guide download instead.] Although I’m spending a lot of time on BlueSky, I still plan to be as obsessive as ever about reviewing all (or nearly all) of the R GUIs, which is summarized here.

The new data management features in BlueSky are:

  • Date Order Check — this lets you quickly check across the dates stored in many variables, and it reports if it finds any rows whose dates are not always increasing from left to right.
  • Find Duplicates – generates a report of duplicates and saves a copy of the data set from which the duplicates are removed. Duplicates can be based on all variables, or a set of just ID variables.
  • Select First/Last Observation per Group – finding the first or last observation in a group lets you create new datasets from the “best” or “worst” case in each group, find the most current record, and so on.

Model Fitting / Tuning

One of the more interesting features in BlueSky is what its developers call Model Fitting and Model Tuning. Model Fitting gives you direct control over the R function that does the work. That provides precise control over every setting, and it can teach you the code that the menus create, but it also means that model tuning is up to you to do. However, it does standardize scoring so that you do not have to keep up with the wide range of parameters that each of those functions needs for scoring. Model Tuning controls models through the caret package, which lets you do things like K-fold cross-validation and model tuning. However, it does not allow control over every model setting.
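To give a sense of what the caret workflow involves, here is a minimal sketch of tuning a K nearest neighbors model with 10-fold cross-validation. This is an illustration of the underlying package, not the code BlueSky itself generates, and the data set and settings are just examples.

# A minimal caret sketch; an illustration only, not BlueSky's generated code.
library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

fit <- train(Species ~ ., data = iris,
             method = "knn",         # K nearest neighbors
             tuneLength = 5,         # try 5 values of k
             trControl = ctrl)

fit                        # accuracy for each candidate k
predict(fit, head(iris))   # scoring uses the same interface for every model type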

New Model Fitting menu items are:

  • Cox Proportional Hazards Model: Cox Single Model
  • Cox Multiple Models
  • Cox with Formula
  • Cox Stratified Model
  • Extreme Gradient Boosting
  • KNN
  • Mixed Models
  • Neural Nets: Multi-layer Perceptron
  • NeuralNets (i.e. the package of that name)
  • Quantile Regression

There are so many Model Tuning entries that it’s easier to just paste in the list from the main BlueSky review, which I updated earlier this morning:

  • Model Tuning: Adaboost Classification Trees
  • Model Tuning: Bagged Logic Regression
  • Model Tuning: Bayesian Ridge Regression
  • Model Tuning: Boosted trees: gbm
  • Model Tuning: Boosted trees: xgbtree
  • Model Tuning: Boosted trees: C5.0
  • Model Tuning: Bootstrap Resample
  • Model Tuning: Decision trees: C5.0tree
  • Model Tuning: Decision trees: ctree
  • Model Tuning: Decision trees: rpart (CART)
  • Model Tuning: K-fold Cross-Validation
  • Model Tuning: K Nearest Neighbors
  • Model Tuning: Leave One Out Cross-Validation
  • Model Tuning: Linear Regression: lm
  • Model Tuning: Linear Regression: lmStepAIC
  • Model Tuning: Logistic Regression: glm
  • Model Tuning: Logistic Regression: glmnet
  • Model Tuning: Multi-variate Adaptive Regression Splines (MARS via earth package)
  • Model Tuning: Naive Bayes
  • Model Tuning: Neural Network: nnet
  • Model Tuning: Neural Network: neuralnet
  • Model Tuning: Neural Network: dnn (Deep Neural Net)
  • Model Tuning: Neural Network: rbf
  • Model Tuning: Neural Network: mlp
  • Model Tuning: Random Forest: rf
  • Model Tuning: Random Forest: cforest (uses ctree algorithm)
  • Model Tuning: Random Forest: ranger
  • Model Tuning: Repeated K-fold Cross-Validation
  • Model Tuning: Robust Linear Regression: rlm
  • Model Tuning: Support Vector Machines: svmLinear
  • Model Tuning: Support Vector Machines: svmRadial
  • Model Tuning: Support Vector Machines: svmPoly

You can download the free open-source version from https://BlueSkyStatistics.com.

A Comparative Review of the BlueSky Statistics GUI for R

Introduction

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available which includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as the R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky’s capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer not to write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

Figure 1. The main BlueSky Application screen.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=strongly disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

To add rows, the Data tab is clearly labeled, “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end.”

To enter factor data, it’s best to leave it numeric, such as 1 or 2 for male and female, then set the labels (which are called values in SPSS terminology) afterwards. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developers’ future plans include automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside, while displaying labels that may or may not be the same as the numbers they represent.
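Here is a small illustration of that oddity:

# Factors are stored as integer codes while displaying labels.
sex <- factor(c(1, 2, 2, 1), levels = c(1, 2),
              labels = c("male", "female"))
sex              # male female female male
as.integer(sex)  # 1 2 2 1 -- the codes stored underneath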

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to a date. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.
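The conversion itself is a one-liner in R once the data are entered; the variable name and date format below are just examples.

# A minimal sketch of converting character data to dates; names are examples.
visit <- c("2023-01-15", "2023-02-20")
visit_date <- as.Date(visit, format = "%Y-%m-%d")
class(visit_date)   # "Date"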

If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can change data sets simply by clicking on its tab and its window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.

Figure 3. Output window showing standard journal-style tables. The syntax editor has been opened and is shown on the right side.

Data Import

The open source version of BlueSky supports the following file formats, all located under “File> Open”:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • Excel (old and new xls file types)
  • Dbase’s DBF
  • SPSS (.sav)
  • SAS binary files (sas7bdat)
  • Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

  • Microsoft Access
  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • SQLite

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, and biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard, handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.
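As a point of reference, here is the kind of many-variables-at-once step that a GUI can save you from writing by hand. This is a dplyr sketch with hypothetical survey item names.

# A sketch of transforming many variables at once; the item names are hypothetical.
library(dplyr)

survey <- data.frame(q1 = c(1, 5, 2), q2 = c(4, 2, 5), q3 = c(3, 3, 1))

recoded <- survey %>%
  mutate(across(q1:q3, ~ 6 - .x))   # reverse-code every 5-point item in one step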

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

  1. Missing Values
  2. Compute
  3. Bin Numeric Variables
  4. Recode (able to recode many at once)
  5. Make Factor Variable (able to convert many at once)
  6. Transpose
  7. Transform (able to transform many at once)
  8. Sample Dataset
  9. Delete Variables
  10. Standardize Variables (able to standardize many at once)
  11. Aggregate (outputs results to a new dataset)
  12. Aggregate (outputs results to a printed table)
  13. Subset (outputs to a new dataset)
  14. Subset (outputs results to a printed table)
  15. Merge Datasets
  16. Sort (outputs results to a new dataset)
  17. Sort (outputs results to a printed table)
  18. Reload Dataset from File
  19. Refresh Grid
  20. Concatenate Multiple Variables (handling missing values)
  21. Legacy (does the same things but using base R code)
  22. Reshape (long to wide)
  23. Reshape (wide to long)

Continued here…

A Review of Qualtrics, QuestionPro, REDCap, SurveyGizmo, & SurveyMonkey

Introduction

Web-based surveys offer a quick and effective way to collect data. Several companies sell software-as-a-service which makes the construction of surveys quite easy using only a web browser. At the University of Tennessee, we currently have a system-wide site license for Qualtrics. Initial discussions suggested that Qualtrics intended to more than double its price from the previous year. This article describes the process we followed to evaluate and select an alternative, in the interest of providing similar functionality at a reasonable cost.

In choosing the options to review, our first step was to search for online software reviews of tools that we knew were already in use by various groups across all UT campuses and institutes. If one of those provided similar features at a lower price, we could minimize our training and migration expenses. A summary of the reviews’ overall ratings is shown in Table 1.

Source            Qualtrics   QuestionPro   REDcap   SurveyGizmo   SurveyMonkey
Capterra          5           4.5                    5             4.5
Comparisons       4.4         4.45                   4.55          4.6
G2Crowd           4.4                       4        4.5           4.5
GetApp            5           4.56                   5             4.5
SMB Guru          4.25                               4.5           3.75
TopTen Reviews                                       4.965         4.735
Mean Score        4.61        4.5           4        4.75          4.43

Table 1. Summary of web survey ratings from online review sites. Some scores are rescaled to range from 1 through 5. While most scores are available directly from the Source link, to get a score on the Comparisons site you must first choose a comparison of two products, then click on one of them to get its full review.

The tools were all highly rated, but we wondered if the raters needed as many features as we use for academic research. A search using Google Scholar confirmed that these were the five survey tools most widely used in scholarly research.

We read all we could find regarding the financial state of each company, read what employees said about them on Glassdoor.com, and searched for complaints of any sort regarding the companies or their products. We found no significant problems in any of those areas. The companies all seem to be growing quite rapidly.

We also considered using open source software. However, the most popular open source web survey tool is LimeSurvey, and our previous experience with it was not positive. We investigated reviews of LimeSurvey to see if it had improved since we used it last, but there were very few of them compared to the others.

Selection Process

We formed a committee of ten university faculty and staff, all with substantial survey design expertise. The committee compiled a list of 141 web survey features that we considered important. We then surveyed the university research community, including our current survey tool users, asking them to rate the importance of each feature on a 5-point Likert scale. The detailed table of features and importance ratings appears in the appendix.

We used the features list to write the specifications for a Request for Qualified Suppliers (RFQ-S), a type of Request for Proposal. We specified a 5-year contract including separate pricing for: individuals, groups (e.g. departments, colleges, or institutes), single campus, and multi-campuses; as well as internal use, external use with not-for-profit clients, and external use with for-profit clients.

External use is important because the University of Tennessee is a Carnegie Engaged University, which means one of its prime goals is to perform collaborative research with external organizations. The for-profit category was included because students like to solve the types of problems that companies provide. Our Institute for Public Service also occasionally does surveys for companies in Tennessee. Such use is often prohibited from academic software licenses.

The RFQ-S was sent to the five vendors shown in Table 1. To keep things comparable, REDcap Cloud was chosen as the vendor for REDcap. They offer the same type of software-as-a-service as the other vendors, though REDcap is also available for on-premises installations for free.

The bids we received covered an extremely wide range, with the highest price more than ten times larger than the lowest. Each of the responding vendors specified which of the 141 features they offered. Only one vendor, Qualtrics, stated that it was not acceptable to use their software for the benefit of any type of external organizations. Qualtrics requires an expensive commercial license when third parties are involved.

We then tested each feature, verifying the companies’ claims. We considered rating each for ease-of-use and effectiveness, but found that if the feature was implemented at all, it was generally both easy to use and effective.

We then created a composite score which weighted each feature according to the ratings from the feature importance survey. Our purchasing department prefers 1,000-point scales, so we adjusted the scores accordingly. A perfect score of 1,000 would indicate software that offered every feature that a demanding person would describe as Very Important. The resulting scores are shown in Table 2. Readers can use the data in the appendix to develop their own scores. Since REDcap Cloud did not respond to our bid, we did not evaluate or score it. However, it does offer an extensive feature set, including some very advanced features for database use and clinical trials. We are already using the free on-site version for projects that require those features, but it’s not as easy to use as the other products.

                                          Qualtrics   QuestionPro   SurveyGizmo   SurveyMonkey
Total number of features (out of 141)     134         132           130           111
Raw Score                                 541         529           524           456
Scaled score (x 1.418)                    767         750           744           647

Table 2. Feature counts and scores weighted by feature importance. Each feature is listed in the appendix. See vendors for pricing.
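If you want to reproduce these scores from the appendix yourself, a rough sketch would be the following. The file and column names are assumptions (one importance-weighted column per product, with zeros where a feature is missing), not the actual file we used.

# A rough sketch of the composite scoring; file and column names are assumptions.
features <- read.csv("survey_features.csv")

raw    <- colSums(features[, c("Qualtrics", "QuestionPro",
                               "SurveyGizmo", "SurveyMonkey")])
scaled <- round(raw * 1.418)   # rescaled to the 1,000-point scale in Table 2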

QuestionPro and SurveyGizmo offered nearly identical feature sets to Qualtrics, with equivalent ease-of-use, at significantly lower prices. SurveyMonkey offered a smaller set of features, at a price that was much lower than Qualtrics, but still much higher than the others.

We called the three references that QuestionPro and SurveyGizmo had each provided. They all reported that the software worked well, had close to zero downtime, had technical support that was quick and competent, and they all reported planning to continue using their chosen vendor well into the future.

QuestionPro not only scored the highest on the 141 attributes we initially focused on, but it also currently offers advanced features that we had not even considered. The company’s future features roadmap is substantially more advanced than that of any other product currently on the market, and than anything we heard regarding the other vendors’ future plans. As a result, we decided to go with QuestionPro.

Migration Issues

We have changed web survey tools twice before, so we are well aware of the challenges involved. When moving to a new software platform, the software cost is important, but so are migration costs. For example, if we were to attempt a migration from SAS to R in one year’s time, the cost to train people and convert tens of thousands of programs would far outweigh the savings (at least at educational prices). However, survey software is relatively easy to learn, taking an hour to learn how to set up a typical survey. We estimate that a typical survey could be migrated in well under an hour. Complex ones might take much longer, especially those that must migrate longitudinal data along with the survey.

The Qualtrics administrative dashboard allows us to download extensive details about how our people have used the software over the last five years. The total number of accounts is intimidating, at just over  11,000. However, thousands of accounts contain only a single survey with a single response, indicating that the person was simply trying the software out. Thousands more accounts have not been logged into in years. We learned that 80% of surveys each year are created by new users who are starting from scratch and thus have no work to migrate.

We have developed preliminary estimates of migration effort from current usage, and they range between a total of 1,000 and 4,000 hours. Our support staff will be available to help users migrate their projects. We are developing a survey to get more details from current users to better assess migration needs. This will allow us to determine the number of projects that entail additional complexity such as multi-institutional collaboration or longitudinal studies that must maintain long-term data compatibility. We will also be recording the time it takes to move surveys and data to the new system. As we collect such data, we will be able to forecast how long the migration will take.

Conclusion

Our work resulted in the selection of a software package, QuestionPro, which is comparable to Qualtrics at a much lower price. Comparing our final expenditure to the initial price increase that our Qualtrics sales rep requested, the savings over the 5-year life of the contract is over $900,000. In addition, we now have a product that members of the university community can use to solve problems for companies, helping to engage UT more fully into the economy of Tennessee.

If you found this post useful, I invite you to check out many more on my website or follow me on Twitter.

Appendix: Web survey tool features and their importance ratings

The importance of each feature (1=very low importance…5= very important) was determined by current users of web survey tools at the University of Tennessee. A rating of zero indicates that the software lacks that feature. This information was current as of 12/1/2017; the vendors all regularly add new features so check their web sites for their latest information.

Feature Qualtrics QuestionPro SurveyGizmo SurveyMonkey
Display Text (e.g. instructions) 4.46 4.46 4.46 4.46
Skip Logic 4.65 4.65 4.65 4.65
Display Logic 4.65 4.65 4.65 4.65
Required Questions / Required Answers 4.64 4.64 4.64 4.64
Redirect Browser at end of survey 3.83 3.83 3.83 3.83
Anonymous Responses (separate survey data from distribution/contact information) 4.51 4.51 4.51 4.51
Embedded Data/Hidden Values/Custom Values 3.85 3.85 3.85 3.85
Survey Collaboration 4.51 4.51 4.51 4.51
Single Response 4.68 4.68 4.68 4.68
Multiple Response 4.70 4.70 4.70 4.70
Likert Grid/Matrix 4.60 4.60 4.60 4.60
Open-ended text 4.60 4.60 4.60 4.60
Email distribution 4.72 4.72 4.72 4.72
Contact Management 4.50 4.50 4.50 4.50
Send Reminders 4.57 4.57 4.57 4.57
Export to CSV 4.69 4.69 4.69 4.69
Export to at least one statistics package (such as SAS, SPSS…) 4.51 4.51 4.51 4.51
SSO login 4.51 4.51 4.51 4.51
Phone Support (for users and admins) 4.51 4.51 4.51 4.51
Integrated Question Design Methodology Advice 0.00 0.00 0.00 3.96
Machine Learning to Improve Response Rate 0.00 3.84 0.00 3.84
Randomization of Questions 3.71 3.71 3.71 3.71
Randomization of Responses 3.57 3.57 3.57 3.57
A/B Test Questions (Split respondents between scenarios) 3.82 3.82 3.82 3.82
Multi-Lingual Surveys 3.68 3.68 3.68 3.68
Quiz Development 3.74 3.74 3.74 3.74
Score Survey 3.89 3.89 3.89 3.89
Custom Survey Templates 4.23 4.23 4.23 4.23
Branch Logic 4.62 4.62 4.62 4.62
Preview Survey for Standard Screen Size 4.58 4.58 4.58 4.58
Preview Survey for Phone Screen Size 4.58 4.58 4.58 4.58
Easy to Create Interface 3.95 3.95 3.95 3.95
Recode Values/Set Reporting Values (i.e. set Yes to 1 and No to 0 or recode on an unusual scale such as 0,1,3,5) 4.16 4.16 4.16 4.16
Custom Javascript Support 3.59 3.59 3.59 0.00
Loop&Merge / Page Piping (loop through the same set of questions a given number of times) 3.94 3.94 3.94 3.94
Insert Piped Text from question or custom value into question text 4.02 4.02 4.02 4.02
Insert Piped Text from question or custom value into question response 3.95 3.95 3.95 3.95
Carry Forward selected answers to populate future questions 4.11 4.11 4.11 4.11
Carry Forward response choices to populate responses on future questions 4.10 4.10 4.10 0.00
Soft-Require Question (request response) 4.41 4.41 4.41 0.00
Optional Progress Bar 4.21 4.21 4.21 4.21
API Integration 3.89 3.89 3.89 3.89
GoogleForm Integration 0.00 3.82 0.00 3.82
Email Triggers/Action Alerts (trigger emails automatically to be sent based on survey responses) 4.17 4.17 4.17 4.17
Contact List Triggers (add people to contact list automatically based on how they answer a survey) 3.98 3.98 0.00 0.00
Edit Next, Submit and Close Button Text 4.23 4.23 4.23 4.23
Hide Next, Submit or Close buttons 4.03 4.03 4.03 0.00
File Library 4.08 4.08 4.08 4.08
Import Survey Questions from Word 0.00 4.28 4.28 4.28
Export Survey Questions to Word 4.39 4.39 4.39 0.00
Import/Export Survey file (to create a copy that can be exported/imported) 4.61 4.61 4.61 4.61
Copy Survey within survey tool 4.64 4.64 4.64 4.64
Revert to previous versions 4.16 4.16 4.16 0.00
Password/contact list  authentication within a survey (respondent must authenticate with credentials saved within the contact list to continue) 3.90 3.90 3.90 3.90
SSO authentication within a survey (respondent must authenticate with SSO credentials to continue) 3.78 3.78 3.78 3.78
Connect to Web Services (such as random number generator) 3.79 3.79 3.79 0.00
Survey Collaboration within the university/license/brand 4.30 4.30 4.30 4.30
Survey Collaboration outside of the university/license/brand 3.81 0.00 3.81 3.81
Numeric Entry 4.41 4.41 4.41 4.41
Constant Sum 3.56 3.56 3.56 0.00
Slider 3.63 3.63 3.63 3.63
Ranking 4.19 4.19 4.19 4.19
Descriptive Text/Graphic 4.46 4.46 4.46 4.46
Side by Side 3.99 3.99 3.99 3.99
Captcha 3.01 3.01 3.01 0.00
Signature 3.25 3.25 3.25 0.00
Closed Card Sort Questions 3.08 3.08 3.08 0.00
Image Heatmap Questions 2.88 2.88 2.88 0.00
Open Card Sort Questions 0.00 3.08 3.08 0.00
Semantic Differential Questions 3.28 3.28 3.28 3.28
Text Highlighter Questions 3.53 3.53 3.53 0.00
Choice Based Conjoint 3.35 3.35 3.35 0.00
Max Differential (Max Diff) Questions 3.39 3.39 3.39 0.00
Track Time Respondent Stays on a Question or  Page 3.55 3.55 3.55 0.00
Audio/Video Sentiment Questions (slider recording response while video/audio plays) 0.00 3.45 3.45 0.00
File Upload 4.07 4.07 4.07 4.07
Geo-Targeting/Tracking 3.20 3.20 3.20 3.20
Embed External Audio and Video Files 3.81 3.81 3.81 3.81
Real-time Responses 4.13 4.13 4.13 4.13
Filter Responses 4.39 4.39 4.39 4.39
Printed Reports 4.37 4.37 4.37 4.37
Cross-Tabulation Reporting 4.31 4.31 4.31 4.31
Variable Creation 4.20 4.20 0.00 0.00
Response Editing 4.05 4.05 4.05 4.05
View Reports Online 4.67 4.67 4.67 4.67
Export report from within offline report (pdf or other format) 4.51 4.51 4.51 4.51
Import response data from Excel or other format 4.26 4.26 4.26 0.00
Manually enter response data collected externally 4.16 4.16 4.16 4.16
Export reports (Please attach list formats) 4.67 4.67 4.67 4.67
Export Individual Charts 4.39 4.39 4.39 4.39
Create reports from multiple data sources 4.35 4.35 4.35 4.35
Share Contact Lists 4.19 4.19 4.19 4.19
Group organization 4.33 4.33 4.33 4.33
Social Media Distribution 3.88 3.88 3.88 3.88
SMS Distribution 3.60 3.60 0.00 0.00
Template Library of Messages 4.06 4.06 0.00 4.06
Embed First Question in Email Invite 3.29 3.29 3.29 3.29
Mobile First UX 3.95 3.95 3.95 3.95
Response Panels 2.86 2.86 2.86 2.86
URL Shortener 3.91 3.91 3.91 3.91
Survey Quotas 3.43 3.43 3.43 3.43
Schedule email invites and reminders 4.41 4.41 4.41 4.41
Schedule survey to close 4.53 4.53 4.53 4.53
Resend Link after completion 3.31 3.31 3.31 3.31
Send email to Contact List without survey link 3.32 0.00 3.32 3.32
Optional opt out link in emails 4.07 4.07 4.07 4.07
Stand alone Mobile App 4.17 4.17 0.00 4.17
Offline Mode 3.65 3.65 3.65 0.00
SPSS 4.32 4.32 4.32 4.32
PDF 4.31 4.31 4.31 4.31
PowerPoint 3.93 3.93 3.93 3.93
Change variable labels and/or response values before export 4.19 0.00 4.19 4.19
Select if data are exported as Response Text or Response Value 4.37 4.37 4.37 4.37
Export subset of data 4.43 4.43 4.43 4.43
Word Cloud 3.48 3.48 3.48 3.48
Tag Text Themes Manually 3.60 3.60 3.60 3.60
Sentiment Analysis 3.61 3.61 0.00 0.00
Nvivo Integration 3.77 0.00 3.77 3.77
Automatic Tagging 3.56 0.00 3.56 0.00
Admin Dashboard 3.95 3.95 3.95 3.95
Ability to log into user account through Admin account 3.95 0.00 3.95 0.00
Encryption at Rest 3.95 3.95 3.95 3.95
Email Support (for users and admins) 3.95 3.95 3.95 3.95
Chat Support (for users and admins) 3.95 3.95 3.95 0.00
Set up User Groups (for shared access) 3.95 3.95 3.95 3.95
Dedicated Account Manager 3.95 3.95 3.95 3.95
Migrate Survey and Respondent Data 4.25 4.25 4.25 4.25
Account Management 3.95 3.95 3.95 3.95
Ability to set email limits per account 3.95 0.00 3.95 0.00
Ability to set access to different question types 3.95 0.00 0.00 0.00
Multiple Admin Accounts 3.95 3.95 3.95 3.95
Auto creation of accounts (via SSO) 3.95 3.95 3.95 3.95
HIPAA Compliant 3.95 3.95 3.95 3.95
FERPA Compliant 3.95 3.95 3.95 3.95
ADA Compliance (Fully Accessible/508 Compliant) 3.95 3.95 3.95 3.95
SAS no. 70
PCI DSS 0.00 3.95 3.95 3.95
ISO 27001 3.95 3.95 0.00 3.95
OWASP 3.95 0.00 3.95 0.00
Data Encryption in transit 3.95 3.95 3.95 3.95
Data Encryption at rest 3.95 3.95 3.95 3.95
Data Encryption on all backups 3.95 3.95 3.95 3.95

Group-By Modeling in R Made Easy

There are several aspects of the R language that make it hard to learn, and repeating a model for groups in a data set used to be one of them. Here I briefly describe R’s built-in approach, show a much easier one, then refer you to a new approach described in the superb book,  R for Data Science, by Hadley Wickham and Garrett Grolemund.

For ease of comparison, I’ll use some of the same examples in that book. The gapminder data set contains a few measurements for countries around the world every five years from 1952 through 2007.

> library("gapminder")
> gapminder

# A tibble: 1,704 × 6
 country continent year lifeExp pop gdpPercap
 <fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
7 Afghanistan Asia 1982 39.854 12881816 978.0114
8 Afghanistan Asia 1987 40.822 13867957 852.3959
9 Afghanistan Asia 1992 41.674 16317921 649.3414
10 Afghanistan Asia 1997 41.763 22227415 635.3414
# ... with 1,694 more rows

Let’s create a simple regression model to predict life expectancy from year. We’ll start by looking at just New Zealand.

> library("tidyverse")
> nz <- filter(gapminder, 
+              country == "New Zealand")
> nz_model <- lm(lifeExp ~ year, data = nz)
> summary(nz_model)

Call:
lm(formula = lifeExp ~ year, data = nz)

Residuals:
 Min 1Q Median 3Q Max 
-1.28745 -0.63700 0.06345 0.64442 0.91192

Coefficients:
 Estimate Std. Error t value Pr(>|t|) 
(Intercept) -307.69963 26.63039 -11.55 4.17e-07 ***
year 0.19282 0.01345 14.33 5.41e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8043 on 10 degrees of freedom
Multiple R-squared: 0.9536, Adjusted R-squared: 0.9489 
F-statistic: 205.4 on 1 and 10 DF, p-value: 5.407e-08

If we had just a few countries, and we wanted to simply read the output (rather than processing it further) we could write a simple function and apply it using R’s built-in by() function. Here’s what that might look like:

my_lm <- function(df) {
  summary(lm(lifeExp ~ year, data = df))
}
by(gapminder, gapminder$country, my_lm)
...
----------------------------------------------- 
gapminder$country: Zimbabwe

Call:
lm(formula = lifeExp ~ year, data = df)

Residuals:
 Min 1Q Median 3Q Max 
-10.581 -4.870 -0.882 5.567 10.386 

Coefficients:
 Estimate Std. Error t value Pr(>|t|)
(Intercept) 236.79819 238.55797 0.993 0.344
year -0.09302 0.12051 -0.772 0.458

Residual standard error: 7.205 on 10 degrees of freedom
Multiple R-squared: 0.05623, Adjusted R-squared: -0.03814 
F-statistic: 0.5958 on 1 and 10 DF, p-value: 0.458

Since we have so many countries, that wasn’t very helpful. Much of the output scrolled out of sight (I’m showing only the results for the last one, Zimbabwe). But in a simpler case, that might have done just what you needed. It’s a bit more complex than how SAS or SPSS would do it since it required the creation of a function, but it’s not too difficult.

In our case, it would be much more helpful to save the output to a file for further processing. That’s when things get messy. We could use the str() function to study the structure of the output, then write another function to extract the pieces we need, then apply that function, then continue processing the result until we finally end up with a useful data frame of results. Altogether, that is a lot of work! To make matters worse, what you learned from all that is unlikely to generalize to a different function. The output’s structure, parameter names, and so on, are often unique to each of R’s modeling functions.

Luckily, David Robinson made a package called broom that simplifies all that. It has three ways to “clean up” a model, each diving more deeply into its details. Let’s see what it does with our model for New Zealand.

> library("broom")
> glance(nz_model)

  r.squared adj.r.squared sigma statistic p.value df
1 0.9535846 0.9489431 0.8043472 205.4459 5.407324e-08 2

   logLik      AIC    BIC   deviance df.residual
1 -13.32064 32.64128 34.096 6.469743     10

The glance() function gives us information about the entire model, and it puts it into a data frame with just one line of output. As we head towards doing a model for each country, you can imagine this will be a very convenient format.

To get a bit more detail, we can use broom’s tidy() function to clean up the parameter-level view.

> tidy(nz_model)
        term     estimate  std.error statistic p.value
1 (Intercept) -307.699628 26.63038965 -11.55445 4.166460e-07
2 year           0.192821  0.01345258  14.33339 5.407324e-08

Now we have a data frame with two rows, one for each model parameter, but getting this result was just as simple to do as the previous example.

The greatest level of model detail is provided by broom’s augment() function. This function adds observation-level detail to the original model data:

> augment(nz_model)

 lifeExp year  .fitted   .se.fit     .resid
1 69.390 1952 68.68692 0.4367774 0.70307692
2 70.260 1957 69.65103 0.3814859 0.60897203
3 71.240 1962 70.61513 0.3306617 0.62486713
4 71.520 1967 71.57924 0.2866904 -0.05923776
5 71.890 1972 72.54334 0.2531683 -0.65334266
6 72.220 1977 73.50745 0.2346180 -1.28744755
7 73.840 1982 74.47155 0.2346180 -0.63155245
8 74.320 1987 75.43566 0.2531683 -1.11565734
9 76.330 1992 76.39976 0.2866904 -0.06976224
10 77.550 1997 77.36387 0.3306617 0.18613287
11 79.110 2002 78.32797 0.3814859 0.78202797
12 80.204 2007 79.29208 0.4367774 0.91192308

        .hat    .sigma      .cooksd .std.resid
1 0.29487179 0.8006048 0.2265612817 1.04093898
2 0.22494172 0.8159022 0.1073195744 0.85997661
3 0.16899767 0.8164883 0.0738472863 0.85220295
4 0.12703963 0.8475929 0.0004520957 -0.07882389
5 0.09906760 0.8162209 0.0402635628 -0.85575882
6 0.08508159 0.7194198 0.1302005662 -1.67338100
7 0.08508159 0.8187927 0.0313308824 -0.82087061
8 0.09906760 0.7519001 0.1174064153 -1.46130610
9 0.12703963 0.8474910 0.0006270092 -0.09282813
10 0.16899767 0.8451201 0.0065524741 0.25385073
11 0.22494172 0.7944728 0.1769818895 1.10436232
12 0.29487179 0.7666941 0.3811504335 1.35014569

Using those functions was easy. Let’s now get them to work repeatedly for each country in the data set. The dplyr package, by Hadley Wickham and Romain Francois, provides an excellent set of tools for group-by processing. The dplyr package was loaded into memory as part of the tidyverse package used above. First we prepare the gapminder data set by using the group_by() function and telling it what variable(s) make up our groups:

> by_country <- 
+   group_by(gapminder, country)

Now any other function in the dplyr package will understand that the by_country data set contains groups, and it will process the groups separately when appropriate. However, we want to use the lm() function, and that does not understand what a grouped data frame is. Luckily, the dplyr package has a do() function that takes care of that problem, feeding any function only one group at a time. It uses the period “.” to represent each data frame in turn. The do() function wants the function it’s doing to return a data frame, but that’s exactly what broom’s functions do.

Let’s repeat the three broom functions, this time by country. We’ll start with glance().

> do(by_country, 
+    glance( 
+       lm(lifeExp ~ year, data = .)))

Source: local data frame [142 x 12]
Groups: country [142]

      country r.squared adj.r.squared sigma
    <fctr>      <dbl>      <dbl>    <dbl>
1 Afghanistan 0.9477123     0.9424835 1.2227880
2 Albania     0.9105778     0.9016355 1.9830615
3 Algeria     0.9851172     0.9836289 1.3230064
4 Angola      0.8878146     0.8765961 1.4070091
5 Argentina   0.9955681     0.9951249 0.2923072
6 Australia   0.9796477     0.9776125 0.6206086
7 Austria     0.9921340     0.9913474 0.4074094
8 Bahrain     0.9667398     0.9634138 1.6395865
9 Bangladesh  0.9893609     0.9882970 0.9766908
10 Belgium    0.9945406     0.9939946 0.2929025

# ... with 132 more rows, and 8 more variables:
# statistic <dbl>, p.value <dbl>, df <int>,
# logLik <dbl>, AIC <dbl>, BIC <dbl>,
# deviance <dbl>, df.residual <int>

Now rather than one row of output, we have a data frame with one row per country. Since it’s a data frame, we already know how to manage it. We could sort by R-squared, or correct the p-values for the number of models done using p.adjust(), and so on.
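For example, sorting the countries by fit and adjusting the p-values takes only a few more lines. This is a sketch building on the glance() output above.

# A sketch building on the glance() results: sort by R-squared and adjust
# the p-values for the 142 models we fit.
country_fits <- do(by_country, 
                   glance(lm(lifeExp ~ year, data = .)))

country_fits %>%
  ungroup() %>%
  mutate(p.adjusted = p.adjust(p.value, method = "holm")) %>%
  arrange(desc(r.squared))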

Next let’s look at the grouped parameter-level output that tidy() provides. This will be the same code as above, simply substituting tidy() where glance() had been.

> do(by_country, 
+    tidy( 
+      lm(lifeExp ~ year, data = .)))

Source: local data frame [284 x 6]
Groups: country [142]

country term estimate std.error
    <fctr>      <chr>        <dbl>     <dbl>
1 Afghanistan (Intercept) -507.5342716 40.484161954
2 Afghanistan year           0.2753287  0.020450934
3 Albania     (Intercept) -594.0725110 65.655359062
4 Albania     year           0.3346832  0.033166387
5 Algeria     (Intercept) -1067.8590396 43.802200843
6 Algeria     year            0.5692797  0.022127070
7 Angola      (Intercept)  -376.5047531 46.583370599
8 Angola      year            0.2093399 0.023532003
9 Argentina   (Intercept)  -389.6063445 9.677729641
10 Argentina  year            0.2317084 0.004888791

# ... with 274 more rows, and 2 more variables:
# statistic <dbl>, p.value <dbl>

Again, this is a simple data frame allowing us to do whatever we need without learning anything new. We can easily search for models that contain a specific parameter that is significant. In our organization, we search through salary models that contain many parameters to see if gender is an important predictor (hoping to find none, of course).
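For instance, keeping only the countries where the year effect is significant is a one-line filter. This is a sketch using the tidy() output above.

# A sketch using the tidy() results: keep countries whose slope on year
# is significant at the 0.05 level.
country_terms <- do(by_country, 
                    tidy(lm(lifeExp ~ year, data = .)))

filter(ungroup(country_terms), term == "year", p.value < 0.05)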

Finally, let’s augment the original model data by adding predicted values, residuals and so on. As you might expect, it’s the same code, this time with augment() replacing the tidy() function.

> do(by_country, 
+ augment( 
+ lm(lifeExp ~ year, data = .)))
Source: local data frame [1,704 x 10]
Groups: country [142]

   country   lifeExp year .fitted  .se.fit
    <fctr>     <dbl> <int> <dbl>    <dbl>
1 Afghanistan 28.801 1952 29.90729 0.6639995
2 Afghanistan 30.332 1957 31.28394 0.5799442
3 Afghanistan 31.997 1962 32.66058 0.5026799
4 Afghanistan 34.020 1967 34.03722 0.4358337
5 Afghanistan 36.088 1972 35.41387 0.3848726
6 Afghanistan 38.438 1977 36.79051 0.3566719
7 Afghanistan 39.854 1982 38.16716 0.3566719
8 Afghanistan 40.822 1987 39.54380 0.3848726
9 Afghanistan 41.674 1992 40.92044 0.4358337
10 Afghanistan 41.763 1997 42.29709 0.5026799

# ... with 1,694 more rows, and 5 more variables:
# .resid <dbl>, .hat <dbl>, .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>

If we were to pull out just the results for New Zealand, we would see that we got exactly the same answer in the group_by result as we did when we analyzed that country by itself.
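
Here is one way to check that, assuming the augmented results are saved as gapminder_augmented (as is done in the next code block) and that the original data frame is named gapminder:

# One country's rows from the grouped augment() output.
nz_grouped <- filter(gapminder_augmented, country == "New Zealand")

# The same model fitted to New Zealand by itself.
nz_alone <- augment(lm(lifeExp ~ year,
                       data = filter(gapminder, country == "New Zealand")))

# The fitted values (and likewise the residuals) should match.
all.equal(nz_grouped$.fitted, nz_alone$.fitted)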

We can save that augmented data to an object and use it to reproduce one of the residual plots from R for Data Science.

> gapminder_augmented <-
+ do(by_country, 
+   augment( 
+     lm(lifeExp ~ year, data = .)))
> ggplot(gapminder_augmented, aes(year, .resid)) +
+   geom_line(aes(group = country), alpha = 1 / 3) + 
+   geom_smooth(se = FALSE)

`geom_smooth()` using method = 'gam'

This plots the residuals of each country’s model by year by setting group = country in geom_line(), then follows that with a smoothed fit from geom_smooth() across all countries (the blue line) by leaving group = country out of that layer. That’s a clever approach that I hadn’t thought of before!

The broom package has done several very helpful things. As we have seen, it contains all the smarts needed to extract the important parts of models at three different levels of detail. It doesn’t just do this for linear regression though. R’s methods() function will show you what types of models broom’s functions are currently capable of handling:

> methods(tidy) 
 [1] tidy.aareg* 
 [2] tidy.acf* 
 [3] tidy.anova* 
 [4] tidy.aov* 
 [5] tidy.aovlist* 
 [6] tidy.Arima* 
 [7] tidy.betareg* 
 [8] tidy.biglm* 
 [9] tidy.binDesign* 
[10] tidy.binWidth* 
[11] tidy.boot* 
[12] tidy.brmsfit* 
[13] tidy.btergm* 
[14] tidy.cch* 
[15] tidy.character* 
[16] tidy.cld* 
[17] tidy.coeftest* 
[18] tidy.confint.glht* 
[19] tidy.coxph* 
[20] tidy.cv.glmnet* 
[21] tidy.data.frame* 
[22] tidy.default* 
[23] tidy.density* 
[24] tidy.dgCMatrix* 
[25] tidy.dgTMatrix* 
[26] tidy.dist* 
[27] tidy.ergm* 
[28] tidy.felm* 
[29] tidy.fitdistr* 
[30] tidy.ftable* 
[31] tidy.gam* 
[32] tidy.gamlss* 
[33] tidy.geeglm* 
[34] tidy.glht* 
[35] tidy.glmnet* 
[36] tidy.glmRob* 
[37] tidy.gmm* 
[38] tidy.htest* 
[39] tidy.kappa* 
[40] tidy.kde* 
[41] tidy.kmeans* 
[42] tidy.Line* 
[43] tidy.Lines* 
[44] tidy.list* 
[45] tidy.lm* 
[46] tidy.lme* 
[47] tidy.lmodel2* 
[48] tidy.lmRob* 
[49] tidy.logical* 
[50] tidy.lsmobj* 
[51] tidy.manova* 
[52] tidy.map* 
[53] tidy.matrix* 
[54] tidy.Mclust* 
[55] tidy.merMod* 
[56] tidy.mle2* 
[57] tidy.multinom* 
[58] tidy.nlrq* 
[59] tidy.nls* 
[60] tidy.NULL* 
[61] tidy.numeric* 
[62] tidy.orcutt* 
[63] tidy.pairwise.htest* 
[64] tidy.plm* 
[65] tidy.poLCA* 
[66] tidy.Polygon* 
[67] tidy.Polygons* 
[68] tidy.power.htest* 
[69] tidy.prcomp* 
[70] tidy.pyears* 
[71] tidy.rcorr* 
[72] tidy.ref.grid* 
[73] tidy.ridgelm* 
[74] tidy.rjags* 
[75] tidy.roc* 
[76] tidy.rowwise_df* 
[77] tidy.rq* 
[78] tidy.rqs* 
[79] tidy.sparseMatrix* 
[80] tidy.SpatialLinesDataFrame* 
[81] tidy.SpatialPolygons* 
[82] tidy.SpatialPolygonsDataFrame*
[83] tidy.spec* 
[84] tidy.stanfit* 
[85] tidy.stanreg* 
[86] tidy.summary.glht* 
[87] tidy.summary.lm* 
[88] tidy.summaryDefault* 
[89] tidy.survexp* 
[90] tidy.survfit* 
[91] tidy.survreg* 
[92] tidy.table* 
[93] tidy.tbl_df* 
[94] tidy.ts* 
[95] tidy.TukeyHSD* 
[96] tidy.zoo* 
see '?methods' for accessing help and source code

Each of those models contains similar information, but it is often stored in a completely different data structure and named slightly differently, even when the pieces are nearly identical. While that list covers a lot of model types, R has hundreds more. David Robinson, the package’s developer, encourages people to request additional ones by opening an issue on the broom GitHub repository.

I hope I’ve made a good case that group-by analyses in R can be done easily through the combination of dplyr’s do() function and broom’s three functions. That approach handles the great majority of group-by problems that I’ve seen in my 35-year career. However, if your needs are not met by this approach, then I encourage you to read Chapter 25 of R for Data Science (update: in the printed version of the book, it’s Chapter 20, Many Models with purrr and broom). But as the chapter warns, it will “stretch your brain!”
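
To give a flavor of that approach without reproducing the chapter, here is a rough purrr-style sketch that produces the same per-country glance() summaries. It is not the book’s code, just one way to combine split(), map(), and broom:

library("purrr")
library("dplyr")
library("broom")

gapminder %>%
  split(.$country) %>%                      # one data frame per country
  map(~ lm(lifeExp ~ year, data = .x)) %>%  # fit a model to each piece
  map_df(glance, .id = "country")           # bind the one-row summaries together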

If your organization is interested in a hands-on workshop that covers many similar topics, please drop me a line. Have fun with your data analyses!

R Training at Nicholls State University

I’ll be presenting two workshops back-to-back at Nicholls State University in Thibodaux, Louisiana, June 14-16. The first workshop will cover a broad range of R topics. Each topic will include a brief comparison of how R differs from the popular commercial data science packages SAS, SPSS, and Stata. If you have a background in any of those packages, you’ll learn the things that are likely to trip you up as you learn R. If not, you’ll still come away with a solid intro to R, along with how it compares to other software.


The second workshop will focus on the broad range of data management tasks that are usually needed to prepare your data for analysis. Both workshops will be done using the very latest R packages to make things as easy as possible. These packages include tibble, dplyr, magrittr, stringr, lubridate, broom, and more.

Seats are still available at no charge, with the Nicholls State community and other academics getting priority seating. To register for the workshop, please contact Professor Allyse Ferrara at 985/448-4736 or allyse.ferrara@nicholls.edu.

Adding the SPSS MEAN.n Function to R

SPSS contains a very useful set of functions that R lacks. If you’re lucky enough to have access to SPSS, you can use SPSS and R together very effectively. If not, it’s easy to add these functions to R. The functions perform calculations across the values within each observation. Rather than simply including or excluding missing values, they let you specify how many valid values are required; if there are fewer, the result is set to missing. For example, in SPSS, MEAN.5(Q1 TO Q10) asks for the mean only if at least five of the ten variables have valid values. Otherwise, the result will be a missing value. This “.n” extension is also available for SPSS’s SUM, SD, VARIANCE, MIN, and MAX functions.

Let’s now take a look at how to do this in R. First, we’ll create some data with a different number of missing values in each observation.

> q1 <- c(1, 1, 1)
> q2 <- c(2, 2, NA)
> q3 <- c(3, NA, NA)
> df <- data.frame(q1, q2, q3)
> df
  q1 q2 q3
1  1  2  3
2  1  2 NA
3  1 NA NA

R already has a mean() function, but it lacks a function to count the number of valid values. A common way to get that count in R is to use the is.na() function to generate a vector of TRUE/FALSE values, TRUE for missing and FALSE for non-missing, and then sum them. As in many software packages, R treats TRUE as the value 1 and FALSE as 0, so summing is.na() yields the number of missing values. The “!” symbol means “not” in R, so summing !is.na() yields the number of non-missing values. Here’s a function that does this:

> nvalid <- function(x) sum(!is.na(x))
> nvalid(q2)
[1] 2

So it has found that there are two valid values in q2. This nvalid() function obviously works on vectors, but we need to apply it to the rows of our data frame. We can select the first three variables using df[1:3] and pass the result to as.matrix() to make the rows easily accessible to R’s apply() function. The apply() function’s second argument is 1, indicating that we want to apply the function across rows (the value 2 would indicate columns). The remaining arguments are the function to apply and any arguments it needs.

> means  <- apply(as.matrix(df[1:3]), 1, mean, na.rm = TRUE)
> counts <- apply(as.matrix(df[1:3]), 1, nvalid)
> means
[1] 2.0 1.5 1.0
> counts
[1] 3 2 1

We have our means and the counts of valid values, so all that remains is to choose the number of valid values we require and accept the mean if the count meets or exceeds that number, returning a missing value (NA) if not. This can be done using the ifelse() function, whose first argument is the logical condition, followed by the value desired when TRUE, then the value when FALSE.

> means <- ifelse(counts >= 2, means, NA)
> means
[1] 2.0 1.5 NA

We’ve seen all the parts work, so all that remains is to put them together into a single function that has two arguments, one for the data frame and one for the n required.

mean.n <- function(df, n) {
  # Row means, ignoring missing values
  means  <- apply(as.matrix(df), 1, mean, na.rm = TRUE)
  # Count of valid (non-missing) values in each row
  nvalid <- apply(as.matrix(df), 1, function(x) sum(!is.na(x)))
  # Keep the mean only when at least n valid values are present
  ifelse(nvalid >= n, means, NA)
}

Let’s test our function requiring 1, 2 and 3 valid values.

> df$mean1 <- mean.n(df[1:3], 1)
> df$mean2 <- mean.n(df[1:3], 2)
> df$mean3 <- mean.n(df[1:3], 3)
> df
  q1 q2 q3 mean1 mean2 mean3
1  1  2  3   2.0   2.0     2
2  1  2 NA   1.5   1.5    NA
3  1 NA NA   1.0    NA    NA

That looks good. You could apply the same idea to various other R functions, such as sd() or var(). You could also apply it to sum() as SPSS does, but I rarely do that. If you were creating a scale score from a set of survey Likert items measuring agreement and a person replied “strongly agree” (a value of 5) to only half the items but skipped the others, would you want the resulting score to be a neutral value, as the sum would imply, or “strongly agree,” as the mean would indicate? The mean makes much more sense in most situations. Be careful, though, as there are standardized tests that require the use of the sum.
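
For example, here is a sketch of a standard-deviation version built the same way (the name sd.n is mine, following SPSS’s naming convention):

sd.n <- function(df, n) {
  # Row standard deviations, ignoring missing values
  sds    <- apply(as.matrix(df), 1, sd, na.rm = TRUE)
  # Count of valid (non-missing) values in each row
  nvalid <- apply(as.matrix(df), 1, function(x) sum(!is.na(x)))
  # Keep the result only when at least n valid values are present
  ifelse(nvalid >= n, sds, NA)
}

df$sd2 <- sd.n(df[1:3], 2)  # NA for the observation with only one valid value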

If you’re an SPSS user looking to learn just enough R to use the two together, you might want to read this, or to learn more you could take one of my workshops. If you really want to dive into the details, you might consider reading my book, R for SAS and SPSS Users.

R Workshops Updated to Include the Latest Packages

Two new R packages are quickly becoming standards in the R community: Hadley Wickham’s dplyr and tidyr. The dplyr package almost completely replaces his popular plyr package for data manipulation. Most importantly for general R use, it makes it much easier to select variables.

R workshop series presented at a major pharmaceutical company. Photography by Stephen Bernard.

For example, if your data included variables for race, gender, pretest, posttest, and four survey items q1 through q4, you could select various sets of variables using:

library("dplyr")
select(mydata, race, gender) # Just those two variables.
select(mydata, gender:posttest)   # From gender through posttest.
select(mydata, contains("test"))  # Gets pretest & posttest.
select(mydata, starts_with("q"))  # Gets all vars starting with "q".
select(mydata, ends_with("test")) # All vars ending with "test".
select(mydata, num_range("q", 1:4)) # q1 thru q4 regardless of location.
select(mydata, matches("^q"))  # Vars matching a regular expression (here, names starting with "q").

As I show in my books, these were all possible in R before, but they required much more programming.
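
For comparison, here is roughly what a few of those selections look like in base R, assuming the same hypothetical mydata:

mydata[ , c("race", "gender")]              # two variables by name
mydata[ , grep("test$", names(mydata))]     # variables ending in "test"
mydata[ , grep("^q[1-4]$", names(mydata))]  # q1 through q4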

The tidyr package replaces Hadley’s popular reshape and reshape2 packages with a data reshaping approach that is simpler and focused solely on the reshaping process, especially converting from “wide” to “long” form and back.
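
As a small sketch of tidyr’s approach, again using the hypothetical mydata from above, gather() stacks the q1 through q4 items into “long” form and spread() reverses it:

library("tidyr")
library("dplyr")

# Add a respondent id so the data can be spread back to wide form later.
mydata <- mutate(mydata, id = row_number())

long <- gather(mydata, item, score, q1:q4)  # wide to long: one row per person per item
wide <- spread(long, item, score)           # long back to wide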

I’ve integrated dplyr into my workshop R for SAS, SPSS and Stata Users, and both tidyr and dplyr now play extensive roles in my Managing Data with R workshop. I’m presenting the next Virtual Instructor-led Classroom (webinar) version of those workshops in partnership with Revolution Analytics during the week of October 6, 2014. I’m also available to teach them at your organization’s site in partnership with RStudio.com (contact me at Muenchen.bob@gmail.com to schedule a visit). These workshops will also soon be available 24/7 at Datacamp.com. “You’ll be able to take Bob’s popular workshops using an interactive combination of video and live exercises in the comfort of your own browser,” said Jonathan Cornelissen, CEO of Datacamp.com.

Learn to Manage Data at useR! 2014 or Online April 25

Before you can analyze data, it must be in the right form. Join me on April 25th for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples using R’s popular add-on packages such as plyr, reshape, stringr, lubridate, and sqldf. I’ll also be presenting a 3-hour version at the useR! 2014 conference. Here’s a list of the topics covered:

  1. Transformation basics
  2. Conditional transformations
  3. Summarization of columns and rows
  4. Summarization by group
  5. Analysis by group
  6. Sorting data
  7. Selecting first or last observation per group
  8. Miscellaneous variable tools (rename, keep, drop)
  9. Stacking data frames
  10. Finding and removing duplicate observations
  11. Merging data frames
  12. Reshaping data frames
  13. Character string manipulations
  14. Date / time manipulations (not in shorter useR! presentation)
  15. Using SQL within R (not in shorter useR! presentation)

Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example which you have already seen.

At the end of the workshop, you will receive a set of practice exercises to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to answer questions about those exercises or any other topics in my workshops or books. I hope to see you there!