*by Robert A. Muenchen*

**Introduction**

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available; it includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

**Terminology**

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, *GUI users* are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. *IDE users* are people who prefer to write R code to perform their analyses.

**Installation**

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

**Plug-in Modules**

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. Community members contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky’s capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

**Startup**

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

**Data Editor**

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data" and “Variables." The “Data" tab is shown by default, but clicking on the “Variables" tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=Strongly Disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right-click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O" when meaning to type a zero “0", or the letter “I" instead of number one “1".

To add rows, click the spot on the Data tab that is clearly labeled “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end."

To enter factor data, it’s best to leave it numeric such as 1 or 2, for male and female, then set the labels (which are called values using SPSS terminology) afterward. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developer’s future plans include the automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside while displaying labels that may or may not be the same as the numbers they represent.
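
That oddity is easy to see in plain R. The following is a minimal sketch using made-up values; the variable names and labels are assumptions for illustration, not something BlueSky generates.

```r
# Hypothetical data: 1 = male, 2 = female
gender_codes <- c(1, 2, 2, 1)
gender <- factor(gender_codes, levels = c(1, 2), labels = c("male", "female"))

labels_shown <- as.character(gender)  # the labels a data editor would display
codes_stored <- as.integer(gender)    # the integer codes R keeps internally

# A factor whose labels are themselves numbers: codes and labels now differ
scores <- factor(c(10, 20, 10))
score_codes  <- as.integer(scores)                # internal codes, not 10 and 20
score_values <- as.numeric(as.character(scores))  # the numbers the labels represent
```

Printing `score_codes` next to `score_values` shows why converting a factor straight to a number can give surprising results: the conversion returns the internal codes unless you go through the labels first.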

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to the date format. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.
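
For readers curious about what such a conversion looks like in R itself, here is a small sketch in base R. The variable names and date values are hypothetical, and this is not the code that BlueSky’s Compute dialog writes.

```r
# Hypothetical dates typed in as character data
visit_chr <- c("2018-03-01", "2018-03-15")

# Convert the text to R's Date class, stating the format explicitly
visit_date <- as.Date(visit_chr, format = "%Y-%m-%d")

# Once stored as dates, arithmetic such as elapsed days becomes possible
days_between <- as.numeric(diff(visit_date))
```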

If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can change data sets simply by clicking on a data set’s tab, and its window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As" menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.
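
To make the idea of a compound file concrete, here is a base R sketch of saving several objects to one .RData file and then finding the data frames inside it. The object names are invented for the example, and how BlueSky detects data frames internally is not something this sketch claims to show.

```r
# Two hypothetical data frames and a model, saved into one workspace file
scores <- data.frame(id = 1:3, pretest = c(70, 82, 91))
people <- data.frame(id = 1:3, gender = c("f", "m", "f"))
fit <- lm(pretest ~ id, data = scores)

rdata_file <- tempfile(fileext = ".RData")
save(scores, people, fit, file = rdata_file)

# Load into a fresh environment and list what the file contains
loaded <- new.env()
load(rdata_file, envir = loaded)
object_names <- sort(ls(loaded))

# Keep only the data frames, the kind of object a data editor can open
is_df <- vapply(object_names,
                function(n) is.data.frame(get(n, envir = loaded)),
                logical(1))
data_frames <- object_names[is_df]
```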

**Data Import**

The open source version of BlueSky supports the following file formats, all located under “File > Open”:

- Comma Separated Values (.csv)
- Plain text files (.txt)
- Excel (old and new xls file types)
- Dbase’s DBF
- SPSS (.sav)
- SAS binary files (sas7bdat)
- Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File > Import Data” menu. The supported formats include:

- Microsoft Access
- Microsoft SQL Server
- MySQL
- PostgreSQL
- SQLite

**Data Export**

The ability to export data to a wide range of file types helps when you, or other members of your research team, have to use multiple tools to complete a task. Unfortunately, this is a very weak area for R GUIs. Deducer offers no data export at all, and R Commander and Rattle can export only delimited text files (an earlier version of this review listed jamovi as having very limited data export; that has now been expanded).

BlueSky offers a relatively comprehensive set of export options. The main one missing is SAS’ sas7bdat format, and that’s due to be added in the next release. Here’s the complete list:

- Comma Separated Values (.csv)
- Dbase (.dbf)
- Excel (.xlsx)
- IBM SPSS (.sav)
- R Objects (.RData)

**Data Management**

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.
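
As an illustration of what transforming many variables at once means in R terms, here is a minimal base R sketch that takes the logarithm of several columns in one step. The data set and variable names are hypothetical.

```r
# Hypothetical data set with three skewed measurements
bio <- data.frame(id     = 1:3,
                  mass   = c(1, 10, 100),
                  length = c(2, 20, 200),
                  width  = c(5, 50, 500))

vars_to_log <- c("mass", "length", "width")

# Apply log10 to every listed column at once, storing the results as new columns
bio[paste0("log_", vars_to_log)] <- lapply(bio[vars_to_log], log10)
```

The same pattern works for any function that accepts a vector, which is what makes a GUI that accepts several variables per dialog box such a time saver.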

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

- Bin Numeric Variable(s)
- Compute New Variable(s)
- Concatenate Multiple Variables (handling missing values)
- Convert Variable(s) to factors
- Dates: Convert dates to string
- Dates: Convert string to dates
- Delete Variable(s)
- Missing Values
- Rank Variable(s)
- Recode Variable(s)
- Standardize Variable(s)
- Transform Variable(s)
- Weight Variable(s)
- Aggregate to Dataset
- Aggregate to Output
- Merge Datasets
- Refresh Data Grid
- Reload Dataset from File
- Re-order Variables in Dataset Alphabetically
- Reshape: Wide to Long
- Reshape: Long to Wide
- Sample Dataset
- Sort Dataset
- Sort to Output
- Split Dataset: For Group-by Analysis (turn on / off)
- Split Dataset: For Partitioning (random or stratified)
- Data Subset
- Data Subset to Output
- Transpose Dataset: Entire dataset
- Transpose Dataset: Select variables
- Legacy (repeats some of the above, using base R)

**Menus & Dialog Boxes**

The goal of pointing & clicking your way through an analysis is to save time by *recognizing* menu settings rather than performing the more difficult task of *recalling* programming commands. Some GUIs, such as jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.

BlueSky uses standard menu choices for running steps listed on the Graphics, Analysis, Model Fitting, or Model Tuning menus. Dialog boxes appear and you select variables to place into their various roles. This is accomplished by either dragging the variable names or by selecting them and clicking an arrow located next to the particular role box. You then can click on either “OK" to run the step, or “Syntax" to write the code for that step to the R program editor. To run a variation on the same analysis, the dialog boxes make quick work of it by remembering their previous settings (within a session).

The output is saved not by using the standard “File > Save As” menu, but instead with the “Output > Save Output” selection from the main window. Oddly enough, while most menus are duplicated in both the main screen and the Output/Syntax screen, the ability to open or save output only appears on the main screen. If you exit without saving, BlueSky will prompt you to save both output and syntax (if you’ve used any of the latter).

During GUI-driven analysis, the only indication you have that R is doing the work is the code that appears in the output window before each result. However, if you click the “Syntax” button instead of “OK”, the program editor will pop out of the right side of the output window. The code will be added to the bottom of the program editor, and it will be highlighted so that a click on the “Run” icon will execute it.

**Documentation & Training**

At the moment, this review is probably one of the most thorough written descriptions of how to use BlueSky.

The BlueSkyStatistics.com site offers training videos on how to use it. YouTube.com also offers training videos that show how to use BlueSky.

**Help**

R GUIs provide simple task-by-task dialog boxes which generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.

The level of help that BlueSky provides varies depending on how much help the developers think you need. Each dialog box has a help button in the upper right corner which pops a help window off to the right of the dialog box. For many dialog boxes, it provides a summary description, how to use the dialog box, all the GUI settings, and how the accompanying function works should you choose to write your own code. In the bottom right corner of each dialog box is a “Get R Help" button that takes you to the R help page for the standard R function that actually does the calculations (sometimes these are called directly, other times they’re used inside BlueSky’s functions.)

For some dialog boxes that simply call an R function (e.g. independent samples t-test), BlueSky will display R’s built-in help file. While this variable help approach has been done well, I would prefer a more consistent approach. There are often things in the R help files that are not implemented in BlueSky, so it would be less confusing to eliminate those situations. For example, in the case of the t-test, the help file describes how “formula" works, but that concept is not addressable using BlueSky’s dialog box (nor is it needed).
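
For context, this is roughly what the “formula” idea in R’s t-test help file refers to. The sketch below uses the `sleep` data set that ships with R and runs the same independent samples t-test twice, once with a formula and once with two plain vectors. It is plain R written for illustration, not code produced by BlueSky.

```r
# 'sleep' ships with R: 'extra' is the outcome, 'group' is a two-level factor
by_formula <- t.test(extra ~ group, data = sleep)

# The same test written without a formula: one vector per group
by_vectors <- t.test(sleep$extra[sleep$group == "1"],
                     sleep$extra[sleep$group == "2"])

# Both calls describe the same comparison, so the p-values agree
same_result <- isTRUE(all.equal(by_formula$p.value, by_vectors$p.value))
```

A dialog box that asks for an outcome variable and a grouping variable covers the same ground as the formula, which is why GUI users never need to learn it.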

**Graphics**

The various GUIs available for R handle graphics in several ways. Some, such as RKWard, focus on R’s built-in graphics. Others, such as jamovi, use their own functions and integrate them into analysis steps. GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it. That’s how jamovi works, but then jamovi is limited to its custom graph functions, as nice as they may be.

BlueSky does most of its plots using the popular ggplot2 package, so that’s the code it will create if you want to learn it. BlueSky’s dialogs for creating graphs are extremely easy to use. By comparison, learning ggplot2 code can be confusing at first. BlueSky also offers several of R’s traditional graphics functions, which it places under a “Legacy” menu. While these graphs are usually not as nice as the ones created by the rest of its menus (i.e. those created by ggplot2), having both gives you the opportunity to compare both their appearance and the code used to create them.

Here is the selection of plots BlueSky can create.

- Bar Chart
- Bar Chart (means, confidence intervals)
- Boxplot
- Bullseye
- Contour
- Density (continuous)
- Density (counts)
- Frequency charts (factors)
- Frequency charts (numeric)
- Heatmap
- Line Chart
- Line Chart (line drawn in variable order)
- Line Chart (stair-step plot)
- Maps
- Pie Chart
- Plot of Means
- P-P Plots
- Q-Q Plots
- Scatterplot
- Scatterplot 3D
- Scatterplot (Binned hex)
- Scatterplot (Binned Square)
- Stem and Leaf Plot
- Strip Chart
- Violin Plot
- Legacy (repeats some of the above using R’s built-in graphics)

Let’s take a look at how BlueSky does scatterplots, using R’s ggplot2 package behind the scenes. Using the dialog box I chose only the X variable, Y variable, X facet factor, Y facet factor, and the type of smoothing fit. Note that the initial “for” loop steps through the names stored in varNames, which lets BlueSky repeat the plot for each Y variable selected (only one, posttest, here).

    local(
    {
    varNames=c('posttest')
    for (vars in varNames)
    {
    print(ggplot(Dataset2,aes(x = pretest, y =eval(parse(text=paste(vars))))) +
      geom_point() +
      labs(x = "pretest",y = vars) +
      facet_grid(workshop~gender) +
      geom_smooth(method ="lm"))
    }
    }
    )

**Modeling**

The way statistical models (which R stores in “model objects”) are created and used is the area in which R GUIs differ the most. The simplest and least flexible approach is taken by jamovi and RKWard. They try to do everything you might need in a single dialog box. They either don’t save models, or they do nothing with them. To an R programmer, that sounds extreme, since R does a lot with model objects. However, neither SAS nor SPSS was able to save models for the first 35 years of its existence, so each approach has its merits.

BlueSky’s modeling approach balances flexibility and ease of use. All its “Model Fitting" dialogs save the resulting model as a model object. They contain a “Model Name" field which is filled in with a useful default name such as, “LinearRegModel1". The analyses listed under “Model Statistics" automatically use the model you set in the upper right corner of the main control screen. You use the “Pick a Model" drop-down menu to choose your model. From then on, all the Model Statistics menu choices will use that model to calculate model measures such as AIC, or perform additional analyses, such as stepwise variable selection. A nice future improvement would be to have the software automatically choose the most recently created model.

The steps BlueSky currently offers to further manipulate models include: Stepwise, AIC, BIC, Confidence Intervals, Variance Inflation Factors, and the Bonferroni Outlier Test.
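
To show why saving a model object is so useful, here is a base R sketch that fits one model and then reuses it for several follow-up steps. It uses the `mtcars` data set that ships with R and base R functions of the same kind as the menu items above; these are not BlueSky’s own function calls, and the model name merely imitates BlueSky’s default naming style.

```r
# Fit once and keep the result as a model object
LinearRegModel1 <- lm(mpg ~ wt + hp, data = mtcars)

# Each of these reuses the stored model instead of refitting it
model_aic <- AIC(LinearRegModel1)
model_bic <- BIC(LinearRegModel1)
model_ci  <- confint(LinearRegModel1, level = 0.95)

# Stepwise selection also starts from the saved object
stepped <- step(LinearRegModel1, trace = 0)
```

Variance inflation factors and the Bonferroni outlier test are left out of the sketch because they need add-on packages.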

**Analysis Methods**

All of the R GUIs offer a decent set of statistical analysis methods. Some also offer machine learning methods. As you can see in the list below, BlueSky offers an extensive set of analysis methods. It also offers interesting variations on machine learning. Under its “Model Fitting” menu, it provides direct access to the most popular machine learning algorithms. If you are a beginner at machine learning, that’s where you would start. The menus call the various R functions directly, and if you display the commands, you’ll notice that each uses a slightly different syntax.

If you’re an advanced user of machine learning, you might skip directly to the “Model Tuning" menu. There you’ll find many of the same algorithms, this time controlled in a powerful and standard way using R’s caret package. There you begin by choosing one of four tuning methods and one of the nine machine learning algorithms. BlueSky then passes the work off to the caret package to find your optimal model.
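
For readers who have not met resampling-based tuning before, the sketch below shows the basic idea behind one of those tuning methods, k-fold cross-validation. It is deliberately hand-rolled in base R on the built-in `mtcars` data; it is not caret’s implementation and not code that BlueSky generates, and the two candidate models are arbitrary choices for the example.

```r
set.seed(42)                        # make the random fold assignment repeatable
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Candidate models to compare (hypothetical choices for this illustration)
candidates <- list(
  small = mpg ~ wt,
  large = mpg ~ wt + hp
)

# Root mean squared error on each held-out fold, averaged over the k folds
cv_rmse <- sapply(candidates, function(f) {
  fold_rmse <- sapply(1:k, function(i) {
    train <- mtcars[folds != i, ]
    test  <- mtcars[folds == i, ]
    fit   <- lm(f, data = train)
    sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
  })
  mean(fold_rmse)
})

# The candidate with the lowest cross-validated error
best <- names(which.min(cv_rmse))
```

A package such as caret wraps this kind of loop in a single consistent interface and adds a search over each algorithm’s tuning parameters, which is the work the Model Tuning menu hands off.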

Here is a comprehensive list of BlueSky’s methods of analysis:

- Cluster Analysis: Hierarchical
- Cluster Analysis: KMeans
- Contingency Tables: Multiway
- Contingency Tables: Two-way
- Distributions: Continuous: Beta Probabilities
- Distributions: Continuous: Beta Quantiles
- Distributions: Continuous: Plot Beta Distribution
- Distributions: Continuous: Sample from Beta Distribution
- Distributions: Continuous: Cauchy Probabilities
- Distributions: Continuous: Plot Cauchy Distribution
- Distributions: Continuous: Cauchy Quantiles
- Distributions: Continuous: Sample from Cauchy Distribution
- Distributions: Continuous: Chi-squared Probabilities
- Distributions: Continuous: Chi-squared Quantiles
- Distributions: Continuous: Plot Chi-squared Distribution
- Distributions: Continuous: Sample from Chi-squared Distribution
- Distributions: Continuous: Exponential Probabilities
- Distributions: Continuous: Exponential Quantiles
- Distributions: Continuous: Plot Exponential Distribution
- Distributions: Continuous: Sample from Exponential Distribution
- Distributions: Continuous: F Probabilities
- Distributions: Continuous: F Quantiles
- Distributions: Continuous: Plot F Distribution
- Distributions: Continuous: Sample from F Distribution
- Distributions: Continuous: Gamma Probabilities
- Distributions: Continuous: Gamma Quantiles
- Distributions: Continuous: Plot Gamma Distribution
- Distributions: Continuous: Sample from Gamma Distribution
- Distributions: Continuous: Gumbel Probabilities
- Distributions: Continuous: Gumbel Quantiles
- Distributions: Continuous: Plot Gumbel Distribution
- Distributions: Continuous: Sample from Gumbel Distribution
- Distributions: Continuous: Logistic Probabilities
- Distributions: Continuous: Logistic Quantiles
- Distributions: Continuous: Plot Logistic Distribution
- Distributions: Continuous: Sample from Logistic Distribution
- Distributions: Continuous: Lognormal Probabilities
- Distributions: Continuous: Lognormal Quantiles
- Distributions: Continuous: Plot Lognormal Distribution
- Distributions: Continuous: Sample from Lognormal Distribution
- Distributions: Continuous: Normal Probabilities
- Distributions: Continuous: Normal Quantiles
- Distributions: Continuous: Plot Normal Distribution
- Distributions: Continuous: Sample from Normal Distribution
- Distributions: Continuous: t Probabilities
- Distributions: Continuous: t Quantiles
- Distributions: Continuous: Plot t Distribution
- Distributions: Continuous: Sample from t Distribution
- Distributions: Continuous: Uniform Probabilities
- Distributions: Continuous: Uniform Quantiles
- Distributions: Continuous: Plot Uniform Distribution
- Distributions: Continuous: Sample from Uniform Distribution
- Distributions: Continuous: Weibull Probabilities
- Distributions: Continuous: Weibull Quantiles
- Distributions: Continuous: Plot Weibull Distribution
- Distributions: Continuous: Sample from Weibull Distribution
- Distributions: Discrete: Binomial Probabilities
- Distributions: Discrete: Binomial Quantiles
- Distributions: Discrete: Binomial Tail Probabilities
- Distributions: Discrete: Plot Binomial Distribution
- Distributions: Discrete: Sample from Binomial Distribution
- Distributions: Discrete: Geometric Probabilities
- Distributions: Discrete: Geometric Quantiles
- Distributions: Discrete: Geometric Tail Probabilities
- Distributions: Discrete: Plot Geometric Distribution
- Distributions: Discrete: Sample from Geometric Distribution
- Distributions: Discrete: Hypergeometric Probabilities
- Distributions: Discrete: Hypergeometric Quantiles
- Distributions: Discrete: Hypergeometric Tail Probabilities
- Distributions: Discrete: Plot Hypergeometric Distribution
- Distributions: Discrete: Sample from Hypergeometric Distribution
- Distributions: Discrete: Negative Binomial Probabilities
- Distributions: Discrete: Negative Binomial Quantiles
- Distributions: Discrete: Negative Binomial Tail Probabilities
- Distributions: Discrete: Plot Negative Binomial Distribution
- Distributions: Discrete: Sample from Negative Binomial Distribution
- Distributions: Discrete: Poisson Probabilities
- Distributions: Discrete: Poisson Quantiles
- Distributions: Discrete: Poisson Tail Probabilities
- Distributions: Discrete: Plot Poisson Distribution
- Distributions: Discrete: Sample from Poisson Distribution
- Factor Analysis: Factor Analysis
- Factor Analysis: Principal Components
- Market Basket: Basket data format
- Market Basket: Display Rules
- Market Basket: Multi-line transaction format
- Market Basket: Multiple variable format
- Market Basket: Plot Rules
- Means: T-Test, Independent Samples
- Means: T-Test, One Sample
- Means: T-Test, Paired Samples
- Means: ANCOVA
- Means: Multi-way ANOVA
- Means: One-way ANOVA
- Means: One-way ANOVA with Blocks
- Means: One-way ANOVA with Random Blocks
- Missing Values: Output Arranged in Columns
- Missing Values: Output Arranged in Rows
- Non-parametric Tests: Chisq Test
- Non-parametric Tests: Friedman Test
- Non-parametric Tests: Kruskal-Wallis Test
- Non-parametric Tests: Wilcoxon, Independent Samples
- Non-parametric Tests: Wilcoxon, Paired Samples
- Proportions: Binomial, Single Sample
- Proportions: Proportion Test, Independent Samples
- Proportions: Proportion Test, Single Sample
- Reliability Analysis (Cronbach’s Alpha, etc.)
- Summary Analysis: Analysis of Missing Values
- Summary Analysis: Frequency Table
- Summary Analysis: Table Top N
- Summary Analysis: Numerical Statistical Analysis
- Summary Analysis: Summary Statistics by Group
- Summary Analysis: Summary Statistics for All Variables
- Summary Analysis: Summary Statistics for Selected Variables
- Summary Analysis: Correlation Matrix
- Summary Analysis: Correlation Test (one pair)
- Summary Analysis: Correlation Test (Multi-variable)
- Summary Analysis: Shapiro-Wilk Normality Test
- Time Series: Automated ARIMA
- Time Series: Exponential Smoothing
- Time Series: Holt-Winters Seasonal
- Time Series: Holt-Winters Non-seasonal
- Variance: Bartlett’s Test
- Variance: Levene’s Test
- Variance: Variance Test, Two Samples
- Model Fitting: Contrast Display
- Model Fitting: Contrast Set
- Model Fitting: Decision Trees
- Model Fitting: Display Contrasts
- Model Fitting: GLZM
- Model Fitting: IRT: Simple Rasch Model
- Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
- Model Fitting: IRT: Partial Credit Model
- Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
- Model Fitting: IRT: Rating Scale Model
- Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
- Model Fitting: Linear Modeling
- Model Fitting: Linear Regression: Linear Regression
- Model Fitting: Linear Regression: Linear Regression with Formula
- Model Fitting: Logistic Regression: Logistic Regression
- Model Fitting: Logistic Regression: Logistic Regression with Formula
- Model Fitting: Multinomial Logit
- Model Fitting: Naive Bayes
- Model Fitting: Ordinal Regression
- Model Fitting: Random Forest: Random Forest
- Model Fitting: Random Forest: Tune Random Forest
- Model Fitting: Random Forest: Random Forest: Optimal Number of Trees
- Model Fitting: Summarizing Models for Each Group
- Model Tuning: Bootstrap Resample
- Model Tuning: K-fold Cross-Validation
- Model Tuning: Leave One Out Cross-Validation
- Model Tuning: Repeated K-fold Cross-Validation
- Model Tuning: AdaBoost Classification Trees
- Model Tuning: Bayesian Ridge Regression
- Model Tuning: CART
- Model Tuning: Naive Bayes
- Model Tuning: Random Forest
- Model Tuning: SVM (Linear Kernel)
- Model Tuning: SVM (Polynomial Kernel)
- Model Tuning: SVM (Radial Basis)
- Model Tuning: KNN
- Model Statistics: AIC
- Model Statistics: BIC
- Model Statistics: Bonferroni Outlier Test
- Model Statistics: Confidence Interval
- Model Statistics: Hosmer-Lemeshow Test
- Model Statistics: IRT: ICC Plots
- Model Statistics: IRT: Item Fit
- Model Statistics: IRT: Plot PI Map
- Model Statistics: IRT: Item and Test Information
- Model Statistics: Likelihood Ratio and Beta plots
- Model Statistics: Personfit
- Model Statistics: Pseudo R-Squared
- Model Statistics: Stepwise
- Model Statistics: Variance Inflation Factors

**Generated R Code**

One of the aspects that most differentiates the various GUIs for R is the code they generate. If you decide you want to save code, what type of code is best for you? The base R code as provided by the R Commander which can teach you “classic" R? The concise functions that mimic the simplicity of one-step dialogs such as jamovi provides? The completely transparent (and complex) code provided by RKWard, which might be the best for budding R power users?

BlueSky writes what you might call modern R code. For data management, it uses tidyverse packages; for graphics, it uses ggplot2, and for model tuning it uses the caret package.

Here’s an example of code BlueSky wrote to do a group-by aggregation:

    mySummarized <-mydata100 %>%
      dplyr::group_by(workshop,gender) %>%
      dplyr::summarize(mean_pretest=mean(pretest,na.rm =TRUE),
                       mean_posttest=mean(posttest,na.rm =TRUE))
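
For comparison, a similar group-by aggregation can be written in base R without any add-on packages. The sketch below is my own illustration, not BlueSky output; the small data frame is a made-up stand-in for mydata100, and the result columns keep their original names rather than gaining a mean_ prefix.

```r
# Hypothetical stand-in for mydata100
mydata100 <- data.frame(
  workshop = c("R", "R", "SAS", "SAS"),
  gender   = c("Female", "Female", "Male", "Male"),
  pretest  = c(70, 80, 60, NA),
  posttest = c(80, 90, 65, 75)
)

# Mean of each test within every workshop-by-gender combination.
# na.action = na.pass keeps rows with missing values so that
# na.rm = TRUE can drop them one variable at a time.
mySummarized <- aggregate(cbind(pretest, posttest) ~ workshop + gender,
                          data = mydata100, FUN = mean, na.rm = TRUE,
                          na.action = na.pass)
```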

Here is an example of code BlueSky wrote to convert my repeated-measures style “long" data set to a “wide" one. The long one had three main variables: an ID variable, a factor Time, and a measure Y. The resulting wide data set had ID and four variables named Time1, Time2, Time3, and Time4. The values of Y were spread across the four time variables. Here’s the code:

    require(tidyr);
    Bobs_Wide <- spread(Bobs_Long,Time,Y)
    BSkyLoadRefreshDataframe(Bobs_Wide,load.dataframe=TRUE)
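
To see what that long-to-wide step does without installing anything, here is a comparable sketch using base R’s reshape() function on a tiny made-up data set. It is an illustration under those assumptions rather than BlueSky’s code. Note also that current tidyr releases document spread() as superseded by pivot_wider(), so newer code may look different from the example above.

```r
# Hypothetical long data: two subjects measured at two times
Bobs_Long <- data.frame(ID   = c(1, 1, 2, 2),
                        Time = c("Time1", "Time2", "Time1", "Time2"),
                        Y    = c(10, 12, 20, 25))

# One row per ID, one column of Y values per level of Time
Bobs_Wide <- reshape(Bobs_Long, idvar = "ID", timevar = "Time",
                     direction = "wide")

# reshape() names the new columns "Y.Time1" and "Y.Time2"; strip the "Y." prefix
names(Bobs_Wide) <- sub("^Y\\.", "", names(Bobs_Wide))
```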

Below is an example of BlueSky’s code for a simple linear regression. BlueSky even provided the comments explaining each step, a nice touch! Note that it uses its own set of functions, such as BSkyRegression(), instead of R’s built-in lm() function. That function does both the modeling step and the text formatting step. This is very similar to the approach used by jamovi, except that BlueSky does its plotting with a separate call to R’s standard plot function (one of the few times it uses it) rather than integrating the plots into a single regression function call.

    BSkyLoadRefreshDataframe(BobsAgg)
    #Builds a linear regression model. Returns an object called
    #BSkyLinearRegression which is an object of class lm.
    # Displays a summary of the model, coefficient table,
    # Anova table and sum of squares table.
    LinearRegModel1= BSkyRegression(depVars ='posttest', indepVars =c('pretest'),dataset="Dataset2")
    #Plots residuals vs. fitted, normal Q-Q, theoretical quantiles,
    #residuals vs. leverage
    if(TRUE)
    {
    plot(LinearRegModel1)
    }

**Support for Programmers**

Some of the GUIs reviewed in this series of articles include extensive support for programmers. For example, RKWard offers much of the power of Integrated Development Environments (IDEs) such as RStudio or Eclipse StatET. Others, such as jamovi or the R Commander, offer little more than a simple text editor.

While BlueSky’s main mission is to make its point-and-click GUI comprehensive, it does include a basic program editor which supports the writing and debugging of code. The code editor is hidden at start-up, but an arrow at the upper right corner of the output window will pop open the code editor at any time (and pop it closed, if already open). A click on the Syntax button in any dialog box will also pop the code editor open.

The code editor supports syntax highlighting, and it can collapse and expand blocks of code. It also offers some hints on function name completion. For example, typing “m" will cause it to offer “min" and “max" functions, but oddly enough, it will not offer “mean" or “median." It doesn’t provide hints on argument names or values, nor does it offer to complete object names. RStudio and RKWard both offer much more support for coders.

However, the lack of features for coders offers a benefit to GUI users: nearly all the menus and their entries are focused on GUI use. In this regard, BlueSky is the mirror image of RKWard, which has several menus full of features that only coders use.

**Reproducibility & Sharing**

One of the biggest challenges that GUI users face is being able to reproduce their work. Reproducibility is useful for re-running everything on the same dataset if you find a data entry error. It’s also useful for applying your work to new datasets so long as they use the same variable names (or the software can handle name changes). Some scientific journals ask researchers to submit their files (usually code and data) along with their written report so that others can check their work.

As important a topic as it is, reproducibility is a problem for GUI users, a problem that has only recently been solved by some software developers. Most GUIs (e.g. the R Commander, Rattle) save only code, but since the GUI user didn’t write the code, they also can’t read it or change it! Others such as jamovi, RKWard, and the newest version of SPSS save the dialog box entries and allow GUI users to have reproducibility in the form they prefer.

BlueSky offers only code-based reproducibility. There’s no way to get back to a filled-in dialog box when starting from the saved code.

If you wish to share your work with a colleague, you would send them the code and your data set. They could then install the appropriate version of BlueSky to run it. They could also install the “BlueSky Statistics R Package”, enabling them to run the code in any R environment. At the moment, that package is only available for download from the company web site. However, the developers plan on moving it to CRAN eventually.

**Package Management**

A topic related to reproducibility is *package management*. One of the major advantages to the R language is that it’s very easy to extend its capabilities through add-on packages. However, updates in these packages may break a previously functioning analysis. Years from now you may need to run a variation of an analysis, which would require you to find the version of R you used, plus the packages you used at the time. As a GUI user, you’d also need to find the version of the GUI that was compatible with that version of R.
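
One way to make that future search easier is to save the version numbers alongside the analysis. The sketch below does this in plain R; it is not a BlueSky feature, and the package list and the file name `analysis_versions.csv` are illustrative assumptions.

```r
# Packages whose versions we want to record (illustrative; list your own here)
pkgs <- c("stats", "graphics")

# Look up the installed version of each package
pkg_versions <- vapply(
  pkgs,
  function(p) as.character(packageVersion(p)),
  character(1)
)

# Combine the R version and the package versions into one small table
version_log <- data.frame(
  component = c("R", pkgs),
  version = c(paste(R.version$major, R.version$minor, sep = "."), pkg_versions),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Save the table next to the analysis (a temporary directory is used here)
log_file <- file.path(tempdir(), "analysis_versions.csv")
write.csv(version_log, log_file, row.names = FALSE)
print(version_log)
```

For a fuller record, `sessionInfo()` prints the R version, platform, and every attached package in one call.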

Some GUIs, such as the R Commander and Deducer, depend on you to find and install R. For them, the problem of long-term stability is yours to solve. Others, such as jamovi, distribute their own version of R, and all R packages, but not their add-on modules. This requires a bigger installation file, but it makes dealing with long-term stability simpler. Of course, this depends on all major versions being around for the long term, but for open-source software, there are usually multiple archives available to store software even if the original project is defunct.

BlueSky’s approach to package management is the most comprehensive of the R GUIs reviewed here. It provides everything you need in a single download. This includes the BlueSky interface, R itself, all R packages, and all BlueSky plug-ins. If you have a problem reproducing a BlueSky analysis in the future, all you need to do is download the version used when you created it.

**Output & Report Writing**

Ideally, output should be clearly labeled, well organized, and of publication quality. It might also delve into the realm of word processing through Sweave/knitr and Rmarkdown documents. At the moment, none of the GUIs covered in this series of reviews meets all of these requirements. See the separate reviews to see how each of the other packages is doing on this topic.

The labels for each of BlueSky’s analyses are provided by its menu title, e.g. Linear Regression. However, double-clicking on the title in the output switches it into edit mode where you can change it to anything you like. Unfortunately, there is no way to add comments or notes in the output, but of course you can do so in the code that it generates in the program editor.

The organization of the output is in time-order only, and you cannot delete any of the steps you take. This often results in a messy output file filled with unneeded results. A table of contents will pop out of the left side of the output window when you choose “Layout> Show Navigation Tree.” While such tables of contents are commonly used in GUIs to let you re-order, rename, or delete bits of output, those tasks are not possible here. You can un-check any output there to hide it, but it’s not deleted. You are better off keeping a word processing file open to paste in the results you want to keep.

BlueSky’s output quality is very high, with nice fonts of your choosing and true rich text tables (see Figure 5). To have them display using the popular style of the American Psychological Association (see Table 1) save the setting: “Options> Configuration Settings> Others> Show output tables in APA style.” From that point on, all your output tables will use APA format. That means you can right-click on any table and choose “Export to Word (or Excel)” and the formatting is retained. That really helps speed your work as R output defaults to mono-spaced fonts that require additional steps to get into publication form (e.g. using functions from packages such as xtable or texreg). You can also choose “Copy to Clipboard”, but pasting from there into Word will lose the full formatting, while still remaining a true table. All the output is stored in a single file, which can be exported to PDF and from there edited in Microsoft Word.

A nice feature of BlueSky’s output tables is that they are all interactive. So if you have a complex model you’re studying, you can easily sort the output by p-value, or parameter size, or any column you choose. That’s a nice and fairly unique feature.
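
To appreciate what that click-to-sort saves you, here is roughly what sorting a coefficient table by p-value takes in plain R. This is only a sketch on simulated data; the variable names are hypothetical.

```r
# Simulated data with three predictors (hypothetical example)
set.seed(1)
n <- 200
demo_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo_data$y <- 2 * demo_data$x1 + 0.3 * demo_data$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = demo_data)

# Pull the coefficient table out of the summary and convert it to a data frame
coef_table <- as.data.frame(coef(summary(fit)))
coef_table$term <- rownames(coef_table)

# Sort the rows by p-value, smallest first
sorted_by_p <- coef_table[order(coef_table[["Pr(>|t|)"]]), ]
print(sorted_by_p)
```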

**Group-By Analyses**

Repeating an analysis on different groups of observations is a core task in data science. Software needs to provide the ability to select one group as a subset to analyze, then another subset to compare it to. All the R GUIs reviewed in this series can do this task. BlueSky does single-group selections in “Data> Subset”. It generates a subset that you can analyze in the same way as the entire dataset.

Software also needs the ability to automate such selections so that you might generate dozens of analyses, one group at a time. While this has been available in commercial GUIs for decades (e.g. SPSS split-file), BlueSky is the only R GUI that includes this feature. BlueSky automates group-by analyses under “Split> For Analysis> Split”. All analyses that follow will be done repeatedly for each level of the factor(s) chosen. This feature is turned off via “Split> For Analysis> Remove Split.”
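
For readers curious about what a split does conceptually, the idea can be sketched in plain R with split() and lapply(). This is an illustration on simulated data with hypothetical names, not the code BlueSky generates.

```r
# Simulated data with a three-level grouping factor (hypothetical example)
set.seed(7)
demo_data <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 40)),
  pretest = rnorm(120, mean = 75, sd = 5)
)
demo_data$posttest <- 5 + demo_data$pretest + rnorm(120, sd = 3)

# Split the data by the grouping factor, one data frame per level
by_group <- split(demo_data, demo_data$group)

# Repeat the same analysis for each level
group_models <- lapply(by_group, function(d) lm(posttest ~ pretest, data = d))

# Print one coefficient table per group
for (g in names(group_models)) {
  cat("\nGroup:", g, "\n")
  print(coef(summary(group_models[[g]])))
}
```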

**Output Management**

Early in the development of statistical software, developers tried to guess what output would be important to save to a new dataset (e.g. predicted values, factor scores), and the ability to save such output was built into the analysis procedures themselves. However, researchers were far more creative than the developers anticipated. To better meet their needs, output management systems were created and tacked on to existing tools (e.g. SAS’ Output Delivery System, SPSS’ Output Management System). One of R’s greatest strengths is that every bit of output can be readily used as input. However, for the simplification that GUIs provide, that’s a challenge.

Output data can be observation-level, such as predicted values for each observation or case. When group-by analyses are run, the output data can also be observation-level, but now the (e.g.) predicted values would be created by individual models for each group, rather than one model based on the entire original data set (perhaps with group included as a set of indicator variables).

Group-by analyses can also create model-level data sets, such as one R-squared value for each group’s model. They can also create parameter-level data sets, such as the p-value for each regression parameter for each group’s model. (Saving and using single models is covered under “Modeling” above.)

For example, in our organization, we have 250 departments and want to see if any of them have a gender bias on salary. We write all 250 regression models to a data set, and then search to find those whose gender parameter is significant (hoping to find none, of course!)

BlueSky is the only R GUI reviewed here that does all three levels of output management. To use this function, choose “Model Fitting> Summarizing models for each group”, then specify the model and the grouping factor. It automatically creates three data sets, one at each level of analysis. This ability works only for regression, ANOVA, and multinomial logistic models. More are planned for future versions.
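
To make the three levels concrete, here is a minimal sketch in plain R that builds one data set at each level from per-department regressions. The data are simulated and every name (`salary_data`, `dept`, `years`, and so on) is a hypothetical stand-in; this is not how BlueSky implements the feature.

```r
# Simulated salary data for five departments (hypothetical example)
set.seed(11)
salary_data <- data.frame(
  dept = factor(rep(paste0("D", 1:5), each = 30)),
  gender = factor(rep(c("F", "M"), times = 75)),
  years = runif(150, min = 0, max = 20)
)
salary_data$salary <- 40000 + 1500 * salary_data$years + rnorm(150, sd = 5000)

# Fit one model per department
by_dept <- split(salary_data, salary_data$dept)
models <- lapply(by_dept, function(d) lm(salary ~ gender + years, data = d))

# Observation level: each row gets the prediction from its own group's model
obs_level <- do.call(rbind, lapply(names(models), function(g) {
  d <- by_dept[[g]]
  d$predicted <- fitted(models[[g]])
  d
}))

# Model level: one R-squared value per department
model_level <- data.frame(
  dept = names(models),
  r_squared = vapply(models, function(m) summary(m)$r.squared, numeric(1)),
  row.names = NULL
)

# Parameter level: one row per coefficient per department
param_level <- do.call(rbind, lapply(names(models), function(g) {
  ct <- coef(summary(models[[g]]))
  data.frame(
    dept = g,
    term = rownames(ct),
    estimate = ct[, "Estimate"],
    p_value = ct[, "Pr(>|t|)"],
    row.names = NULL
  )
}))

# Search the parameter-level data for departments with a significant gender term
flagged <- subset(param_level, term == "genderM" & p_value < 0.05)
print(flagged)
```

Because the simulated salaries do not depend on gender, any department flagged here would be a false positive, which is worth remembering when screening hundreds of models at once.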

While BlueSky is ahead of the GUI pack in output management, the approach listed above still makes judgment calls about what output is useful for further analysis. What would you do to analyze an output table not covered by the above methods? Recall that all BlueSky output tables are true tables that can be exported to Word or Excel. Using that approach, you could save any table you like, export it and then open it as a data set to analyze. It’s not the most elegant approach, but it is quite comprehensive.
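
The same round trip can be sketched in plain R, with a CSV file standing in for the Excel export (an assumption made to keep the example free of add-on packages). The example uses R's built-in `cars` dataset and a hypothetical file name.

```r
# Fit a model on R's built-in cars dataset and extract its coefficient table
fit <- lm(dist ~ speed, data = cars)
coef_table <- as.data.frame(coef(summary(fit)))
coef_table <- cbind(term = rownames(coef_table), coef_table)

# "Export" the table to a file (CSV here, standing in for Excel)
out_file <- file.path(tempdir(), "coef_table.csv")
write.csv(coef_table, out_file, row.names = FALSE)

# Re-open the exported table as an ordinary data set, ready for analysis
reopened <- read.csv(out_file, check.names = FALSE)
str(reopened)
```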

**Developer Issues**

There are two ways developers can contribute to the open source project:

- Developers who want to add to or modify the application (e.g. provide new right-click controls, or integrate with big data libraries like Hadoop and Spark) can download the source code from https://github.com/BlueSkyStatistics/BlueSkyRepository.
- Programmers who want to add new statistical analyses to BlueSky Statistics should watch the training videos on the dialog editor program.

**Conclusion**

BlueSky Statistics offers an extensive set of tools that are easy for a point-and-click user to use. If you’re looking for a GUI that lets you do the most using just menus and dialog boxes, BlueSky should be on your list of software to try. BlueSky and R Commander are both way out in front of the R GUI competition when it comes to breadth of coverage in data management, graph types, and methods of analysis. I encourage you to read both reviews carefully when choosing between these two. Also keep in mind that while jamovi is newer and currently has fewer features, its developers are adding new ones at a rapid pace.

For a summary of all my R GUI software reviews, see the article, *R Graphical User Interface Comparison*.

**Acknowledgements**

Thanks to the BlueSky team who have done a lot of hard work and made all but the terminal server version of it free and open source. Thanks also to Rachel Ladd, Ruben Ortiz, Christina Peterson, and Josh Price for their editorial suggestions.