A Comparative Review of the BlueSky Statistics GUI for R

by Robert A. Muenchen

Introduction

BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available, which includes technical support and a version for Windows terminal servers such as Remote Desktop or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others, such as Deducer, install in multiple steps (up to seven steps, depending on your needs). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. Its members contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky’s capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R Commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

Figure 1. The main BlueSky Application screen.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=Strongly Disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right-click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

To add rows, you click the spot on the Data tab that is clearly labeled “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end."

To enter factor data, it’s best to leave it numeric such as 1 or 2, for male and female, then set the labels (which are called values using SPSS terminology) afterward. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developers’ future plans include the automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside while displaying labels that may or may not be the same as the numbers they represent.
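To make that oddity concrete, here is a minimal base R sketch. The variable names and the codes are made up for illustration; this is not code that BlueSky generates.

```r
# Hypothetical codes: 1 = male, 2 = female
gender_codes <- c(1, 2, 2, 1)

# Attach labels to the numeric codes
gender <- factor(gender_codes, levels = c(1, 2), labels = c("male", "female"))

# R displays the labels...
gender_labels <- as.character(gender)
# ...but stores integer codes underneath
gender_internal <- as.integer(gender)

# Labels that look like numbers need not match the internal codes
dose <- factor(c("10", "20", "10"))
dose_internal <- as.integer(dose)              # position within the set of levels
dose_values <- as.numeric(as.character(dose))  # the numbers the labels spell out

print(gender)
print(gender_internal)
print(dose_internal)
print(dose_values)
```

The dose example shows why entering the numbers as labels is the safer choice: the label is what you see, while the internal code is only the label's position among the levels.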

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to the date format. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.
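For readers curious about what such a conversion looks like in R itself, here is a small base R sketch. I have not checked what code the Compute dialog writes, so treat this as the plain R equivalent rather than BlueSky's output; the dates and variable names are invented.

```r
# Hypothetical dates typed in as character data
visit_chr <- c("2019-03-15", "2019-04-01")

# Convert the text to R's Date class; the format string describes the text layout
visit_date <- as.Date(visit_chr, format = "%Y-%m-%d")

# Once converted, date arithmetic works
days_between <- as.numeric(diff(visit_date))

print(class(visit_date))
print(days_between)
```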

If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can change data sets simply by clicking on a data set’s tab, and its window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.
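Here is a rough base R sketch of what such a compound file looks like from R's side, and of how one data frame at a time can be picked out of it. The object names are hypothetical, and this is my own illustration, not the code BlueSky runs when it opens such a file.

```r
# Two hypothetical data frames and one object that is not a data frame
surveyA <- data.frame(id = 1:3, score = c(10, 12, 9))
surveyB <- data.frame(id = 1:2, score = c(7, 11))
note <- "not a data frame"

# R can save several objects into a single file
compound_file <- tempfile(fileext = ".RData")
save(surveyA, surveyB, note, file = compound_file)

# Load into a separate environment so nothing in the workspace is overwritten
loaded <- new.env()
load(compound_file, envir = loaded)

# Find which of the stored objects are data frames
is_df <- vapply(ls(loaded),
                function(nm) is.data.frame(get(nm, envir = loaded)),
                logical(1))
data_frame_names <- names(is_df)[is_df]
print(data_frame_names)

# Pick one data frame at a time
chosen <- get("surveyB", envir = loaded)
print(chosen)
```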

Figure 3. Output window showing standard journal-style tables. Syntax editor has been opened and is shown on right side.

Data Import

The open source version of BlueSky supports the following file formats, all located under “File> Open”:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • Excel (old and new xls file types)
  • Dbase’s DBF
  • SPSS (.sav)
  • SAS binary files (sas7bdat)
  • Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

  • Microsoft Access
  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • SQLite

Data Export

The ability to export data to a wide range of file types helps when you, or other members of your research team, have to use multiple tools to complete a task. Unfortunately, this is a very weak area for R GUIs. Deducer offers no data export at all, and R Commander and rattle can export only delimited text files (an earlier version of this listed jamovi as having very limited data export; that has now been expanded).

BlueSky offers a relatively comprehensive set of export options. The main one missing is SAS’ sas7bdat format, and that’s due to be added in the next release. Here’s the complete list:

  • Comma Separated Values (.csv)
  • Dbase (.dbf)
  • Excel (.xlsx)
  • IBM SPSS (.sav)
  • R Objects (.RData)
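For comparison, two of these formats can be written from plain R without any add-on packages. The sketch below uses a made-up data set and temporary files; the dbf, xlsx, and sav formats need add-on packages, which I leave out here.

```r
# Hypothetical data set
mydata <- data.frame(id = 1:3,
                     workshop = c("R", "SAS", "R"),
                     pretest = c(72, 80, 65))

# Comma Separated Values
csv_file <- tempfile(fileext = ".csv")
write.csv(mydata, csv_file, row.names = FALSE)

# R objects (.RData)
rdata_file <- tempfile(fileext = ".RData")
save(mydata, file = rdata_file)

# Read the CSV back in to confirm the round trip
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
print(from_csv)
```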

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, and biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard, handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.
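To show what “many variables at once” means in R terms, here is a minimal base R sketch. The survey items, the measurements, and the reverse-scoring rule are all invented for the example; a GUI would generate something along these lines behind its dialog box, though not necessarily this code.

```r
# Hypothetical survey: items q1-q3 scored 1 to 5, plus two skewed measurements
survey <- data.frame(
  q1 = c(1, 5, 3), q2 = c(2, 4, 5), q3 = c(5, 1, 2),
  mass = c(10, 100, 1000), length = c(1, 10, 100)
)

# Recode (reverse-score) all three items in one step: 1 <-> 5, 2 <-> 4
items <- c("q1", "q2", "q3")
survey[items] <- lapply(survey[items], function(x) 6 - x)

# Take base-10 logs of several variables at once, stored under new names
measures <- c("mass", "length")
survey[paste0("log_", measures)] <- lapply(survey[measures], log10)

print(survey)
```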

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

  1. Bin Numeric Variable(s)
  2. Compute New Variable(s)
  3. Concatenate Multiple Variables (handling missing values)
  4. Convert Variable(s) to factors
  5. Dates: Convert dates to string
  6. Dates: Convert string to dates
  7. Delete Variable(s)
  8. Missing Values
  9. Rank Variable(s)
  10. Recode Variable(s)
  11. Standardize Variable(s)
  12. Transform Variable(s)
  13. Weight Variable(s)
  14. Aggregate to Dataset
  15. Aggregate to Output
  16. Merge Datasets
  17. Refresh Data Grid
  18. Reload Dataset from File
  19. Re-order Variables in Dataset Alphabetically
  20. Reshape: Wide to Long
  21. Reshape: Long to Wide
  22. Sample Dataset
  23. Sort Dataset
  24. Sort to Output
  25. Split Dataset: For Group-by Analysis (turn on / off)
  26. Split Dataset: For Partitioning (random or stratified)
  27. Data Subset
  28. Data Subset to Output
  29. Transpose Dataset: Entire dataset
  30. Transpose Dataset: Select variables
  31. Legacy (repeats some of the above, using base R)

Menus & Dialog Boxes

The goal of pointing & clicking your way through an analysis is to save time by recognizing menu settings rather than performing the more difficult task of recalling programming commands. Some GUIs, such as jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to them and hence require more learning.

BlueSky uses standard menu choices for running steps listed on the Graphics, Analysis, Model Fitting, or Model Tuning menus. Dialog boxes appear and you select variables to place into their various roles. This is accomplished by either dragging the variable names or by selecting them and clicking an arrow located next to the particular role box. You then can click on either “OK” to run the step, or “Syntax” to write the code for that step to the R program editor. To run a variation on the same analysis, the dialog boxes make quick work of it by remembering their previous settings (within a session).

The output is saved not by using the standard “File > Save As” menu, but instead with the “Output > Save Output” selection from the main window. Oddly enough, while most menus are duplicated in both the main screen and the Output/Syntax screen, the ability to open or save output only appears on the main screen. If you exit without saving, BlueSky will prompt you to save both output and syntax (if you’ve used any of the latter).

During GUI-driven analysis, the only indication you have that R is doing the work is the code that appears in the output window before each result. However, if you click the “Syntax” button instead of “OK”, the program editor will pop out the right side of the output window. The code will be added to the bottom of the program editor, and it will be highlighted so that a click on the “Run” icon will execute it.

Documentation & Training

At the moment, this review is probably one of the most thorough written descriptions of how to use BlueSky.

The BlueSkyStatistics.com site offers training videos on how to use it. YouTube.com also offers training videos that show how to use BlueSky.

Help

R GUIs provide simple task-by-task dialog boxes which generate much more complex code. So for a particular task, you might want to get help on 1) the dialog box’s settings, 2) the custom functions it uses (if any), and 3) the R functions that the custom functions use. Nearly all R GUIs provide all three levels of help when needed. The notable exception is the R Commander, which lacks help on the dialog boxes themselves.

The level of help that BlueSky provides varies depending on how much help the developers think you need. Each dialog box has a help button in the upper right corner which pops a help window off to the right of the dialog box. For many dialog boxes, it provides a summary description, how to use the dialog box, all the GUI settings, and how the accompanying function works should you choose to write your own code. In the bottom right corner of each dialog box is a “Get R Help” button that takes you to the R help page for the standard R function that actually does the calculations (sometimes these are called directly; other times they’re used inside BlueSky’s functions).

For some dialog boxes that simply call an R function (e.g. independent samples t-test), BlueSky will display R’s built-in help file. While this variable help approach has been done well, I would prefer a more consistent approach. There are often things in the R help files that are not implemented in BlueSky, so it would be less confusing to eliminate those situations. For example, in the case of the t-test, the help file describes how “formula” works, but that concept is not addressable using BlueSky’s dialog box (nor is it needed).
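For anyone wondering what that formula concept refers to, here is a short base R sketch with made-up scores. It runs the same independent samples t-test twice, once through the formula interface described in the help file and once by passing the two groups as separate vectors.

```r
# Hypothetical scores for two groups
scores <- data.frame(
  posttest = c(80, 85, 78, 90, 70, 72, 68, 75),
  gender = factor(rep(c("Female", "Male"), each = 4))
)

# Formula interface: outcome ~ grouping factor
by_formula <- t.test(posttest ~ gender, data = scores)

# Equivalent call passing the two groups as separate vectors
by_vectors <- t.test(scores$posttest[scores$gender == "Female"],
                     scores$posttest[scores$gender == "Male"])

print(by_formula)
```

Both calls give the same test; a dialog box in which you pick the outcome and the grouping variable covers the same ground without the user ever seeing the formula.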

Graphics

The various GUIs available for R handle graphics in several ways. Some, such as RKWard, focus on R’s built-in graphics. Others, such as jamovi, use their own functions and integrate them into analysis steps. GUIs also differ quite a lot in how they control the style of the graphs they generate. Ideally, you could set the style once, and then all graphs would follow it. That’s how jamovi works, but then jamovi is limited to its custom graph functions, as nice as they may be.

BlueSky does most of its plots using the popular ggplot2 package, so that’s the code it will create if you want to learn it. BlueSky’s dialogs for creating graphs are extremely easy to use. By comparison, learning ggplot2 code can be confusing at first. BlueSky also offers several of R’s traditional graphics functions, which it places under a “Legacy” menu. While these graphs are usually not as nice as the ones created by the rest of its menus (i.e. those created by ggplot2), having both gives you the opportunity to compare both their appearance and the code used to create them.

Here is the selection of plots BlueSky can create.

  1. Bar Chart
  2. Bar Chart (means, confidence intervals)
  3. Boxplot
  4. Bullseye
  5. Contour
  6. Density (continuous)
  7. Density (counts)
  8. Frequency charts (factors)
  9. Frequency charts (numeric)
  10. Heatmap
  11. Line Chart
  12. Line Chart (line drawn in variable order)
  13. Line Chart (stair-step plot)
  14. Maps
  15. Pie Chart
  16. Plot of Means
  17. P-P Plots
  18. Q-Q Plots
  19. Scatterplot
  20. Scatterplot 3D
  21. Scatterplot (Binned hex)
  22. Scatterplot (Binned Square)
  23. Stem and Leaf Plot
  24. Strip Chart
  25. Violin Plot
  26. Legacy (repeats some of the above using R’s built-in graphics)

Let’s take a look at how BlueSky does scatterplots, using R’s ggplot2 package behind the scenes. Using the dialog box, I chose only the X variable, Y variable, X facet factor, Y facet factor, and the type of smoothing fit. Note that the initial “for” loop steps through the names stored in varNames, which lets BlueSky repeat the plot for each Y variable chosen in the dialog (only one, posttest, is used here).

local(
{
varNames=c('posttest')
for (vars in varNames)
{
print(ggplot(Dataset2,aes(x = pretest,
y =eval(parse(text=paste(vars))))) + 
geom_point() + labs(x = "pretest",y = vars) +
facet_grid(workshop~gender) +geom_smooth(method ="lm"))
}
}
)

Figure 4. A faceted scatterplot created by BlueSky and the ggplot2 package.
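The loop in the generated code steps through the names in varNames, and eval(parse(text = ...)) turns each name into the column it refers to. The base R sketch below isolates that idiom, leaving ggplot2 out so it runs anywhere. The small Dataset2 and the extra followup variable are hypothetical stand-ins, added only so the loop has two names to visit.

```r
# Hypothetical stand-in for Dataset2
Dataset2 <- data.frame(pretest = c(60, 70, 80),
                       posttest = c(65, 78, 88),
                       followup = c(66, 75, 90))

varNames <- c("posttest", "followup")
means_eval <- numeric(0)
means_index <- numeric(0)

for (vars in varNames) {
  # Generated-code idiom: parse the name held in vars and evaluate it in the data
  y_eval <- eval(parse(text = vars), envir = Dataset2)
  # A more common way to fetch a column by name
  y_index <- Dataset2[[vars]]
  means_eval[vars] <- mean(y_eval)
  means_index[vars] <- mean(y_index)
}

print(means_eval)
print(means_index)
```

Both routes reach the same column, so the loop body runs once per name in varNames.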

Modeling

The way statistical models (which R stores in “model objects”) are created and used is the area in which R GUIs differ the most. The simplest, and least flexible, approach is taken by jamovi and RKWard. They try to do everything you might need in a single dialog box. They either don’t save models, or they do nothing with them. To an R programmer, that sounds extreme, since R does a lot with model objects. However, neither SAS nor SPSS was able to save models for the first 35 years of its existence, so each approach has its merits.

BlueSky’s modeling approach balances flexibility and ease of use. All its “Model Fitting” dialogs save the resulting model as a model object. They contain a “Model Name” field which is filled in with a useful default name such as “LinearRegModel1”. The analyses listed under “Model Statistics” automatically use the model you set in the upper right corner of the main control screen. You use the “Pick a Model” drop-down menu to choose your model. From then on, all the Model Statistics menu choices will use that model to calculate model measures such as AIC, or perform additional analyses, such as stepwise variable selection. A nice future improvement would be to have the software automatically choose the most recently created model.

The steps BlueSky currently offers to further manipulate models include: Stepwise, AIC, BIC, Confidence Intervals, Variance Inflation Factors, and the Bonferroni Outlier Test.
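To illustrate the idea of fitting a model once and then handing the stored object to later steps, here is a base R sketch using R’s built-in mtcars data. The model name simply borrows BlueSky’s default; the calls are standard R functions, and I am not claiming these are the exact calls BlueSky’s menus make. Variance inflation factors and the Bonferroni outlier test need an add-on package, so they are left out.

```r
# Fit a model and keep it as an object
LinearRegModel1 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Later steps take the stored model as their input
model_aic <- AIC(LinearRegModel1)
model_bic <- BIC(LinearRegModel1)
model_ci <- confint(LinearRegModel1, level = 0.95)

# Stepwise selection by AIC, starting from the stored model
stepped <- step(LinearRegModel1, direction = "both", trace = 0)

print(model_aic)
print(model_bic)
print(model_ci)
print(formula(stepped))
```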

Analysis Methods

All of the R GUIs offer a decent set of statistical analysis methods. Some also offer machine learning methods. As you can see in the list below, BlueSky offers an extensive set of analysis methods. It also offers interesting variations on machine learning. Under its “Model Fitting” menu, it provides direct access to the most popular machine learning algorithms. If you are a beginner at machine learning, that’s where you would start. The menus call the various R functions directly, and if you display the commands, you’ll notice that each uses a slightly different syntax.

If you’re an advanced user of machine learning, you might skip directly to the “Model Tuning” menu. There you’ll find many of the same algorithms, this time controlled in a powerful and standard way using R’s caret package. There you begin by choosing one of four tuning methods and one of the nine machine learning algorithms. BlueSky then passes the work off to the caret package to find your optimal model.
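For readers new to these tuning methods, here is a hand-rolled base R sketch of one of them, K-fold cross-validation, applied to a linear regression on the built-in mtcars data. This is not caret code and not what BlueSky generates; caret automates this kind of resampling and adds a search over tuning parameters. The choice of five folds, the model, and the seed are my own assumptions.

```r
set.seed(42)  # make the random fold assignment repeatable

k <- 5
n <- nrow(mtcars)
# Assign each row to one of k folds at random
fold <- sample(rep(seq_len(k), length.out = n))

rmse_by_fold <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- mtcars[fold != i, ]
  held_out <- mtcars[fold == i, ]
  fit <- lm(mpg ~ wt + hp, data = train_rows)   # fit on the other k - 1 folds
  pred <- predict(fit, newdata = held_out)      # predict the held-out fold
  rmse_by_fold[i] <- sqrt(mean((held_out$mpg - pred)^2))
}

# Average error across folds estimates how well the model predicts new data
cv_rmse <- mean(rmse_by_fold)
print(rmse_by_fold)
print(cv_rmse)
```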

Here is a comprehensive list of BlueSky’s methods of analysis:

  1. Cluster Analysis: Hierarchical
  2. Cluster Analysis: KMeans
  3. Contingency Tables: Multiway
  4. Contingency Tables: Two-way
  5. Distributions: Continuous: Beta Probabilities
  6. Distributions: Continuous: Beta Quantiles
  7. Distributions: Continuous: Plot Beta Distribution
  8. Distributions: Continuous: Sample from Beta Distribution
  9. Distributions: Continuous: Cauchy Probabilities
  10. Distributions: Continuous: Plot Cauchy Distribution
  11. Distributions: Continuous: Cauchy Quantiles
  12. Distributions: Continuous: Sample from Cauchy Distribution
  13. Distributions: Continuous: Sample from Cauchy Distribution
  14. Distributions: Continuous: Chi-squared Probabilities
  15. Distributions: Continuous: Chi-squared Quantiles
  16. Distributions: Continuous: Plot Chi-squared Distribution
  17. Distributions: Continuous: Sample from Chi-squared Distribution
  18. Distributions: Continuous: Exponential Probabilities
  19. Distributions: Continuous: Exponential Quantiles
  20. Distributions: Continuous: Plot Exponential Distribution
  21. Distributions: Continuous: Sample from Exponential Distribution
  22. Distributions: Continuous: F Probabilities
  23. Distributions: Continuous: F Quantiles
  24. Distributions: Continuous: Plot F Distribution
  25. Distributions: Continuous: Sample from F Distribution
  26. Distributions: Continuous: Gamma Probabilities
  27. Distributions: Continuous: Gamma Quantiles
  28. Distributions: Continuous: Plot Gamma Distribution
  29. Distributions: Continuous: Sample from Gamma Distribution
  30. Distributions: Continuous: Gumbel Probabilities
  31. Distributions: Continuous: Gumbel Quantiles
  32. Distributions: Continuous: Plot Gumbel Distribution
  33. Distributions: Continuous: Sample from Gumbel Distribution
  34. Distributions: Continuous: Logistic Probabilities
  35. Distributions: Continuous: Logistic Quantiles
  36. Distributions: Continuous: Plot Logistic Distribution
  37. Distributions: Continuous: Sample from Logistic Distribution
  38. Distributions: Continuous: Lognormal Probabilities
  39. Distributions: Continuous: Lognormal Quantiles
  40. Distributions: Continuous: Plot Lognormal Distribution
  41. Distributions: Continuous: Sample from Lognormal Distribution
  42. Distributions: Continuous: Normal Probabilities
  43. Distributions: Continuous: Normal Quantiles
  44. Distributions: Continuous: Plot Normal Distribution
  45. Distributions: Continuous: Sample from Normal Distribution
  46. Distributions: Continuous: t Probabilities
  47. Distributions: Continuous: t Quantiles
  48. Distributions: Continuous: Plot t Distribution
  49. Distributions: Continuous: Sample from t Distribution
  50. Distributions: Continuous: Uniform Probabilities
  51. Distributions: Continuous: Uniform Quantiles
  52. Distributions: Continuous: Plot Uniform Distribution
  53. Distributions: Continuous: Sample from Uniform Distribution
  54. Distributions: Continuous: Weibull Probabilities
  55. Distributions: Continuous: Weibull Quantiles
  56. Distributions: Continuous: Plot Weibull Distribution
  57. Distributions: Continuous: Sample from Weibull Distribution
  58. Distributions: Discrete: Binomial Probabilities
  59. Distributions: Discrete: Binomial Quantiles
  60. Distributions: Discrete: Binomial Tail Probabilities
  61. Distributions: Discrete: Plot Binomial Distribution
  62. Distributions: Discrete: Sample from Binomial Distribution
  63. Distributions: Discrete: Geometric Probabilities
  64. Distributions: Discrete: Geometric Quantiles
  65. Distributions: Discrete: Geometric Tail Probabilities
  66. Distributions: Discrete: Plot Geometric Distribution
  67. Distributions: Discrete: Sample from Geometric Distribution
  68. Distributions: Discrete: Hypergeometric Probabilities
  69. Distributions: Discrete: Hypergeometric Quantiles
  70. Distributions: Discrete: Hypergeometric Tail Probabilities
  71. Distributions: Discrete: Plot Hypergeometric Distribution
  72. Distributions: Discrete: Sample from Hypergeometric Distribution
  73. Distributions: Discrete: Negative Binomial Probabilities
  74. Distributions: Discrete: Negative Binomial Quantiles
  75. Distributions: Discrete: Negative Binomial Tail Probabilities
  76. Distributions: Discrete: Plot Negative Binomial Distribution
  77. Distributions: Discrete: Sample from Negative Binomial Distribution
  78. Distributions: Discrete: Poisson Probabilities
  79. Distributions: Discrete: Poisson Quantiles
  80. Distributions: Discrete: Poisson Tail Probabilities
  81. Distributions: Discrete: Plot Poisson Distribution
  82. Distributions: Discrete: Sample from Poisson Distribution
  83. Factor Analysis: Factor Analysis
  84. Factor Analysis: Principal Components
  85. Market Basket: Basket data format
  86. Market Basket: Display Rules
  87. Market Basket: Multi-line transaction format
  88. Market Basket: Multiple variable format
  89. Market Basket: Plot Rules
  90. Means: T-Test, Independent Samples
  91. Means: T-Test, One Sample
  92. Means: T-Test, Paired Samples
  93. Means: ANCOVA
  94. Means: Multi-way ANOVA
  95. Means: One-way ANOVA
  96. Means: One-way ANOVA with Blocks
  97. Means: One-way ANOVA with Random Blocks
  98. Missing Values: Output Arranged in Columns
  99. Missing Values: Output Arranged in Rows
  100. Non-parametric Tests: Chisq Test
  101. Non-parametric Tests: Friedman Test
  102. Non-parametric Tests: Kruskal-Wallis Test
  103. Non-parametric Tests: Wilcoxon, Independent Samples
  104. Non-parametric Tests: Wilcoxon, Paired Samples
  105. Proportions: Binomial, Single Sample
  106. Proportions: Proportion Test, Independent Samples
  107. Proportions: Proportion Test, Single Sample
  108. Reliability Analysis (Cronbach’s Alpha, etc.)
  109. Summary Analysis: Analysis of Missing Values
  110. Summary Analysis: Frequency Table
  111. Summary Analysis: Table Top N
  112. Summary Analysis: Numerical Statistical Analysis
  113. Summary Analysis: Summary Statistics by Group
  114. Summary Analysis: Summary Statistics for All Variables
  115. Summary Analysis: Summary Statistics for Selected Variables
  116. Summary Analysis: Correlation Matrix
  117. Summary Analysis: Correlation Test (one pair)
  118. Summary Analysis: Correlation Test (Multi-variable)
  119. Summary Analysis: Shapiro-Wilk Normality Test
  120. Time Series: Automated ARIMA
  121. Time Series: Exponential Smoothing
  122. Time Series: Holt-Winters Seasonal
  123. Time Series: Holt-Winters Non-seasonal
  124. Variance: Bartlett’s Test
  125. Variance: Levene’s Test
  126. Variance Test, Two Samples
  127. Model Fitting: Contrast Display
  128. Model Fitting: Contrast Set
  129. Model Fitting: Decision Trees
  130. Model Fitting: Display Contrasts
  131. Model Fitting: GLZM
  132. Model Fitting: IRT: Simple Rasch Model
  133. Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
  134. Model Fitting: IRT: Partial Credit Model
  135. Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
  136. Model Fitting: IRT: Rating Scale Model
  137. Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
  138. Model Fitting: Linear Modeling
  139. Model Fitting: Linear Regression: Linear Regression
  140. Model Fitting: Linear Regression: Linear Regression with Formula
  141. Model Fitting: Logistic Regression: Logistic Regression
  142. Model Fitting: Logistic Regression: Logistic Regression with Formula
  143. Model Fitting: Multinomial Logit
  144. Model Fitting: Naive Bayes
  145. Model Fitting: Ordinal Regression
  146. Model Fitting: Random Forest: Random Forest
  147. Model Fitting: Random Forest: Tune Random Forest
  148. Model Fitting: Random Forest: Random Forest: Optimal Number of Trees
  149. Model Fitting: Summarizing Models for Each Group
  150. Model Tuning: Bootstrap Resample
  151. Model Tuning: K-fold Cross-Validation
  152. Model Tuning: Leave One Out Cross-Validation
  153. Model Tuning: Repeated K-fold Cross-Validation
  154. Model Tuning: AdaBoost Classification Trees
  155. Model Tuning: Bayesian Ridge Regression
  156. Model Tuning: CART
  157. Model Tuning: Naive Bayes
  158. Model Tuning: Random Forest
  159. Model Tuning: SVM (Linear Kernel)
  160. Model Tuning: SVM (Polynomial Kernel)
  161. Model Tuning: SVM (Radial Basis)
  162. Model Tuning: KNN
  163. Model Statistics: AIC
  164. Model Statistics: BIC
  165. Model Statistics: Bonferroni Outlier Test
  166. Model Statistics: Confidence Interval
  167. Model Statistics: Hosmer-Lemeshow Test
  168. Model Statistics: IRT: ICC Plots
  169. Model Statistics: IRT: Item Fit
  170. Model Statistics: IRT: Plot PI Map
  171. Model Statistics: IRT: Item and Test Information
  172. Model Statistics: IRT: Likelihood Ratio and Beta plots
  173. Model Statistics: IRT: Personfit
  174. Model Statistics: Pseudo R-Squared
  175. Model Statistics: Stepwise
  176. Model Statistics: Variance Inflation Factors

Generated R Code

One of the aspects that most differentiates the various GUIs for R is the code they generate. If you decide you want to save code, what type of code is best for you? The base R code as provided by the R Commander which can teach you “classic” R? The concise functions that mimic the simplicity of one-step dialogs such as jamovi provides? The completely transparent (and complex) code provided by RKWard, which might be the best for budding R power users?

BlueSky writes what you might call modern R code. For data management, it uses tidyverse packages; for graphics, it uses ggplot2; and for model tuning, it uses the caret package.

Here’s an example of code BlueSky wrote to do a group-by aggregation:

mySummarized <-mydata100 %>%
  dplyr::group_by(workshop,gender) %>%
  dplyr::summarize(mean_pretest=mean(pretest,na.rm =TRUE),
    mean_posttest=mean(posttest,na.rm =TRUE))
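For comparison, here is roughly how the same group-by aggregation could be written in “classic” base R. The tiny mydata100 below is a hypothetical stand-in for my data set. One difference worth knowing: the formula interface of aggregate() drops any row with a missing value in either score, which is not quite the same as applying na.rm = TRUE to each variable separately.

```r
# Hypothetical stand-in for mydata100
mydata100 <- data.frame(
  workshop = c("R", "R", "SAS", "SAS", "R", "SAS"),
  gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  pretest = c(70, 75, 80, 85, 72, 79),
  posttest = c(80, 82, 84, 90, 84, 85)
)

# Mean of both test scores within each workshop-by-gender group
mySummarized <- aggregate(cbind(pretest, posttest) ~ workshop + gender,
                          data = mydata100, FUN = mean)
names(mySummarized)[3:4] <- c("mean_pretest", "mean_posttest")
print(mySummarized)
```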

Here is an example of code BlueSky wrote to convert my repeated-measures style “long” data set to a “wide” one. The long one had three main variables: an ID variable, a factor Time, and a measure Y. The resulting wide data set had ID and four variables named Time1, Time2, Time3, and Time4. The values of Y were spread across the four time variables. Here’s the code:

require(tidyr);

Bobs_Wide <- spread(Bobs_Long,Time,Y)

BSkyLoadRefreshDataframe(Bobs_Wide,load.dataframe=TRUE)
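If you want to try the same long-to-wide conversion without BlueSky or tidyr, the base R reshape() function can do it. The data below are invented to match the structure I described (ID, a Time factor, and Y). Also note that in current versions of tidyr, spread() is superseded by pivot_wider(), though it still works.

```r
# Hypothetical long data set: three IDs measured at four times
Bobs_Long <- data.frame(
  ID = rep(1:3, each = 4),
  Time = factor(rep(paste0("Time", 1:4), times = 3)),
  Y = c(10, 12, 14, 16, 20, 22, 24, 26, 30, 32, 34, 36)
)

# Spread the values of Y across one column per level of Time
Bobs_Wide <- reshape(Bobs_Long, idvar = "ID", timevar = "Time",
                     v.names = "Y", direction = "wide")

# reshape() names the new columns Y.Time1, Y.Time2, ...; strip the prefix
names(Bobs_Wide) <- sub("^Y\\.", "", names(Bobs_Wide))
rownames(Bobs_Wide) <- NULL
print(Bobs_Wide)
```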

Below is an example of BlueSky’s code for a simple linear regression. BlueSky even provided the comments explaining each step, a nice touch! Note that it uses its own set of functions, such as BSkyRegression() instead of R’s built-in lm() function. It’s this function that does both the modeling step and the text formatting step. This is very similar to the approach used by jamovi, except that BlueSky does plotting using R’s standard plot function (one of the few times it uses it) instead of being integrated into a single regression function call.

BSkyLoadRefreshDataframe(BobsAgg)

#Builds a linear regression model. Returns an object called 
#BSkyLinearRegression which is an object of class lm. 
# Displays a summary of the model, coefficient table, 
# Anova table and sum of squares table.
LinearRegModel1= BSkyRegression(depVars ='posttest',
  indepVars =c('pretest'),dataset="Dataset2")

#Plots residuals vs. fitted, normal Q-Q, theoretical quantiles, 
#residuals vs. leverage
if(TRUE)
{
plot(LinearRegModel1)
}
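To make the contrast with built-in R concrete, here is a rough sketch of the same analysis using lm() directly. The data values are invented, LinearRegModelBase is a hypothetical name, and the mapping of BSkyRegression()'s output onto summary() and anova() is my reading of the generated comments above, not a statement about how that function is implemented.

```r
# Hypothetical stand-in for the BobsAgg data set (values invented).
BobsAgg <- data.frame(
  pretest  = c(60, 65, 70, 75, 80, 85, 90, 95),
  posttest = c(68, 70, 79, 80, 88, 90, 97, 99)
)

# Modeling step: plain lm() instead of BSkyRegression().
LinearRegModelBase <- lm(posttest ~ pretest, data = BobsAgg)

# Reporting steps, done one call at a time in base R.
print(summary(LinearRegModelBase))  # model summary and coefficient table
print(anova(LinearRegModelBase))    # ANOVA table with sums of squares

# plot(LinearRegModelBase) would then draw the standard diagnostic plots,
# just as the plot() call does in the BlueSky code.
```

The output of these calls is plain monospaced text, which is the formatting work that BlueSky's wrapper function takes off your hands.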

Support for Programmers

Some of the GUIs reviewed in this series of articles include extensive support for programmers. For example, RKWard offers much of the power of Integrated Development Environments (IDEs) such as RStudio or Eclipse StatET. Others, such as jamovi or the R Commander, offer little more than a simple text editor.

While BlueSky’s main mission is to make their point-and-click GUI comprehensive, it does include a basic program editor which supports the writing and debugging of code. The code editor is hidden at start-up, but an arrow at the upper right corner of the output window will pop open the code editor at any time (and pop it closed, if already open). A click on the Syntax button in any dialog box will also pop the code editor open.

The code editor supports syntax highlighting, and it can collapse and expand blocks of code. It also offers some hints on function name completion. For example, typing “m” will cause it to offer “min” and “max” functions, but oddly enough, it will not offer “mean” or “median.” It doesn’t provide hints on argument names or values, nor does it offer to complete object names. RStudio and RKWard both offer much more support for coders.

However, the lack of features for coders offers a benefit to GUI users: nearly all the menus and their entries are focused on GUI use. In this regard, BlueSky is the mirror image of RKWard, which has several menus full of features that only coders use.

Reproducibility & Sharing

One of the biggest challenges that GUI users face is being able to reproduce their work. Reproducibility is useful for re-running everything on the same dataset if you find a data entry error. It’s also useful for applying your work to new datasets so long as they use the same variable names (or the software can handle name changes). Some scientific journals ask researchers to submit their files (usually code and data) along with their written report so that others can check their work.

As important a topic as it is, reproducibility is a problem for GUI users, a problem that has only recently been solved by some software developers. Most GUIs (e.g. the R Commander, Rattle) save only code, but since the GUI user didn’t write the code, they also can’t read it or change it! Others such as jamovi, RKWard, and the newest version of SPSS save the dialog box entries and allow GUI users to have reproducibility in the form they prefer.

BlueSky offers only code-based reproducibility. There’s no way to get back to a filled-in dialog box when starting from the saved code.

If you wish to share your work with a colleague, you would send them the code and your data set. They could then install the appropriate version of BlueSky to run it. They could also install the “BlueSky Statistics R Package”, enabling them to run the code in any R environment. At the moment, that package is only available for download from the company web site. However, the developers plan on moving it to CRAN eventually.

Package Management

A topic related to reproducibility is package management. One of the major advantages to the R language is that it’s very easy to extend its capabilities through add-on packages. However, updates in these packages may break a previously functioning analysis. Years from now you may need to run a variation of an analysis, which would require you to find the version of R you used, plus the packages you used at the time. As a GUI user, you’d also need to find the version of the GUI that was compatible with that version of R.
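If you run saved code in plain R, one simple habit that helps with this problem is to record the versions in use alongside your results. The sketch below is a general R technique, not a BlueSky feature, and the package list is only illustrative; substitute the packages your own code loads.

```r
# Packages to record. This list is an assumption for illustration;
# these three ship with R, so the example runs anywhere.
pkgs <- c("stats", "graphics", "utils")

# One row per package with its installed version.
versions <- data.frame(
  package   = pkgs,
  version   = vapply(pkgs, function(p) as.character(packageVersion(p)),
                     character(1)),
  row.names = NULL
)

cat(R.version.string, "\n")  # the version of R itself
print(versions)

# sessionInfo() gives a fuller report, including the operating system
# and every attached package.
```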

Some GUIs, such as the R Commander and Deducer, depend on you to find and install R. For them, the problem of long-term stability is yours to solve. Others, such as jamovi, distribute their own version of R, and all R packages, but not their add-on modules. This requires a bigger installation file, but it makes dealing with long-term stability simpler. Of course, this depends on all major versions being around for the long term, but for open-source software, there are usually multiple archives available to store software even if the original project is defunct.

BlueSky’s approach to package management is the most comprehensive of the R GUIs reviewed here. It provides everything you need in a single download. This includes the BlueSky interface, R itself, all R packages, and all BlueSky plug-ins. If you have a problem reproducing a BlueSky analysis in the future, all you need to do is download the version used when you created it.

Output & Report Writing

Ideally, output should be clearly labeled, well organized, and of publication quality. It might also delve into the realm of word processing through Sweave/knitr and Rmarkdown documents. At the moment, none of the GUIs covered in this series of reviews meets all of these requirements. See the separate reviews to see how each of the other packages is doing on this topic.

The labels for each of BlueSky’s analyses are provided by its menu title, e.g. Linear Regression. However, double-clicking on the title in the output switches it into edit mode where you can change it to anything you like. Unfortunately, there is no way to add comments or notes in the output, but of course you can do so in the code that it generates in the program editor.

The organization of the output is in time-order only, and you cannot delete any of the steps you take. This often results in a messy output file filled with unneeded results. A table of contents will pop out of the left side of the output window when you choose “Layout> Show Navigation Tree.” While such tables of contents are commonly used in GUIs to let you re-order, rename, or delete bits of output, those tasks are not possible here. In the navigation tree, you can un-check any output to hide it, but it’s not deleted. You are better off keeping a word processing file open to paste in the results you want to keep.

BlueSky’s output quality is very high, with nice fonts of your choosing and true rich text tables (see Figure 5). To have them display using the popular style of the American Psychological Association (see Table 1) save the setting: “Options> Configuration Settings> Others> Show output tables in APA style.” From that point on, all your output tables will use APA format. You can then right-click on any table and choose “Export to Word (or Excel)” and the formatting is retained. That really helps speed your work as R output defaults to mono-spaced fonts that require additional steps to get into publication form (e.g. using functions from packages such as xtable or texreg). You can also choose “Copy to Clipboard”, but pasting from there into Word will lose the full formatting, while still remaining a true table. All the output is stored in a single file, which can be exported to PDF and from there edited in Microsoft Word.

A nice feature of BlueSky’s output tables is that they are all interactive. So if you have a complex model you’re studying, you can easily sort the output by p-value, or parameter size, or any column you choose. That’s a nice and fairly unique feature.

Figure 5. Publication-quality output created by BlueSky.

Group-By Analyses

Repeating an analysis on different groups of observations is a core task in data science. Software needs to provide the ability to select one group as a subset to analyze, then another subset to compare it to. All the R GUIs reviewed in this series can do this task. BlueSky does single-group selections in “Data> Subset”. It generates a subset that you can analyze in the same way as the entire dataset.

Software also needs the ability to automate such selections so that you might generate dozens of analyses, one group at a time. While this has been available in commercial GUIs for decades (e.g. SPSS split-file), BlueSky is the only R GUI that includes this feature. BlueSky automates group-by analyses under “Split> For Analysis> Split”. All analyses that follow will be done repeatedly for each level of the factor(s) chosen. This feature is turned off via “Split> For Analysis> Remove Split.”
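For a sense of what such a split automates, here is a minimal sketch of the same idea in plain R using split() and lapply(). It is not the code BlueSky runs behind its Split menu; the data frame and the names groups and models are invented for illustration.

```r
# Hypothetical data: two workshops, five people each (values invented).
mydata <- data.frame(
  workshop = factor(rep(c("R", "SAS"), each = 5)),
  pretest  = c(60, 65, 70, 75, 80, 62, 68, 71, 77, 83),
  posttest = c(66, 72, 75, 83, 86, 64, 71, 73, 80, 85)
)

# One data frame per level of the grouping factor.
groups <- split(mydata, mydata$workshop)

# Repeat the same analysis once for each group.
models <- lapply(groups, function(g) lm(posttest ~ pretest, data = g))

# Report each group's results under a label, one group at a time.
for (level in names(models)) {
  cat("\nWorkshop:", level, "\n")
  print(coef(models[[level]]))
}
```

With a GUI split in place, every analysis chosen from the menus behaves as if it were wrapped in a loop like this one.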

Output Management

Early in the development of statistical software, developers tried to guess what output would be important to save to a new dataset (e.g. predicted values, factor scores), and the ability to save such output was built into the analysis procedures themselves. However, researchers were far more creative than the developers anticipated. To better meet their needs, output management systems were created and tacked on to existing tools (e.g. SAS’ Output Delivery System, SPSS’ Output Management System). One of R’s greatest strengths is that every bit of output can be readily used as input. However, for the simplification that GUIs provide, that’s a challenge.

Output data can be observation-level, such as predicted values for each observation or case. When group-by analyses are run, the output data can also be observation-level, but now the (e.g.) predicted values would be created by individual models for each group, rather than one model based on the entire original data set (perhaps with group included as a set of indicator variables).

Group-by analyses can also create model-level data sets, such as one R-squared value for each group’s model. They can also create parameter-level data sets, such as the p-value for each regression parameter for each group’s model. (Saving and using single models is covered under “Modeling” above.)

For example, in our organization, we have 250 departments and want to see if any of them have a gender bias on salary. We write all 250 regression models to a data set, and then search to find those whose gender parameter is significant (hoping to find none, of course!).

BlueSky is the only R GUI reviewed here that does all three levels of output management. To use this function, choose “Model Fitting> Summarizing models for each group”, then specify the model and the grouping factor. It automatically creates three data sets, one at each level of analysis. This ability works only for regression, ANOVA, and multinomial logistic models. More are planned for future versions.
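The three levels are easier to picture with a small example. The sketch below builds all three data sets by hand in base R for a scaled-down version of the salary scenario, with three departments instead of 250. It illustrates the concept only and does not show how BlueSky implements the feature; the data and every object name are invented.

```r
# Hypothetical salary data: 3 departments, 6 people each (values invented).
salaries <- data.frame(
  dept   = factor(rep(c("A", "B", "C"), each = 6)),
  gender = factor(rep(c("F", "M"), times = 9)),
  years  = rep(1:6, times = 3),
  salary = c(50, 53, 54, 58, 59, 62,
             48, 55, 52, 60, 57, 66,
             51, 52, 56, 57, 60, 63)
)

# One regression model per department.
fits <- lapply(split(salaries, salaries$dept),
               function(d) lm(salary ~ years + gender, data = d))

# Model level: one row per group.
model_level <- data.frame(
  dept      = names(fits),
  r_squared = vapply(fits, function(m) summary(m)$r.squared, numeric(1)),
  row.names = NULL
)

# Parameter level: one row per coefficient per group.
parameter_level <- do.call(rbind, lapply(names(fits), function(d) {
  ct <- coef(summary(fits[[d]]))
  data.frame(dept = d, term = rownames(ct),
             estimate = ct[, "Estimate"], p_value = ct[, "Pr(>|t|)"],
             row.names = NULL)
}))

# Observation level: each person's predicted value from their own group's model.
salaries$predicted <- unname(unsplit(lapply(fits, fitted), salaries$dept))

# Search for departments whose gender parameter is significant at 0.05.
print(subset(parameter_level, term == "genderM" & p_value < 0.05))
```

Once the parameter-level data set exists, the search for significant gender effects is just an ordinary subset on a data frame.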

While BlueSky is ahead of the GUI pack in output management, the approach listed above still makes judgment calls about what output is useful for further analysis. What would you do to analyze an output table not covered by the above methods? Recall that all BlueSky output tables are true tables that can be exported to Word or Excel. Using that approach, you could save any table you like, export it and then open it as a data set to analyze. It’s not the most elegant approach, but it is quite comprehensive.

Developer Issues

There are two ways developers can contribute to the open source project:

  1. Developers who want to add to or modify the application, e.g. to provide new right-click controls or integration with big data libraries like Hadoop and Spark, can download the source code from https://github.com/BlueSkyStatistics/BlueSkyRepository.
  2. Programmers who want to add new statistical analyses to BlueSky Statistics should watch the training videos on the dialog editor program.

Conclusion

BlueSky Statistics offers an extensive set of tools that are easy for a point-and-click user to use. If you’re looking for a GUI that lets you do the most using just menus and dialog boxes, BlueSky should be on your list of software to try. BlueSky and R Commander are both way out in front of the R GUI competition when it comes to breadth of coverage in data management, graph types, and methods of analysis. I encourage you to read both reviews carefully when choosing between these two. Also keep in mind that while jamovi is newer and currently has fewer features, its developers are adding new ones at a rapid pace.

For a summary of all my R GUI software reviews, see the article, R Graphical User Interface Comparison.

Acknowledgements

Thanks to the BlueSky team who have done a lot of hard work and made all but the terminal server version of it free and open source. Thanks also to Rachel Ladd, Ruben Ortiz, Christina Peterson, and Josh Price for their editorial suggestions.
