The BlueSky Statistics graphical user interface (GUI) for the R language has added quite a few new features (described below). I’m also working on a *BlueSky User’s Guide*, a draft of which you can read about and download here. Although I’m spending a lot of time on BlueSky, I still plan to be as obsessive as ever about reviewing all (or nearly all) of the R GUIs, a comparison that is summarized here.

The new data management features in BlueSky are:

- Date Order Check – quickly checks the dates stored across many variables and reports any rows whose dates do not increase from left to right.
- Find Duplicates – generates a report of duplicates and saves a copy of the data set with the duplicates removed. Duplicates can be based on all variables, or on just a set of ID variables.
- Select First/Last Observation per Group – creates new datasets from the first or last observation in each group, letting you keep the “best” or “worst” case per group, find the most current record, and so on.
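To give a feel for the last of these, here is an illustrative dplyr equivalent of selecting the last observation per group (a sketch of the idea, not the exact code the BlueSky dialog generates):

```r
# Illustrative sketch: keep the most recent record per id,
# similar in spirit to BlueSky's Select First/Last Observation per Group.
library(dplyr)

df <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  visit = as.Date(c("2020-01-01", "2020-06-01",
                    "2020-02-15", "2020-03-10", "2020-09-01")),
  score = c(10, 12, 7, 9, 8)
)

# Keep the row with the latest visit date within each id
latest <- df %>%
  group_by(id) %>%
  slice_max(visit, n = 1) %>%
  ungroup()
```

Swapping `slice_max()` for `slice_min()` would keep the earliest record per group instead.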

**Model Fitting / Tuning**

One of the more interesting features in BlueSky is the pair of menus it calls Model Fitting and Model Tuning. Model Fitting gives you direct control over the R function that does the work. That provides precise control over every setting, and it can teach you the code that the menus create, but it also leaves model tuning up to you. However, it does standardize scoring, so you do not have to keep up with the wide range of parameters that each of those functions needs for scoring. Model Tuning controls models through the caret package, which lets you do things like K-fold cross-validation and model tuning. However, it does not allow control over *every* model setting.
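For readers who have not used caret directly, the kind of work the Model Tuning dialogs drive looks roughly like this (a minimal sketch, not BlueSky’s generated code; the dataset and tuning grid are my own choices):

```r
# Minimal caret sketch: 10-fold cross-validation to tune k
# for K Nearest Neighbors, as BlueSky's Model Tuning menus do behind the scenes.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 10)

fit <- train(Species ~ ., data = iris,
             method    = "knn",
             trControl = ctrl,
             tuneGrid  = data.frame(k = c(3, 5, 7, 9)))

fit$bestTune   # the k value chosen by cross-validated accuracy
```

Changing `method` in `trainControl()` to `"LOOCV"` or `"repeatedcv"` gives the leave-one-out and repeated K-fold entries in the menu list below.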

New Model Fitting menu items are:

- Cox Proportional Hazards Model: Cox Single Model
- Cox Multiple Models
- Cox with Formula
- Cox Stratified Model
- Extreme Gradient Boosting
- KNN
- Mixed Models
- Neural Nets: Multi-layer Perceptron
- neuralnet (i.e., the R package of that name)
- Quantile Regression

There are so many Model Tuning entries that it’s easier to just paste in the list from the main BlueSky review, which I updated earlier this morning:

- Model Tuning: Adaboost Classification Trees
- Model Tuning: Bagged Logic Regression
- Model Tuning: Bayesian Ridge Regression
- Model Tuning: Boosted trees: gbm
- Model Tuning: Boosted trees: xgbtree
- Model Tuning: Boosted trees: C5.0
- Model Tuning: Bootstrap Resample
- Model Tuning: Decision trees: C5.0tree
- Model Tuning: Decision trees: ctree
- Model Tuning: Decision trees: rpart (CART)
- Model Tuning: K-fold Cross-Validation
- Model Tuning: K Nearest Neighbors
- Model Tuning: Leave One Out Cross-Validation
- Model Tuning: Linear Regression: lm
- Model Tuning: Linear Regression: lmStepAIC
- Model Tuning: Logistic Regression: glm
- Model Tuning: Logistic Regression: glmnet
- Model Tuning: Multivariate Adaptive Regression Splines (MARS, via the earth package)
- Model Tuning: Naive Bayes
- Model Tuning: Neural Network: nnet
- Model Tuning: Neural Network: neuralnet
- Model Tuning: Neural Network: dnn (Deep Neural Net)
- Model Tuning: Neural Network: rbf
- Model Tuning: Neural Network: mlp
- Model Tuning: Random Forest: rf
- Model Tuning: Random Forest: cforest (uses ctree algorithm)
- Model Tuning: Random Forest: ranger
- Model Tuning: Repeated K-fold Cross-Validation
- Model Tuning: Robust Linear Regression: rlm
- Model Tuning: Support Vector Machines: svmLinear
- Model Tuning: Support Vector Machines: svmRadial
- Model Tuning: Support Vector Machines: svmPoly

You can download the free open-source version from https://BlueSkyStatistics.com.

Wow, impressive work! Would it make sense to integrate the OneR package as a baseline model for classification? See also: https://cran.r-project.org/web/packages/OneR/vignettes/OneR.html or for a quick intro: https://blog.ephorie.de/oner-machine-learning-in-under-one-minute.

Hi Holger,

I’m familiar with OneR and I’m very much in favor of its goal. However, I’m under the impression that CORELS is almost certain to find a better model that is just as interpretable. Check out the video here: https://www.youtube.com/watch?v=ebJHnDLLTKA. This has only been added to R in the past month or so: https://github.com/eddelbuettel/tidycorels.

Cheers,

Bob

Hi Bob,

I would respectfully disagree. Just take the example given there:

OneR gives:

```
Rules:
If wt = (1.51,2.99] then am = 1
If wt = (2.99,5.43] then am = 0

Accuracy:
29 of 32 instances classified correctly (90.62%)
```

Very high accuracy, and the rules are simpler!

best

h

Hi Holger,

Thanks for the comparison. That CORELS example got 31/32, or 97%, accuracy, though using a more complex rule set. One of the problems with CORELS is that all continuous variables must be dichotomized. Tidymodels makes that easy to do, but then that step is moved from the modeling step to the data preparation step. That made me wonder whether combining CORELS with another method that chose better cut-points for the numeric variables would improve its accuracy. Also, by manually setting labels for the dummy variables, the rules could be more easily interpreted, like “If wt=heavy…” instead of “If wt=bin1…”.
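That data-preparation step might be sketched with the recipes package like this (an illustrative sketch of binning a numeric predictor, not the exact tidymodels code the CORELS examples use):

```r
# Sketch: bin the numeric predictor wt into two groups with recipes,
# the kind of dichotomization CORELS requires before modeling.
library(recipes)

rec <- recipe(am ~ wt, data = mtcars) %>%
  step_discretize(wt, num_breaks = 2)

binned <- bake(prep(rec), new_data = NULL)
table(binned$wt)   # wt has been binned into two groups
```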

Cheers,

Bob

Yes, in OneR all of this is done in two lines of code and completely automatically:

```r
library(OneR)

data <- optbin(am ~ ., data = mtcars)
OneR(data, verbose = TRUE)
```

Also, the exact cut-points are optimized on the fly.