R has a wide variety of machine learning (ML) models. While the many ML functions solve similar problems by predicting various outcomes, they use a confusing array of different command styles, making them hard to learn. Fortunately, the caret package provides a standard approach to dozens of ML functions, greatly speeding learning and use. This two-day hands-on workshop starts with ML basics and takes you step-by-step through increasingly complex modeling styles.
Most of our time will be spent working through examples that you may run simultaneously on your computer. You will see both the instructor’s screen and yours, side-by-side, as we run the examples and discuss the output. However, the handouts include each step for people who prefer to just take notes.
This workshop is available at your organization’s site, or via webinars.
The on-site version is the most engaging by far, generating much discussion and occasionally veering off briefly to cover topics specific to a particular organization. The instructor presents a topic for around twenty minutes. Then we switch to exercises, which are already open in another tabbed window. The exercises contain hints that show the general structure of the solution; you adapt those hints to get the final solution. The complete solutions are in a third tabbed window, so if you get stuck the answers are a click away. The typical schedule for training on site is located here.
A webinar version is also available. This approach saves travel expenses and is especially useful for organizations with branch offices. It’s offered as two half-day sessions, often with a day or two skipped in between to give participants a chance to do the exercises on their own and catch up on other work. There is time for questions on the lecture topics (live) and the exercises (via email). However, webinar participants are typically much less engaged, and far less discussion takes place.
For further details or to arrange a webinar or site visit, contact the instructor, Bob Muenchen, at muenchen.bob@gmail.com.
Prerequisites
This workshop assumes a basic knowledge of R. Introductory knowledge of statistics is helpful, but not required.
Learning Outcomes
When finished, participants will be able to use R to apply the most popular and effective machine learning models to make predictions and assess the likely accuracy of those predictions.
Presenter
Robert A. Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular website devoted to analyzing trends in data science software, reviewing such software, and helping people learn the R language. Of the over 750 R blogs on the Internet, Feedspot rates r4stats.com the eleventh most influential.
Bob is an ASA Accredited Professional Statistician™ with 35 years of experience and is currently the manager of OIT Research Computing Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has taught workshops on research computing topics for more than 500 organizations and has presented workshops in partnership with the American Statistical Association, RStudio, DataCamp.com, New Horizons Computer Learning Centers, Revolution Analytics (acquired by Microsoft), and Xerox Learning Services. Bob has written or coauthored over 70 articles published in scientific journals and conference proceedings and has provided guidance on more than 1,000 graduate theses and dissertations.
Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, the Statistical Graphics Corporation, and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, StatAce, JMP, jamovi, BlueSky Statistics, STATGRAPHICS and numerous R packages. His research interests include statistical computing, data graphics and visualization, text analytics, and data mining.
Computer Requirements
On-site training is best done in a computer lab with a projector and, for large rooms, a PA system. The webinar version is delivered to your computer using Zoom (or a similar webinar system if your organization has a preference).
Course programs, data, and exercises will be sent to you a week before the workshop. The instructions include installing R, which you can download for free here: http://www.r-project.org/. We will also use RStudio, which you can download for free here: http://RStudio.com. If you already know a different R editor, that’s fine too.
Course Outline
- INTRODUCTION
1.1 Topics
1.2 Preparing Your Computer
1.3 Note to System Administrators
1.4 Workshop Files
- OVERVIEW OF MACHINE LEARNING
2.1 Definition
2.2 Goals of Machine Learning
2.3 Types of Problems
2.4 Types of Machine Learning
2.5 Statistical Models
2.6 Heuristic Models
2.7 Bagging Models
2.8 Boosting Models
2.9 Data Set Size
2.10 Model Building Steps
2.11 Allocating Data: Train, Validation, & Test
2.12 Types of Splits
2.13 Fitting Issues
2.14 Feature Selection
2.15 Feature Selection Approaches
- INTRO TO THE caret PACKAGE
3.1 Formula Method vs. Non-formula
3.2 Each R Package May Predict Differently
3.3 Goals of the caret Package
3.4 caret References
3.5 Types of Models in caret
3.6 Type of Predictions by Model
3.7 Types of Functions in caret
- DATA PRE-PROCESSING
4.1 The caret::preProcess Function
4.2 preProcess Tasks
4.3 Pre-processing the titanic Dataset
4.4 Examining Factor Variables
4.5 Setting Value to Model
4.6 Examining Numeric Variables
4.7 Examining Relation to Target
4.8 Creating Some Problems to Diagnose
4.9 Counting Missing Values
4.10 Visualizing Missing Values
4.11 Imputing Missing Values
4.12 Finding High Correlations
4.13 Finding High Correlation Automatically
4.14 Getting Names of Highly Correlated vars
4.15 Finding Variables with Near Zero Variance (NZV)
4.16 Automatically Finding and Fixing Problems
4.17 caret’s Imputation Options
4.18 Fixing All Problems at Once
4.19 What is a preProcess Object?
4.20 Details of What preProcess Will Do
4.21 Box-Cox / Yeo-Johnson Lambda Values
4.22 How to Use preProcess to Fix the Problems
4.23 Effect of Yeo-Johnson Normalization
4.24 Practice Time
- PRINCIPAL COMPONENTS ANALYSIS
5.1 Prepare the Workspace
5.2 What PCA Does
5.3 Visualize Variance of Original Variables
5.4 Now Create Principal Components
5.5 Plot the Principal Components
5.6 Interpreting PC Construction
5.7 Practice Time
- DUMMY VARIABLES
6.1 Preparing the Workspace
6.2 Dummy Variables Defined
6.3 Creating Dummy Variables Manually
6.4 Creating Dummy Variables Automatically
6.5 Adding Auto-Dummies
6.6 Practice Time
- PARTITIONING DATA SETS
7.1 Preparing the Workspace
7.2 Select Rows to Train On
7.3 Create Train and Test Data Frames
- FEATURE SELECTION
8.1 Preparing the Workspace
8.2 Example Data Set
8.3 Supervised vs. Unsupervised
8.4 Built-in Feature Selection
8.5 Filter Methods
8.6 Selection By Filtering (SBF)
8.7 Random Forest SBF
8.8 Saving the Variables Chosen by SBF
8.9 Wrapper Methods
8.10 Wrapped Feature Extraction (RFE)
8.11 Recursive Feature Elimination (RFE)
8.12 Impact of Number of Variables
8.13 Saving the Variables Chosen by RFE
8.14 Practice Time
- CONTROLLING MODEL TRAINING
9.1 Prepare the Workspace
9.2 The trainControl Function
9.3 Control Arguments
9.4 Measuring Model Quality
9.5 trainControl Example
9.6 Timing Training
9.7 Practice Time
- NAIVE BAYES
10.1 Prepare the Workspace
10.2 NB Algorithm
10.3 NB Advantages
10.4 NB Disadvantages
10.5 NB Model Training
10.6 Interpreting NB Model Tables
10.7 Interpreting NB Model Plots
10.8 NB Predictions & Validation
10.9 Save Model for Future Use
10.10 Practice Time
- CLASSIFICATION AND REGRESSION TREES
11.1 Prepare the Workspace
11.2 CART & Ctree Algorithms
11.3 CART & Ctree Advantages
11.4 CART & Ctree Disadvantages
11.5 Tree Model Training
11.6 Tree Model Plot using partykit
11.7 Tree Model Plot Using rpart.plot
11.8 Tree Model Converted to Rules
11.9 Tree Variable Importance
11.10 Tree Predictions & Validation
11.11 Practice Time
- RANDOM FORESTS
12.1 Prepare the Workspace
12.2 RF Algorithm
12.3 RF Advantages
12.4 RF Disadvantages
12.5 RF Train the Model
12.6 RF Variable Importance
12.7 RF Prediction & Validation
12.8 Save RF Model for Future Use
12.9 Practice Time
- GRADIENT BOOSTING MACHINES (gbm)
13.1 Prepare the Workspace
13.2 gbm Algorithm
13.3 gbm Advantages
13.4 gbm Disadvantages
13.5 gbm Model Training
13.6 gbm Variable Importance
13.7 gbm Predictions & Validation
13.8 Practice Time
- NEURAL NETWORKS (NN)
14.1 Prepare the Workspace
14.2 Neural Network Algorithm
14.3 Neural Network Advantages
14.4 Neural Network Disadvantages
14.5 Neural Network Training
14.6 Neural Network Variable Importance
14.7 Neural Network Tuning Plot
14.8 Neural Network Prediction & Validation
14.9 Practice Time
- SUPPORT VECTOR MACHINES
15.1 Prepare the Workspace
15.2 SVM Algorithm
15.3 SVM Advantages
15.4 SVM Disadvantages
15.5 SVM Training Model
15.6 SVM Variable Importance
15.7 SVM Plots
15.8 SVM Prediction & Validation
15.9 Practice Time
- LOGISTIC REGRESSION (LR)
16.1 Prepare the Workspace
16.2 LR Algorithm
16.3 LR Advantages
16.4 LR Disadvantages
16.5 LR Training Model
16.6 LR Variable Importance
16.7 LR Interpretation
16.8 LR Prediction & Validation
16.9 Practice Time
- DISCRIMINANT ANALYSIS
17.1 Prepare the Workspace
17.2 LDA/QDA Algorithm
17.3 LDA/QDA Advantages
17.4 LDA/QDA Disadvantages
17.5 LDA Training
17.6 LDA Common Functions
17.7 LDA Interpret Model
17.8 LDA Study Scores Manually
17.9 LDA Prediction & Validation
17.10 Practice Time
- LINEAR REGRESSION ANALYSIS
18.1 Prepare the Workspace
18.2 Regression Analysis Algorithm
18.3 Regression Analysis Advantages
18.4 Regression Analysis Disadvantages
18.5 Regression Training
18.6 Regression Diagnostic Plots
18.7 Regression Analysis Model Interpretation
18.8 Linear Regression Prediction & Validation
18.9 Practice Time
- ROC CURVES
19.1 Prepare the Workspace
19.2 ROC Curve Defined
19.3 Plotting the ROC Curve
19.4 Extracting Measures from ROC Curves
19.5 Overlaying ROC Curves
19.6 Testing Significance Between ROC Curves
19.7 Practice Time
- CLASS IMBALANCE ISSUES
20.1 Prepare the Workspace
20.2 The Problem
20.3 Modeling Solutions
20.4 Changing Cutoff
20.5 Changing Prior Probabilities
20.6 Changing Case Weights
20.7 Sampling Changes
20.8 Synthetic Data
20.9 Practice Time
- MODEL TUNING GRIDS
21.1 Prepare the Workspace
21.2 Tuning Parameters
21.3 Examining rpart Parameters
21.4 An Example Tuning Grid
21.5 Training with Parameter Grid
21.6 Predicting with Tuning Grid
21.7 Practice Time
- INTERPRETING BLACK-BOX MODELS
22.1 Preparing the Workspace
22.2 Global vs. Local Interpretation
22.3 Creating a Profile Grid
22.4 The Resulting Grid
22.5 Creating a More Interpretable Model
22.6 Generating an Interpretable Model
22.7 Practice Time
- CHOOSING A MODEL
23.1 Summary of Above Models
23.2 Choose for Performance
23.3 Choose for Understanding
23.4 Choose for Ease of Deployment & Maintenance
23.5 Choose for Speed
- CONCLUSION
24.1 Brief Review
24.2 Feedback
24.3 Future Support
24.4 Question Time