Managing Data with R

Before you can analyze data, it must be in the right form. Getting it into that form is often where we spend most of our time. This two-day workshop shows how to perform the most commonly used data management tasks in R. We will cover how to use R’s most popular add-on packages (dplyr, stringr, lubridate, tidyr, broom, compare, sqldf, etc.) and compare them to R’s older built-in functions.
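As a rough illustration of the kind of comparison the workshop makes, the sketch below performs the same task (choose rows, choose columns, sort) with R's built-in subscripting and with dplyr. The `scores` data frame and its variable names are invented for this example and are not taken from the course materials. The dplyr half runs only if that package is installed.

```r
# Hypothetical data invented for illustration; not from the workshop handouts.
scores <- data.frame(
  id     = 1:6,
  gender = c("f", "m", "f", "m", "f", "m"),
  score  = c(88, 92, 79, 65, 95, 71)
)

# Built-in approach: a logical subscript chooses rows, a character vector
# chooses columns, and order() sorts from highest to lowest score.
base_result <- scores[scores$score > 75, c("id", "score")]
base_result <- base_result[order(-base_result$score), ]
rownames(base_result) <- NULL  # reset row names left over from subsetting

# dplyr approach: the same steps as a pipeline of verbs.
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))
  dplyr_result <- scores %>%
    filter(score > 75) %>%
    select(id, score) %>%
    arrange(desc(score))
  # Check whether the two approaches agree
  print(all.equal(base_result, as.data.frame(dplyr_result)))
}
```

The built-in version repeats the data frame name at each step, while the dplyr version reads top to bottom as a sequence of verbs. Which style is clearer is partly a matter of taste.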

Most of our time will be spent working through examples that you may run simultaneously on your computer. You will see both the instructor’s screen and yours, side-by-side, as we run the examples and discuss the output. However, the handouts include each step and its output, so feel free to skip the computing; it’s easy to just relax and take notes.

Most of the examples come from the instructor’s highly regarded books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review later what we did, with full explanations, or to learn more about a particular subject by extending an example that you have already seen.

The workshops are available on-site or via webinar.

The on-site workshops are the most thorough since direct face-to-face interaction is the most flexible. The instructor presents a topic for around twenty minutes. Then we switch to exercises, which are already open in another tabbed window. The exercises contain hints that show the general structure of the solution that you adapt to get the final solution. The complete solutions are in a third tabbed window, so if you get stuck the answers are a click away. There is plenty of time to handle in-depth questions on any of the topics covered, and the discussion often veers off into a broad range of interesting areas. The usual schedule for an on-site workshop is here.

The webinar version is particularly easy to work into a busy schedule. It’s offered in two half-day sessions with a day or two skipped in between to give participants a chance to do the exercises on their own and catch up on other work. There is time for questions on the lecture topics (live) and the exercises (via email). The lecture is recorded and available for review for 30 days.

For further details or to arrange a site visit, contact the instructor, Bob Muenchen, at muenchen.bob@gmail.com.

Prerequisites

Attendees should know basic R programming, including how to read data files and call functions.

Learning Outcomes

When finished, participants will be able to prepare most data sets for analysis.

Presenter

Robert A. Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an ASA Accredited Professional Statistician™ with 33 years of experience and is currently the manager of OIT Research Computing Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has taught workshops on research computing topics for more than 500 organizations and currently offers training in partnership with DataCamp.com, Revolution Analytics, RStudio, New Horizons Computer Learning Centers, and Xerox Learning Services. Bob has written or coauthored over 70 articles published in scientific journals and conference proceedings, and has provided guidance on more than 1,000 graduate theses and dissertations.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, Intuitics, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analytics, and data mining.

Computer Requirements

On-site training is best done in a computer lab with a projector and, for large rooms, a PA system. The interactive video version requires only a web browser and an Internet connection fast enough to display video.

Course programs, data, and exercises will be sent a week before the workshop. The instructions include installing R, which you can download for free here: http://www.r-project.org/. We will also use RStudio, which you can download for free here: http://RStudio.com. If you already know a different R editor, that’s fine too.
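Once R is installed, a quick check like the following sketch can show which add-on packages still need to be installed. The package list here is only the one named in the workshop description above. The instructions sent with the course materials may list others, so treat this as an assumption rather than the official setup script.

```r
# Packages named in the workshop description; the list sent with the
# course materials may differ (assumption).
needed <- c("dplyr", "stringr", "lubridate", "tidyr",
            "broom", "compare", "sqldf")

# Compare against the packages already installed in this R library.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(needed, installed)

if (length(missing_pkgs) == 0) {
  message("All listed packages are installed.")
} else {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install from CRAN (requires an Internet connection):
  # install.packages(missing_pkgs)
}
```

The `install.packages` call is left commented out so that the check itself never changes your system.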

Course Outline

1. INTRODUCTION
1.1 Topics
1.2 Preparing Your Computer
1.3 Note to System Administrators

2. INTRODUCTION TO THE TIDYVERSE
2.1 Tidyverse Packages
2.2 Tibble Creation
2.3 Tibbles Improve Printing
2.4 Other Tibble Advantages
2.5 Tibble Disadvantages
2.6 Tibble Conversions
2.7 The dplyr Package’s Verbs
2.8 dplyr Input & Output
2.9 Practice Time

3. CHOOSING VARIABLES AND OBSERVATIONS
3.1 Using Subscripts
3.2 Using dplyr Functions
3.3 Variations on select
3.4 Dropping Variables
3.5 Table of Logical Comparisons
3.6 Practice Time

4. COMBINING PROGRAMMING STEPS
4.1 Nesting Only
4.2 Piping
4.3 Saving Results for Re-use
4.4 Piping Details
4.5 Piping to a Specific Argument
4.6 Think About Your Steps
4.7 Summary
4.8 Practice Time

5. COPYING & DELETING OBJECTS
5.1 Copying Objects
5.2 Copying Variables
5.3 Removing/Dropping/Deleting Variables
5.4 Removing/Dropping/Deleting Entire Objects
5.5 Practice Time

6. RENAMING DATA SETS, VARIABLES, & ROWS
6.1 Renaming Objects
6.2 Renaming Big Objects
6.3 Renaming Variables with dplyr
6.4 Renaming All Variables Using “names”
6.5 Copying Names From Another Data Frame
6.6 Renaming a Block of Names, Step 1
6.7 Renaming a Block of Names, Step 2
6.8 Renaming Thousands of Variables
6.9 Choosing Best Variable Renaming Method
6.10 Renaming Rows
6.11 The tibble Approach to Row Names
6.12 Practice Time

7. TRANSFORMING VARIABLES
7.1 Prepare the Workspace
7.2 Using Classic Dollar Format
7.3 An Easier Way: mutate
7.4 mutate & transmute Details
7.5 Row-Specific Functions
7.6 The Base apply Function
7.7 apply Function Details
7.8 Many Variables, One Transformation
7.9 mutate_at Details
7.10 Table of Transformations
7.11 Practice Time

8. CONDITIONAL TRANSFORMATIONS
8.1 Prepare the Workspace
8.2 The ifelse Function
8.3 Recode Using ifelse
8.4 Recoding Many Variables with ifelse
8.5 The car::Recode Function
8.6 Recode Many Variables
8.7 Integers vs. Double Precision
8.8 Practice Time

9. SUMMARIZING VARIABLES
9.1 Prepare the Workspace
9.2 The “summarise” Function
9.3 summarise Details
9.4 Many R Functions Require Vectors
9.5 dplyr::summarise_at Function
9.6 summarise_at Details
9.7 Built-In Summary Functions
9.8 dplyr Summary Functions
9.9 dplyr Summary Combination Functions
9.10 dplyr Sequence Functions
9.11 dplyr Rank Functions
9.12 Comparison of mutate and summarise
9.13 Practice Time

10. GROUP-BY CALCULATIONS
10.1 Prepare the Workspace
10.2 The group_by Function
10.3 Printing Grouped Data
10.4 Review of mutate
10.5 mutate By Group
10.6 Summarisation By Group
10.7 summarise By Group
10.8 summarise_at By Group
10.9 Group By Next Level
10.10 Group By Next Level…Again!
10.11 Un-Grouping
10.12 Practice Time

11. GROUP-BY ANALYSIS WITH OUTPUT MANAGEMENT
11.1 Prepare the Workspace
11.2 R’s Built-in Approach
11.3 Recall How t.test Works
11.4 broom Package Cleans it Up
11.5 Simple Analysis with group_by
11.6 dplyr’s do Function
11.7 broom’s Functions
11.8 Model-Level Regression by Group
11.9 Coefficient-Level Regression by Group
11.10 Observation-Level Regression By Group
11.11 Advanced Features
11.12 Practice Time

12. SORTING DATA
12.1 Prepare the Workspace
12.2 R’s Various Ways to Sort
12.3 When Sorting is Needed in R
12.4 Data Not Sorted by Workshop
12.5 dplyr::arrange Sorts Data Frames
12.6 desc Does Descending Order
12.7 Sorting by Two Variables
12.8 R’s built-in sort Function
12.9 R’s order Function
12.10 Using order to Sort Data Frames
12.11 rev Function Reverses order
12.12 order by Two Variables
12.13 How Location Affects Sorting
12.14 Practice Time

13. SELECTING FIRST OR LAST OBSERVATION PER GROUP
13.1 Prepare the Workspace
13.2 When to Search for These Observations
13.3 When it’s Not Needed
13.4 dplyr’s slice Function
13.5 Finding Min/Max Observation Using Sorting
13.6 Finding Min/Max Observation Using filter
13.7 Finding Min/Max Observation Using Ranks
13.8 dplyr Ranking Functions
13.9 Practice Time

14. STACKING DATA SETS
14.1 Prepare the Workspace
14.2 Creating a Data Frame to Stack
14.3 Creating a 2nd Data Frame to Stack
14.4 Stacking with dplyr::bind_rows
14.5 R’s Built-in rbind
14.6 R’s Built-in union
14.7 Practice Time

15. FINDING AND REMOVING DUPLICATE OBSERVATIONS
15.1 Prepare the Workspace
15.2 Create Some Duplicates
15.3 Locating Duplicates
15.4 Generate Duplicate Report
15.5 Removing Duplicates
15.6 Checking Subsets of Variables
15.7 Practice Time

16. MERGING / JOINING DATA FRAMES
16.1 Prepare the Workspace
16.2 Creating a Data Frame to Join
16.3 Creating a 2nd Data Frame to Join
16.4 Join by Common Variables
16.5 Joining by Different Variables
16.6 Types of Joins
16.7 Practice Time

17. RESHAPING DATA FRAMES
17.1 Prepare the Workspace
17.2 Transposing Rows and Columns
17.3 Example Wide Data Structure
17.4 Advantages of Wide Data
17.5 The Long Data Structure
17.6 Advantages of Long Data
17.7 Reshaping Options in R
17.8 Gathering Wide to Long
17.9 Wide to Long Details
17.10 Spreading Long to Wide
17.11 Extracting Numeric Values
17.12 Practice Time

18. COMPARING OBJECTS
18.1 Prepare the Workspace
18.2 Comparing Vectors
18.3 Comparing Data Frames
18.4 Mixing Up a Data Frame
18.5 Three Ways to Compare
18.6 The compare Package
18.7 Visual Comparison
18.8 The compareDF Package
18.9 Practice Time

19. CHARACTER STRING MANIPULATIONS
19.1 Prepare the Workspace
19.2 The stringr Package
19.3 Regular Expression References
19.4 Generating Numeric Variable Names
19.5 Impact of Trailing Blanks
19.6 Trimming Blanks
19.7 Setting Case
19.8 Splitting at a Column
19.9 Splitting at a Blank
19.10 Extracting Vectors
19.11 Replacing Strings
19.12 Combining Strings
19.13 Finding: One Sub-string
19.14 Finding: Multiple Sub-strings
19.15 Finding: with Regular Expressions
19.16 Finding: with Table Lookups
19.17 The stringi Package
19.18 Practice Time

20. DATE & TIME MANIPULATIONS
20.1 Prepare the Workspace
20.2 Converting Strings to Dates
20.3 Subtracting Dates
20.4 The difftime Function
20.5 Converting Time Differences to Numeric
20.6 Measuring Time Until Today
20.7 Extracting Years, Weeks, Months
20.8 Extracting Days
20.9 Choosing Observations by Date
20.10 Dealing with 2-Digit Years
20.11 Date-Time References
20.12 Practice Time

21. USING SQL WITHIN R
21.1 Prepare the Workspace
21.2 The sqldf Package
21.3 Printing a Data Frame
21.4 Choosing and Sorting
21.5 Aggregating by Gender
21.6 Key Syntax Differences
21.7 Practice Time

22. CONCLUSION
22.1 Brief Review
22.2 Providing Feedback
22.3 Future Support
22.4 Question Time
