Before you can analyze data, it must be in the right form. Getting it into that form is often where we spend most of our time. This two-day workshop shows how to perform the most commonly used data management tasks in R. We will cover how to use R’s most popular add-on packages (dplyr, stringr, lubridate, tidyr, broom, compare, sqldf, etc.) and compare them to R’s older built-in functions.

Most of our time will be spent working through examples that you may run simultaneously on your computer. You will see both the instructor’s screen and yours, side-by-side, as we run the examples and discuss the output. However, the handouts include each step and its output, so feel free to skip the computing; it’s easy to just relax and take notes.

Most of the examples come from the highly regarded books by the instructor, *R for SAS and SPSS Users* and *R for Stata Users*. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example that you have already seen.

The workshops are available on-site or via webinar.

The on-site workshops are the most thorough since direct face-to-face interaction is the most flexible. The instructor presents a topic for around twenty minutes. Then we switch to exercises, which are already open in another tabbed window. The exercises contain hints that show the general structure of the solution, which you adapt to reach the final answer. The complete solutions are in a third tabbed window, so if you get stuck the answers are a click away. There is plenty of time to handle in-depth questions on any of the topics covered, and the discussion often veers off into a broad range of interesting areas. The usual schedule for an on-site workshop is here.

The webinar version is particularly easy to work into a busy schedule. It’s offered in two half-day sessions with a day or two skipped in between to give participants a chance to do the exercises on their own and catch up on other work. There is time for questions on the lecture topics (live) and the exercises (via email). The lecture is recorded and available for review for 30 days.

For further details or to arrange a site visit, contact the instructor, Bob Muenchen, at muenchen.bob@gmail.com.

**Prerequisites**

Attendees should know basic R programming, including how to read data files and call functions.

**Learning Outcomes**

When finished, participants will be able to prepare most data sets for analysis.

**Presenter**

Robert A. Muenchen is the author of *R for SAS and SPSS Users* and, with Joseph M. Hilbe, *R for Stata Users*. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an ASA Accredited Professional Statistician™ with 33 years of experience and is currently the manager of OIT Research Computing Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has taught workshops on research computing topics for more than 500 organizations and currently offers training in partnership with DataCamp.com, Revolution Analytics, RStudio, New Horizons Computer Learning Centers, and Xerox Learning Services. Bob has written or coauthored over 70 articles published in scientific journals and conference proceedings, and has provided guidance on more than 1,000 graduate theses and dissertations.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, Intuitics, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analytics, and data mining.

**Computer Requirements**

On-site training is best done in a computer lab with a projector and, for large rooms, a PA system. The interactive video version requires only a web browser and an Internet connection fast enough to display video.

Course programs, data, and exercises will be sent a week before the workshop. The instructions include installing R, which you can download for free here: http://www.r-project.org/. We will also use RStudio, which you can download for free here: http://RStudio.com. If you already know a different R editor, that’s fine too.

**Course Outline**

1. INTRODUCTION

1.1 Topics

1.2 Preparing Your Computer

1.3 Note to System Administrators

2. INTRODUCTION TO THE TIDYVERSE

2.1 Tidyverse Packages

2.2 Tibble Creation

2.3 Tibbles Improve Printing

2.4 Other Tibble Advantages

2.5 Tibble Disadvantages

2.6 Tibble Conversions

2.7 The dplyr Package’s Verbs

2.8 dplyr Input & Output

2.9 Practice Time

3. CHOOSING VARIABLES AND OBSERVATIONS

3.1 Using Subscripts

3.2 Using dplyr Functions

3.3 Variations on select

3.4 Dropping Variables

3.5 Table of Logical Comparisons

3.6 Practice Time

4. COMBINING PROGRAMMING STEPS

4.1 Nesting Only

4.2 Piping

4.3 Saving Results for Re-use

4.4 Piping Details

4.5 Piping to a Specific Argument

4.6 Think About Your Steps

4.7 Summary

4.8 Practice Time

5. COPYING & DELETING OBJECTS

5.1 Copying Objects

5.2 Copying Variables

5.3 Removing/Dropping/Deleting Variables

5.4 Removing/Dropping/Deleting Entire Objects

5.5 Practice Time

6. RENAMING DATA SETS, VARIABLES, & ROWS

6.1 Renaming Objects

6.2 Renaming Big Objects

6.3 Renaming Variables with dplyr

6.4 Renaming All Variables Using “names”

6.5 Copying Names From Another Data Frame

6.6 Renaming a Block of Names, Step 1

6.7 Renaming a Block of Names, Step 2

6.8 Renaming Thousands of Variables

6.9 Choosing Best Variable Renaming Method

6.10 Renaming Rows

6.11 The tibble Approach to Row Names

6.12 Practice Time

7. TRANSFORMING VARIABLES

7.1 Prepare the Workspace

7.2 Using Classic Dollar Format

7.3 An Easier Way: mutate

7.4 mutate & transmute Details

7.5 Row-Specific Functions

7.6 The Base apply Function

7.7 apply Function Details

7.8 Many Variables, One Transformation

7.9 mutate_at Details

7.10 Table of Transformations

7.11 Practice Time

8. CONDITIONAL TRANSFORMATIONS

8.1 Prepare the Workspace

8.2 The ifelse Function

8.3 Recode Using ifelse

8.4 Recoding Many Variables with ifelse

8.5 The car::Recode Function

8.6 Recode Many Variables

8.7 Integers vs. Double Precision

8.8 Practice Time

9. SUMMARIZING VARIABLES

9.1 Prepare the Workspace

9.2 The “summarise” Function

9.3 summarise Details

9.4 Many R Functions Require Vectors

9.5 dplyr::summarise_at Function

9.6 summarise_at Details

9.7 Built-In Summary Functions

9.8 dplyr Summary Functions

9.9 dplyr Summary Combination Functions

9.10 dplyr Sequence Functions

9.11 dplyr Rank Functions

9.12 Comparison of mutate and summarise

9.13 Practice Time

10. GROUP-BY CALCULATIONS

10.1 Prepare the Workspace

10.2 The group_by Function

10.3 Printing Grouped Data

10.4 Review of mutate

10.5 mutate By Group

10.6 Summarisation By Group

10.7 summarise By Group

10.8 summarise_at By Group

10.9 Group By Next Level

10.10 Group By Next Level…Again!

10.11 Un-Grouping

10.12 Practice Time

11. GROUP-BY ANALYSIS WITH OUTPUT MANAGEMENT

11.1 Prepare the Workspace

11.2 R’s Built-in Approach

11.3 Recall How t.test Works

11.4 broom Package Cleans it Up

11.5 Simple Analysis with group_by

11.6 dplyr’s do Function

11.7 broom’s Functions

11.8 Model-Level Regression by Group

11.9 Coefficient-Level Regression by Group

11.10 Observation-Level Regression By Group

11.11 Advanced Features

11.12 Practice Time

12. SORTING DATA

12.1 Prepare the Workspace

12.2 R’s Various Ways to Sort

12.3 When Sorting is Needed in R

12.4 Data Not Sorted by Workshop

12.5 dplyr::arrange Sorts Data Frames

12.6 desc Does Descending Order

12.7 Sorting by Two Variables

12.8 R’s built-in sort Function

12.9 R’s order Function

12.10 Using order to Sort Data Frames

12.11 rev Function Reverses order

12.12 order by Two Variables

12.13 How Location Affects Sorting

12.14 Practice Time

13. SELECTING FIRST OR LAST OBSERVATION PER GROUP

13.1 Prepare the Workspace

13.2 When to Search for These Observations

13.3 When it’s Not Needed

13.4 dplyr’s slice Function

13.5 Finding Min/Max Observation Using Sorting

13.6 Finding Min/Max Observation Using filter

13.7 Finding Min/Max Observation Using Ranks

13.8 dplyr Ranking Functions

13.9 Practice Time

14. STACKING DATA SETS

14.1 Prepare the Workspace

14.2 Creating a Data Frame to Stack

14.3 Creating a 2nd Data Frame to Stack

14.4 Stacking with dplyr::bind_rows

14.5 R’s Built-in rbind

14.6 R’s Built-in union

14.7 Practice Time

15. FINDING AND REMOVING DUPLICATE OBSERVATIONS

15.1 Prepare the Workspace

15.2 Create Some Duplicates

15.3 Locating Duplicates

15.4 Generate Duplicate Report

15.5 Removing Duplicates

15.6 Checking Subsets of Variables

15.7 Practice Time

16. MERGING / JOINING DATA FRAMES

16.1 Prepare the Workspace

16.2 Creating a Data Frame to Join

16.3 Creating a 2nd Data Frame to Join

16.4 Join by Common Variables

16.5 Joining by Different Variables

16.6 Types of Joins

16.7 Practice Time

17. RESHAPING DATA FRAMES

17.1 Prepare the Workspace

17.2 Transposing Rows and Columns

17.3 Example Wide Data Structure

17.4 Advantages of Wide Data

17.5 The Long Data Structure

17.6 Advantages of Long Data

17.7 Reshaping Options in R

17.8 Gathering Wide to Long

17.9 Wide to Long Details

17.10 Spreading Long to Wide

17.11 Extracting Numeric Values

17.12 Practice Time

18. COMPARING OBJECTS

18.1 Prepare the Workspace

18.2 Comparing Vectors

18.3 Comparing Data Frames

18.4 Mixing Up a Data Frame

18.5 Three Ways to Compare

18.6 The compare Package

18.7 Visual Comparison

18.8 The compareDF Package

18.9 Practice Time

19. CHARACTER STRING MANIPULATIONS

19.1 Prepare the Workspace

19.2 The stringr Package

19.3 Regular Expression References

19.4 Generating Numeric Variable Names

19.5 Impact of Trailing Blanks

19.6 Trimming Blanks

19.7 Setting Case

19.8 Splitting at a Column

19.9 Splitting at a Blank

19.10 Extracting Vectors

19.11 Replacing Strings

19.12 Combining Strings

19.13 Finding: One Sub-string

19.14 Finding: Multiple Sub-strings

19.15 Finding: with Regular Expressions

19.16 Finding: with Table Lookups

19.17 The stringi Package

19.18 Practice Time

20. DATE & TIME MANIPULATIONS

20.1 Prepare the Workspace

20.2 Converting Strings to Dates

20.3 Subtracting Dates

20.4 The difftime Function

20.5 Converting Time Differences to Numeric

20.6 Measuring Time Until Today

20.7 Extracting Years, Weeks, Months

20.8 Extracting Days

20.9 Choosing Observations by Date

20.10 Dealing with 2-Digit Years

20.11 Date-Time References

20.12 Practice Time

21. USING SQL WITHIN R

21.1 Prepare the Workspace

21.2 The sqldf Package

21.3 Printing a Data Frame

21.4 Choosing and Sorting

21.5 Aggregating by Gender

21.6 Key Syntax Differences

21.7 Practice Time

22. CONCLUSION

22.1 Brief Review

22.2 Providing Feedback

22.3 Future Support

22.4 Question Time
