Text Analysis with R

I enjoyed teaching R workshops for many years, but I have now retired from teaching them. I leave the workshop pages up so readers can see what the workshops covered.

The R language offers many ways to analyze text data, including text mining, content analysis, topic modeling, semantic analysis, and analysis of style (stylometry), even forensic analysis to determine authorship. This full-day workshop covers those topics using three popular approaches.

[Photo by Steve Chastain, http://www.stevechastainphotography.com/]

Dictionary-based content analysis is the simplest approach: you develop a “dictionary” of words and phrases that define various topics or concepts, and have the computer “code” your documents to uncover the topics they contain. In the end, you know the number or percentage of documents that included each topic. This simple approach can succeed where more advanced methods fail, such as with smaller sets of documents or short-answer survey items that lack the co-occurrence of terms those methods rely on. It also excels at preparing data for more complex analyses.
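To make the idea concrete, here is a minimal sketch of dictionary-based coding using the quanteda package that the course outline below relies on. The example texts and the two dictionary categories (“justice” and “finance”) are invented purely for illustration; a real dictionary would be developed iteratively from your own documents.

  # Dictionary-based coding with quanteda (illustrative example only)
  library(quanteda)

  docs <- c("The judge instructed the jury before the verdict.",
            "Our budget covers salaries, taxes, and new equipment.",
            "The lawyer argued the case in court for two days.")
  corp <- corpus(docs)

  # Each named element is a topic; its values are the words or
  # wildcard patterns that indicate it.
  dict <- dictionary(list(
    justice = c("judge", "jury", "court", "lawyer", "verdict"),
    finance = c("budget", "salar*", "tax*", "equipment")
  ))

  # Tokenize, look up dictionary entries, and count hits per document.
  toks  <- tokens(corp, remove_punct = TRUE)
  dfmat <- dfm(tokens_lookup(toks, dictionary = dict))

  # Proportion of documents mentioning each topic at least once.
  colMeans(as.matrix(dfmat) > 0)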

Latent Semantic Analysis (LSA) identifies the topics in a set of documents by automatically detecting sets of words that define each topic. For example, if it sees the words judge, jury, court, and lawyer frequently appearing near one another, it will create a numeric variable that measures the amount of that topic in each document. Seeing that combination, you might name the new variable “justice.” It essentially applies the method of factor analysis to text data.
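A minimal sketch of that idea in code, using quanteda to build the matrix and the lsa package to create the semantic space (the same conversion and weighting steps listed in the outline below). The tiny set of documents is invented, so the result is only a toy; it simply shows the mechanics.

  # Latent Semantic Analysis with the lsa package (toy example)
  library(quanteda)
  library(lsa)

  docs <- c("judge jury court lawyer verdict trial",
            "budget tax salary revenue audit",
            "the lawyer argued the appeal before the judge and jury",
            "the audit found errors in the salary budget")
  dfmat <- dfm(tokens(corpus(docs), remove_punct = TRUE))

  # The lsa package expects a term-document matrix, so transpose the
  # document-feature matrix via quanteda's converter.
  tdm <- convert(dfmat, to = "lsa")

  # Apply local (log term frequency) and global (IDF) weights.
  tdm_w <- lw_logtf(tdm) * gw_idf(tdm)

  # Build the LSA space; dimcalc_share() picks the number of dimensions.
  space <- lsa(tdm_w, dims = dimcalc_share())

  # Document scores on each latent dimension ("topic").
  round(space$dk %*% diag(space$sk, nrow = length(space$sk)), 2)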

Latent Dirichlet Allocation (LDA) results in a similar set of numeric measures for each topic, but in this case the numeric values are probabilities that each document contains the topic.
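A minimal sketch of LDA using the topicmodels package, again matching the steps in the outline below. The documents, the choice of k = 2 topics, and the random seed are all arbitrary here.

  # Latent Dirichlet Allocation with topicmodels (toy example)
  library(quanteda)
  library(topicmodels)

  docs <- c("judge jury court lawyer verdict trial appeal",
            "budget tax salary revenue audit invoice",
            "the lawyer argued the appeal before the court",
            "the audit questioned the travel budget and salaries")
  dfmat <- dfm(tokens(corpus(docs), remove_punct = TRUE))

  # topicmodels needs a document-term matrix in the tm package's format.
  dtm <- convert(dfmat, to = "topicmodels")

  # Fit a two-topic model; setting a seed makes the result reproducible.
  fit <- LDA(dtm, k = 2, control = list(seed = 1234))

  posterior(fit)$topics   # per-document topic probabilities
  terms(fit, 5)           # top five terms in each topic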

Most of our time will be spent working through examples that you may run simultaneously on your computer. You will see both the instructor’s screen and yours, as we run the examples and discuss the output. However, the handouts include each step and its output, so feel free to skip the computing; it’s easy to just relax and take notes. The slides and programming steps are numbered so you can easily switch from computing to slides and back again.

This workshop is available at your organization’s site, or via webinars.

The on-site version is the most engaging by far, generating much discussion and occasionally veering off briefly to cover topics specific to a particular organization. The instructor presents a topic for around twenty minutes. Then we switch to exercises, which are already open in another tabbed window. The exercises contain hints that show the general structure of the solution; you adapt those hints to get the final solution. The complete solutions are in a third tabbed window, so if you get stuck the answers are a click away. The typical schedule for training on site is located here.

A webinar version is also available. This approach saves travel expenses and is especially useful for organizations with branch offices. It’s offered as two half-day sessions, often with a day or two skipped in between to give participants a chance to do the exercises and catch up on other work. There is time for questions on the lecture topics (live) and the exercises (via email). However, webinar participants are typically much less engaged, and far less discussion takes place.

For further details or to arrange a webinar or site visit, contact the instructor, Bob Muenchen, at muenchen.bob@gmail.com.

Prerequisites

This workshop assumes a basic knowledge of R. Introductory knowledge of statistics is helpful, but not required.

Learning Outcomes

When finished, participants will be able to use R to import documents in a variety of formats and analyze them with regard to topics or style.

Presenter

Robert A. Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an ASA Accredited Professional Statistician™ with 35 years of experience and is currently the manager of OIT Research Computing Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has taught workshops on research computing topics for more than 500 organizations and has offered training in partnership with the American Statistical Association, DataCamp.com, New Horizons Computer Learning Centers, Revolution Analytics, RStudio and Xerox Learning Services. Bob has written or coauthored over 70 articles published in scientific journals and conference proceedings, and has provided guidance on more than 1,000 graduate theses and dissertations.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, Intuitics, the Statistical Graphics Corporation and PC Week Magazine (now eWeek). His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analytics, and data mining.

Computer Requirements

On-site training is best done in a computer lab with a projector and, for large rooms, a PA system. The webinar version is delivered to your computer using Zoom (or a similar webinar system if your organization has a preference).

Course programs, data, and exercises will be sent to you a week before the workshop. The instructions include installing R, which you can download for free here: http://www.r-project.org/. We will also use RStudio, which you can download for free here: http://RStudio.com. If you already know a different R editor, that’s fine too.
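If you would like to get a head start, a minimal sketch of installing the add-on packages used in the examples above and in the outline below (assuming those are the packages we use) is:

  # Install the workshop's add-on packages (run once per machine).
  install.packages(c("quanteda", "readtext", "lsa",
                     "topicmodels", "tidytext", "ggplot2"))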

Course Outline

  1. Basic text analysis concepts
    1. How to define a document
    2. Corpus details
    3. Metadata details
    4. Terms / Tokens / Features
    5. Vocabulary lists
    6. Stop lists
    7. High / low frequency terms & their problems
    8. Term frequency-inverse document frequency (TF/IDF)
    9. Bag of words
    10. N-grams
    11. Topics: detecting, coding, scoring
    12. Style & forensic analysis
  2. Dictionary-based Content Analysis
    1. The algorithm, pros, cons
    2. Overview of the quanteda and readtext packages
    3. Creating, viewing, and summarizing a corpus
    4. Keywords in Context (KWIC)
    5. Tokenizing words, sentences, punctuation, symbols, etc.
    6. Document-feature matrices
    7. Finding popular terms
    8. Taking advantage of text diversity measures
    9. Discovering useful phrases (n-grams)
    10. Creating & applying a phrase dictionary
    11. Advantages & dangers of using a stop list
    12. Creating and applying a thesaurus
    13. Finding words that differentiate documents (TF/IDF)
    14. Advantages & dangers of stemming & lemmatization
    15. Creating & applying a topic dictionary
    16. Adding scores back to main data set for mixed-methods analyses
    17. Finding & studying documents with zero topics
    18. Studying topic pairs for potential combinations
    19. Studying single topics for potential splits
    20. What we can learn from “big talkers”
    21. Summarizing topics using tidytext
    22. Visualizing topics using ggplot2
    23. Visualizing topics using word clouds
    24. Extracting and studying document subsets
  3. Latent Semantic Analysis (LSA)
    1. Overview of the algorithm (detailed math optional)
    2. Pros & cons of this approach
    3. Converting quanteda’s document-feature matrices into the term-document matrices needed by the lsa package
    4. Applying local and global weights (TF/IDF, Entropy)
    5. Creating the maximum LSA space
    6. Plotting scores to estimate number of topics
    7. Creating a reduced LSA space
    8. Interpreting the topics (or factors)
    9. Adding factor scores to original data for mixed-methods analyses
    10. Scoring a new set of documents (careful!)
  4. Latent Dirichlet Allocation (detailed math optional)
    1. The algorithm, pros, cons
    2. Converting quanteda’s document-feature matrix into the document-term matrix needed by the topicmodels package
    3. Performing the analysis
    4. Finding top words for each topic using tidytext
    5. Visualizing the topics using ggplot2
    6. Combining scores with original data for mixed-methods analyses
  5. Analyzing the style of writing (stylometry)
    1. Developing a style guide
    2. Repeating the above analyses based on style
    3. Determining authorship
  6. Using Standard Dictionaries
    1. Importing popular formats such as WordStat
    2. Sentiment analysis – how happy were these people?
    3. The Lie Scale – are they telling the truth?
    4. Psychological scales for depression, etc.
  7. Comparison with commercial packages
    1. WordStat
    2. SAS Text Miner
    3. SPSS Text Analytics for Surveys
  8. Summary of topics learned

Here is a slide show of previous workshops.