R | r4stats.com

Updated: Why R is Hard to Learn

I’ve updated one of my most widely read blog posts, Why R is Hard to Learn. It focuses on the aspects of R that tend to trip up beginners. The new version is over twice as long as the original, and it is now located under the Articles menu, making it easier to find. Of course, my new interactive workshop on DataCamp.com and my upcoming webinars with Revolution Analytics cover these trouble spots thoroughly.

Adding the SPSS MEAN.n Function to R

SPSS contains a very useful set of functions that R lacks. If you’re lucky enough to have access to SPSS, you can use SPSS and R very well together. If not, it’s easy to add these functions to R. The functions perform calculations across values within each observation. Rather than limit you to removing missing values or not, they let you specify how many valid values you want before setting the result to missing. For example, in SPSS, MEAN.5(Q1 TO Q10) asks for the mean only if at least five of the ten variables have valid values. Otherwise the result will be a missing value. This “.n” extension is also available for SPSS’ SUM, SD, VARIANCE, MIN and MAX functions.

Let’s now take a look at how to do this in R. First we’ll create some data with a different number of missing values for each observation.

> q1 <- c(1, 1, 1)
> q2 <- c(2, 2, NA)
> q3 <- c(3, NA, NA)
> df <- data.frame(q1, q2, q3)
> df
  q1 q2 q3
1  1  2  3
2  1  2 NA
3  1 NA NA

R already has a mean function, but it lacks a function to count the number of valid values. A common way to do this in R is to use the is.na() function to generate a vector of TRUE/FALSE values for missing or not, respectively, then sum them. As with many software packages, R views TRUE as having the value 1 and FALSE as having a value of 0, so this approach gets us the number of missing values. The “!” symbol means “not” in R, so summing !is.na() will find the number of non-missing values. Here’s a function that does this:

> nvalid <- function(x) sum(!is.na(x))
> nvalid(q2)
[1] 2
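
To see why summing works, it can help to look at the intermediate logical vector. This is only a minimal sketch that reuses q2 from above; the names flags, n_missing and n_valid are illustrative and not part of the original example.

```r
q2 <- c(2, 2, NA)

# is.na() flags each element: TRUE where the value is missing.
flags <- is.na(q2)

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE elements.
n_missing <- sum(flags)    # number of missing values
n_valid   <- sum(!flags)   # number of non-missing values
```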

So it has found that there are two valid values for q2. This nvalid() function obviously works on vectors, but we need to apply it to the rows of our data frame. We can select the first three variables using df[1:3] and then pass the result into as.matrix() to make the rows easily accessible by R’s apply() function. The apply() function’s second argument is 1, indicating that we would like to apply the function across rows (the value 2 would indicate columns). The remaining arguments are the function to apply and any arguments it needs.

> means  <- apply(as.matrix(df[1:3]), 1, mean, na.rm = TRUE)
> counts <- apply(as.matrix(df[1:3]), 1, nvalid)
> means
[1] 2.0 1.5 1.0
> counts
[1] 3 2 1
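
As an aside, base R also provides rowMeans() and rowSums(), which should give the same values for this data without calling apply(). The sketch below rebuilds the example data so it runs on its own; means2 and counts2 are illustrative names.

```r
q1 <- c(1, 1, 1)
q2 <- c(2, 2, NA)
q3 <- c(3, NA, NA)
df <- data.frame(q1, q2, q3)

# Row means that ignore missing values.
means2 <- rowMeans(df[1:3], na.rm = TRUE)

# is.na() on a data frame returns a logical matrix;
# rowSums() then counts the non-missing entries in each row.
counts2 <- rowSums(!is.na(df[1:3]))
```

The apply() approach remains the more general one, since it works with any function rather than just means and sums.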

We have our means and the counts of valid values, so all that remains is to choose our desired value of counts and accept the mean if the data have that value or greater, but return a missing value (NA) if not. This can be done using the ifelse() function, whose first argument is the logical condition, followed by the value desired when TRUE, then the value when FALSE.

> means <- ifelse(counts >= 2, means, NA)
> means
[1] 2.0 1.5 NA

We’ve seen all the parts work, so all that remains is to put them together into a single function that has two arguments, one for the data frame and one for the n required.

mean.n <- function(df, n) {
  m <- as.matrix(df)                                  # convert once, reuse below
  means  <- apply(m, 1, mean, na.rm = TRUE)           # row means, ignoring NAs
  nvalid <- apply(m, 1, function(x) sum(!is.na(x)))   # valid values per row
  ifelse(nvalid >= n, means, NA)                      # NA if fewer than n valid
}

Let’s test our function requiring 1, 2 and 3 valid values.

> df$mean1 <- mean.n(df[1:3], 1)
> df$mean2 <- mean.n(df[1:3], 2)
> df$mean3 <- mean.n(df[1:3], 3)
> df
  q1 q2 q3 mean1 mean2 mean3
1  1  2  3   2.0   2.0     2
2  1  2 NA   1.5   1.5    NA
3  1 NA NA   1.0    NA    NA

That looks good. You could apply this same idea to various other R functions such as sd() or var(). You could also apply it to sum() as SPSS does, but I rarely do that. If you were creating a scale score from a set of survey Likert items measuring agreement and a person replied “strongly agree” (a value of 5) to only half the items but skipped the others, would you want the resulting score to be a neutral value as the sum would imply, or “strongly agree” as the mean would indicate? The mean makes much more sense in most situations. Be careful, though, as there are standardized tests that require use of the sum.
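
One way to extend the idea to other functions is to pass the summary function in as an argument. The sketch below is only an illustration: stat.n() is a hypothetical helper name, not an SPSS or base R function, and it assumes the function you pass accepts an na.rm argument, as mean(), sum(), sd() and var() do. The likert data frame is made up to mirror the survey scenario described above.

```r
# Hypothetical helper: apply any row-wise summary function, returning NA
# for rows that have fewer than n valid (non-missing) values.
stat.n <- function(df, n, fun = mean) {
  m      <- as.matrix(df)
  nvalid <- rowSums(!is.na(m))               # valid values per row
  stats  <- apply(m, 1, fun, na.rm = TRUE)   # fun must accept na.rm
  ifelse(nvalid >= n, stats, NA)
}

# A made-up respondent who answers "strongly agree" (5) on two of four items.
likert <- data.frame(q1 = 5, q2 = 5, q3 = NA_real_, q4 = NA_real_)

sum2  <- stat.n(likert, 2, sum)    # sum of the two answered items
mean2 <- stat.n(likert, 2, mean)   # mean of the two answered items
sd3   <- stat.n(likert, 3, sd)     # fewer than 3 valid values, so NA
```

Here the sum of 10 across four items looks like a middling score, while the mean of 5 still reads as “strongly agree”.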

If you’re an SPSS user looking to learn just enough R to use the two together, you might want to read this, or to learn more you could take one of my workshops. If you really want to dive into the details, you might consider reading my book, R for SAS and SPSS Users.

R Workshops Updated to Include the Latest Packages

R workshop series presented at a major pharmaceutical company. Photography by Stephen Bernard.

Two new R packages are quickly becoming standards in the R community: Hadley Wickham’s dplyr and tidyr. The dplyr package almost completely replaces his popular plyr package for data manipulation. Most importantly for general R use, it makes it much easier to select variables. For example, if your data included variables for race, gender, pretest, posttest, and four survey items q1 through q4, you could select various sets of variables using:

library("dplyr")
select(mydata, race, gender)        # Just those two variables.
select(mydata, gender:posttest)     # From gender through posttest.
select(mydata, contains("test"))    # Gets pretest & posttest.
select(mydata, starts_with("q"))    # Gets all vars starting with "q".
select(mydata, ends_with("test"))   # All vars ending with "test".
select(mydata, num_range("q", 1:4)) # q1 thru q4 regardless of location.
select(mydata, matches("^q"))       # Vars matching a regular expression.

As I show in my books, these were all possible in R before, but they required much more programming.
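
For comparison, here is a rough sketch of how similar selections might look in base R alone. The mydata data frame is invented for illustration, and the result names (two_vars, by_range and so on) are hypothetical. These are not exact equivalents in every respect; for example, the pattern matching here is case-sensitive.

```r
# Hypothetical data frame with the variables described above.
mydata <- data.frame(race = "a", gender = "f", pretest = 1, posttest = 2,
                     q1 = 1, q2 = 2, q3 = 3, q4 = 4)
vars <- names(mydata)

# Just those two variables.
two_vars <- mydata[c("race", "gender")]

# From gender through posttest, by looking up their column positions.
by_range <- mydata[match("gender", vars):match("posttest", vars)]

# Names containing, starting with, or ending with a string.
has_test  <- mydata[grepl("test", vars, fixed = TRUE)]
starts_q  <- mydata[startsWith(vars, "q")]
ends_test <- mydata[endsWith(vars, "test")]

# q1 through q4 by constructing the names.
q1_to_q4 <- mydata[paste0("q", 1:4)]

# Names matching a regular expression.
regex_match <- mydata[grepl("^q", vars)]
```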

The tidyr package replaces Hadley’s popular reshape and reshape2 packages with a data reshaping approach that is simpler and more focused just on the reshaping process, especially converting from “wide” to “long” form and back.
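
If the terms “wide” and “long” are unfamiliar, the sketch below shows the two shapes. Note that it uses base R’s reshape() function rather than tidyr, purely so that it runs without installing anything; tidyr’s own functions are considerably easier to read. The data and the names wide, long and wide2 are made up for illustration.

```r
# Hypothetical wide data: one row per person, one column per survey item.
wide <- data.frame(id = 1:2, q1 = c(1, 2), q2 = c(3, 4))

# Wide to long: one row per person-item combination.
long <- reshape(wide, direction = "long",
                idvar = "id", varying = c("q1", "q2"),
                v.names = "score", timevar = "item",
                times = c("q1", "q2"))

# Long back to wide: one column per item again.
wide2 <- reshape(long, direction = "wide",
                 idvar = "id", timevar = "item", v.names = "score")
```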

I’ve integrated dplyr into my workshop R for SAS, SPSS and Stata Users, and both tidyr and dplyr now play extensive roles in my Managing Data with R workshop. I’m presenting the next Virtual Instructor-led Classroom (webinar) version of those workshops in partnership with Revolution Analytics during the week of October 6, 2014. I’m also available to teach them at your organization’s site in partnership with RStudio.com (contact me at Muenchen.bob@gmail.com to schedule a visit). These workshops will also soon be available 24/7 at Datacamp.com. “You’ll be able to take Bob’s popular workshops using an interactive combination of video and live exercises in the comfort of your own browser,” said Jonathan Cornelissen, CEO of Datacamp.com.

R Passes SPSS in Scholarly Use, Stata Growing Rapidly

by Robert A. Muenchen

[Since this was originally published in 2014, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]

Here is my latest update to The Popularity of Data Analysis Software. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are:

As I covered in my talk at the UseR 2014 meeting, it is very likely that during the summer of 2014, R became the most widely used analytics software for scholarly articles, ending a spectacular 16-year run by SPSS.
Stata has probably passed Statistica in scholarly use, and its rapid rate of growth parallels that of R.

If you’d like to be alerted to future updates on this topic, you can follow me on Twitter, @BobMuenchen.

Scholarly Articles

The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a good leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, I re-collect the data for all years following the protocol described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/.

Figure 2a shows the number of articles found for each software package for all the years that Google Scholar can search. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. Note that the general purpose software MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest. Neither C nor C++ is included here because it’s very difficult to focus the search compared to the search for jobs above, whose job descriptions commonly include a clear target of skills in “C/C++” and “C or C++”.

From RapidMiner on down, the counts appear to be zero. That’s not the case, but relative to the others, it might as well be.

Figure 2a. Number of scholarly articles found for each software.

Figure 2b shows the number of articles for the most popular six classic statistics packages from 1995 through 2013 (the last complete year of data when this graph was made). As in Figure 2a, SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.

Figure 2b. Number of scholarly articles found for the top five classic statistics packages.

Since SAS and SPSS dominate the vertical space in Figure 2a by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, in Figure 2c. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. In fact, extending the downward trend of SPSS and the upward trend of R makes it likely that sometime during the summer of 2014 R became the most dominant package for analytics used in scholarly publications. Due to the lag caused by the publication process, getting articles online, indexing them, etc., we won’t be able to verify that this has happened until well into 2015 (correction: this said 2014 when originally posted).

After R, Statistica is in fourth place and growing, but at a much lower rate. Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”

Extrapolating from the trend lines, it is likely that the use of Stata among academics passed that of Statistica fairly early in 2014. The remaining three packages, Minitab, Systat and JMP are all growing but at a much lower rate than either R or Stata.

Figure 2c. Number of scholarly articles that reference each software by year, after removing the top two, SPSS and SAS, and adding the next two most popular, Systat and JMP.

Knoxville R Users’ Group Meets September 3rd

The Knoxville R Users Group (KRUG) is hosting a brown bag viewing of RStudio’s webinar “Interactive Reporting” at 11am, Weds 3-Sept-2014 in 427 Hesler on the UTK campus “Hill”. Per RStudio.net, data scientist Garrett Grolemund and software engineer Joe Cheng will speak on how to make your R Markdown documents interactive, and then unleash the full flexibility of analytic app development with shiny. Come join us!

jpmml and R (Free Webinar)

This free, global webinar will provide an introduction to jpmml, the world’s leading open-source PMML scoring engine currently being utilized by companies such as Airbnb to rapidly deploy predictive models into production.

Webinar Format:
– What is PMML?
– Building a predictive model in R and exporting it to PMML format
– Deploying a PMML model into a cloud-based Openscoring service
– Scoring Google Spreadsheet data
– Scoring PostgreSQL data
– Scoring Hadoop data
– Q&A

Speaker:
– Villu Ruusmann, Creator of jpmml and Founder of Openscoring.io

This event is brought to you by The Orange County R User Group.

Registration:
https://www3.gotomeeting.com/register/695356990

The RSelenium R Package (Free Webinar)

The Orange County R User Group (OC-RUG) will soon host a free webinar on the new “RSelenium” R Package which provides a set of R bindings for the Selenium 2.0 webdriver using the JsonWireProtocol. Using RSelenium, you can automate web browsers locally or remotely in order to test various web apps, such as Shiny applications.

Webinar Format:
– Introduction to the RSelenium R package
– Live Demonstration
– Question and Answer period

Date: May 21, 2014 at 10 am Pacific (California) time
Speaker:
John Harrison, RSelenium package author/maintainer

For more information on the RSelenium package, please visit this site:

http://cran.r-project.org/web/packages/RSelenium

Please note that in addition to attending from your laptop or desktop computer, you can also attend from a Wi-Fi connected iPhone, iPad, Android phone or Android tablet by installing the GoToMeeting App.

Registration is below:

https://www3.gotomeeting.com/register/724626654

JMbayes R package (webinar)

A free webinar will provide an introduction to the “JMbayes” R package which provides methods for Joint Modeling of Longitudinal and Time-to-Event Data under a Bayesian Approach.

Webinar Format:

– Introduction to Joint Models and the JMbayes R package
– Live demonstration
– Question and Answer period

Speaker:

– Dimitris Rizopoulos, JMbayes Package Maintainer

For more information on the JMbayes package, please visit this site:

http://cran.r-project.org/package=JMbayes

Registration:

https://www3.gotomeeting.com/register/187219462

This event is brought to you by The Orange County R User Group.

R Continues Its Rapid Growth

I’ve just updated the section below from The Popularity of Data Analysis Software. Note that the overall article is still under construction and all the figure numbers have changed from previous versions.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.

Figure 8 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.998). The right-most point is for version 3.0.2, the last version released in 2013.
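
For readers curious how such a fit is obtained, here is a minimal sketch of a quadratic fit with lm(). The data are synthetic and invented purely for illustration; they are not the CRAN package counts behind Figure 8, and the variable names are hypothetical.

```r
# Synthetic data for illustration only; these are NOT the CRAN package counts.
release <- 1:10
noise   <- rep(c(1, -1), times = 5)
pkgs    <- 5 + 2 * release + 3 * release^2 + noise

# Quadratic (parabolic) fit: I() protects the squared term inside the formula.
fit <- lm(pkgs ~ release + I(release^2))

# R-squared: the proportion of variance explained by the fitted curve.
r_squared <- summary(fit)$r.squared

# Estimated coefficient of the squared term, which controls the curvature.
curvature <- coef(fit)[["I(release^2)"]]
```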

Figure 8. Number of R packages plotted for each major release of R.

As rapid as this growth has been, these data represent only the main CRAN repository. R does have eight other software repositories, such as the one at http://www.bioconductor.org/, that are not included in this graph. A program run on 4/7/2014 counted 7,364 R packages at all major repositories, 5,323 of which were at CRAN. So the growth curve for the software at all repositories would be roughly 38% higher on the y-axis than the one shown in Figure 8. As with any analysis software, individuals also maintain their own separate collections, typically available on their web sites.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contains around 1,200 commands that are roughly equivalent to R functions (procs, functions etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). R packages contain a median of 5 functions (Rasmus Bååth, 12/2012 personal communication). Therefore R has approximately 36,820 functions compared to SAS’s 1,200. In fact, during 2013 alone, R added more functions/procs than SAS Institute has written in its entire history! That’s 835 packages, counting only CRAN, or around 4,175 functions. Of course these are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do. However, R functions can nest inside one another, creating nearly infinite combinations. Also, SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would be happy to list it here. While the comparison is not perfect, it does provide an interesting perspective on the size and growth rate of R.
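
The arithmetic behind these estimates can be reproduced in a few lines of R. The figures are the ones quoted in this section; the variable names are only illustrative, and multiplying a package count by the median number of functions is a rough approximation rather than an exact tally.

```r
all_repos    <- 7364   # packages at all major repositories (4/7/2014)
cran_only    <- 5323   # packages at CRAN
median_funcs <- 5      # median number of functions per package
cran_2013    <- 835    # CRAN packages added during 2013

# How much higher the all-repository curve sits relative to CRAN alone.
pct_higher <- (all_repos / cran_only - 1) * 100

# Rough function counts: packages times median functions per package.
total_functions <- all_repos * median_funcs
added_2013      <- cran_2013 * median_funcs
```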

Learn to Manage Data at useR! 2014 or Online April 25

Before you can analyze data, it must be in the right form. Join me on April 25th for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr, lubridate and sqldf. I’ll also be presenting a 3-hour version at the UseR! 2014 conference. Here’s a list of the topics covered:

Transformation basics
Conditional transformations
Summarization of columns and rows
Summarization by group
Analysis by group
Sorting data
Selecting first or last observation per group
Miscellaneous variable tools (rename, keep, drop)
Stacking data frames
Finding and removing duplicate observations
Merging data frames
Reshaping data frames
Character string manipulations
Date / time manipulations (not in shorter useR! presentation)
Using SQL within R (not in shorter useR! presentation)

Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example which you have already seen.

At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!