So what happened? We’re looking back across many years, so while it’s possible that SPSS suddenly became much more popular in 2014, that could not account for lifting the whole trend line. It’s possible Google Scholar improved its algorithm to find articles that existed previously. It’s also possible that new journal archives have opened themselves up to being indexed by Google. Why would it affect SPSS more than SAS or R? SPSS is menu-driven so it’s easy to install with its menus and dialog boxes translated into many languages. Since SAS and R are much more frequently used via their English-based languages, they may not be as popular in non-English speaking countries. Therefore, one might see a disproportionate impact on SPSS by new non-English archives becoming available. If you have an alternate hypothesis, please leave it in the comments below.

The remainder of this post is the complete updated section on this topic from The Popularity of Data Analysis Software:

**Scholarly Articles**

The more popular a software package is, the more likely it will appear in scholarly publications as a topic or as a tool of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.

Figure 2a shows the number of articles found for each software package for the most recent complete year, 2014. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. The software from Java through Statgraphics shows a slow decline in usage from highest to lowest. Note that the general purpose software C, C++, C#, MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest.

From RapidMiner on down, the counts appear to be zero. That’s not the case; the counts are just very low compared to the more popular packages, which are used in tens of thousands of articles. Figure 2b shows only those packages that have fewer than 825 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. RapidMiner, KNIME, SPSS Modeler and SAS Enterprise Miner are packages that all use the powerful and easy-to-use workflow interface, but their use has not yet caught on among scholars. BMDP is one of the oldest packages in existence. Its use has been declining for many years, but it’s still hanging in there. The software in the bottom half of this figure contains the newcomers, with the notable exception of Megaputer, whose Polyanalyst software has been around for many years now.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2c I’ve plotted the same scholarly-use data for 1995 through 2014, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown. This is likely due to the fact that those two leaders faced increasing competition from many more software packages than can be shown in this type of graph (such as those shown in Figure 2a).

Since SAS and SPSS dominate the vertical space in Figure 2c by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, with the result shown in Figure 2d. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. If the current trends continue, the use of R may pass that of SPSS and SAS by the end of 2016. Note that current trends have shifted before as discussed here.

Stata has moved into fourth place, crossing above Statistica in 2014. The growth in the use of Stata is more rapid than all the classic statistics packages except for R. Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another. [Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”]

I’ll announce future updates on Twitter, where you can follow me as @BobMuenchen.

]]>

**Survey Link: www.rexeranalytics.com/Data-Miner-Survey-2015-Intro.html**

**Access Code: R8M4E2**

Survey results will be unveiled at the Fall-2015 Boston Predictive Analytics World event.

Rexer Analytics has been conducting the Data Miner Survey since 2007. Each survey explores the analytic behaviors, views and preferences of data miners and analytic professionals. Over 1,200 people from around the globe participated in the 2013 survey. Summary reports (40 page PDFs) from previous surveys are available FREE to everyone who requests them by emailing DataMinerSurvey@RexerAnalytics.com. Also, highlights of earlier Data Miner Surveys are available at www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html, including best practices shared by respondents on analytic success measurement, overcoming data mining challenges, and other topics. The FREE Summary Report for this 2015 Data Miner Survey will be available to everyone Fall-2015.

Please help spread the word.

*Rexer Analytics is a consulting firm focused on providing data mining and analytic CRM solutions. Recent solutions include customer loyalty analyses, customer segmentation, predictive modeling to predict customer attrition and to target direct marketing, fraud detection, sales forecasting, market basket analyses, and complex survey research. More information is available at www.RexerAnalytics.com or by calling +1 617-233-8185.*

]]>

As the R programming environment has grown in capability and popularity, so has the number of organizations planning to migrate to it from proprietary tools. I’ve helped members of various organizations transition from SAS, SPSS and/or Stata to R (see Workshop Participants), and the process typically involves the following steps:

1) Begin with the most important question: who should you migrate to R? Learning R is not a trivial task (see Why R is Hard to Learn). However, once mastered by people who use it regularly, I think it’s easier to use than other software. But if you have some people who use something like SAS only occasionally and view it as hard to use, you might consider getting them something other than R. Menu-based solutions such as SPSS or R Commander may be a better fit for them. If they want to continue using SAS while lowering your licensing costs, you might consider the SAS implementation used in WPS (see World Programming).

2) Motivate people to migrate. Discussing your current software budget may help. Showing your staff the growth of R’s capabilities and popularity may also help (see The Popularity of Data Analysis Software). Keep in mind that attempting to motivate people to change by criticizing their current choice is likely to backfire. People’s choice of software is very personal and criticizing it is like telling them they have the wrong religion.

3) Use training & documentation that leverages what they already know, that speaks their language. A trainer who knows both your existing environment and R can convert what the analysts currently know rather than simply starting from scratch (note that this is self-serving advice!). There are two parts to this process: learning the new R code and learning to interpret the new R output. Choosing to use R packages that provide output similar to your current software choice will help smooth the transition. Good sources for training are listed here: Training and Consulting Partners – RStudio and here: R for SAS, SPSS and STATA Users | DataCamp. Books that help with the conversion process include:

R for SAS and SPSS Users, Muenchen

SAS and R, Kleinman & Horton

R for Stata Users, Muenchen & Hilbe

R Through Excel, Heiberger & Neuwirth (for those who use the SAS Excel plug-in)

4) Provide in-house tech support. Before training a whole team, get one in-house expert trained to act as a consultant to others. Make sure this person is well known by everyone and has time freed up to provide help.

5) Match your staff’s current work style, work flow and output. This is a particularly complex topic. Some examples: if your people are running SAS from the Excel plug-in, get the R plug-in; if they’re using Enterprise Miner, consider a similar interface that controls R such as the KNIME Analytics Platform. If Microsoft Word is their main word processor, don’t complicate the conversion by switching them to the LaTeX text processor at the same time (LaTeX is wonderful and very popular among R users, but it’s a mess to learn that and R at the same time). Instead, use an approach that generates Word output.

6) Migrate one step at a time if possible. For example, if you use SAS/ETS for forecasting, consider replacing just that one piece. When finished and successful, choose the next product, saving SAS/Base for last.

7) Convert your programs or use conversion services. If your programs are all in production, this could be a huge job. However, if you mostly use SAS for new research tasks, you may not need to convert old code from which you just needed a solution. Be careful to avoid line-by-line conversion; think like R (e.g. avoid for- and while-loops in R). When using external conversion services, make sure to involve your own staff in the process so you don’t end up with code that’s almost impossible to maintain.
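To make the “think like R” advice in step 7 a bit more concrete, here is a minimal sketch comparing a line-by-line loop translation with its vectorized equivalent. The variable names and values are made up purely for illustration; they don’t come from any real conversion project.

```r
# Hypothetical data: these values are invented for illustration only.
height_cm <- c(150, 165, 180)
weight_kg <- c(50, 68, 81)

# A line-by-line translation of a loop from another language.
# It works, but it is not idiomatic R.
bmi_loop <- numeric(length(height_cm))
for (i in seq_along(height_cm)) {
  bmi_loop[i] <- weight_kg[i] / (height_cm[i] / 100)^2
}

# The vectorized equivalent: arithmetic operators act on whole vectors at once.
bmi_vec <- weight_kg / (height_cm / 100)^2

# Both approaches give the same result.
all.equal(bmi_loop, bmi_vec)
```

The vectorized form is shorter and generally easier to maintain, which matters if someone else on your staff inherits the converted code.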

I have found that following these steps helps during conversions to R. It’s a big job, though, so allocate plenty of time to it. Good luck!

]]>

Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 260,000 times in 2014. If it were an exhibit at the Louvre Museum, it would take about 11 days for that many people to see it.

Click here to see the complete report.

]]>

Each year, the Gartner Group, “the world’s leading information technology research and advisory company”, collects data in a survey of the customers of 42 business intelligence firms. They recently released the data on the customers’ plans to discontinue use of their current software in one to three years. The results are shown in the figure below. Over 16% of the SAS Institute customers surveyed reported considering discontinuing their use of the software, the highest of any of the vendors shown. It will be interesting to see if this will actually lead to an eventual decline in revenue. Although I have helped quite a few organizations migrate from SAS to R, I would be surprised to see SAS Institute’s revenue decline. They offer excellent software and service which I still use, though not anywhere near as much as R.

The full Gartner report is available here.

]]>

My next two live webinars done in partnership with Revolution Analytics are in January:

R for SAS, SPSS and Stata Users

Managing Data with R (updated to include dplyr, broom, tidyr, etc.)

Course outlines and registration for both are here.

My R for SAS, SPSS and Stata Users workshop is also now available as a self-paced interactive video workshop at DataCamp.com.

I do site visits in partnership with RStudio.com, whose software I recommend and use in every form of my training. If your company does its training through Xerox Learning Services, I also partner with them. For further details or to arrange a site visit, you can reach me at muenchen.bob@gmail.com.

]]>

]]>

MEAN.5(Q1 TO Q10) asks for the mean only if at least five of the ten variables have valid values. Otherwise the result will be a missing value. This “.n” extension is also available for SPSS’ SUM, SD, VARIANCE, MIN and MAX functions.

Let’s now take a look at how to do this in R. First we’ll create some data with different numbers of missing values for each observation.

    > q1 <- c(1, 1, 1)
    > q2 <- c(2, 2, NA)
    > q3 <- c(3, NA, NA)
    > df <- data.frame(q1, q2, q3)
    > df
      q1 q2 q3
    1  1  2  3
    2  1  2 NA
    3  1 NA NA

R already has a mean function, but it lacks a function to count the number of valid values. A common way to do this in R is to use the is.na() function to generate a vector of TRUE/FALSE values for missing or not, respectively, then sum them. As with many software packages, R views TRUE as having the value 1 and FALSE as having a value of 0, so this approach gets us the number of missing values. The “!” symbol means “not” in R so !is.na() will find the number of *non*-missing values. Here’s a function that does this:

    > nvalid <- function(x) sum(!is.na(x))
    > nvalid(q2)
    [1] 2

So it has found that there are two valid values for q2. This nvalid() function obviously works on vectors, but we need to apply it to the rows of our data frame. We can select the first three variables using df[1:3] and then pass the result into as.matrix() to make the rows easily accessible by R’s apply() function. The apply() function’s second argument is 1, indicating that we would like to work across rows (the value 2 would indicate columns). The final arguments are the function to apply and any arguments it needs.

    > means <- apply(as.matrix(df[1:3]), 1, mean, na.rm = TRUE)
    > counts <- apply(as.matrix(df[1:3]), 1, nvalid)
    > means
    [1] 2.0 1.5 1.0
    > counts
    [1] 3 2 1

We have our means and the counts of valid values, so all that remains is to choose our desired value of counts and accept the mean if the data have that value or greater, but return a missing value (NA) if not. This can be done using the ifelse() function, whose first argument is the logical condition, followed by the value desired when TRUE, then the value when false.

    > means <- ifelse(counts >= 2, means, NA)
    > means
    [1] 2.0 1.5  NA

We’ve seen all the parts work, so all that remains is to put them together into a single function that has two arguments, one for the data frame and one for the n required.

    mean.n <- function(df, n) {
      means <- apply(as.matrix(df), 1, mean, na.rm = TRUE)
      nvalid <- apply(as.matrix(df), 1, function(df) sum(!is.na(df)))
      ifelse(nvalid >= n, means, NA)
    }

Let’s test our function requiring 1, 2 and 3 valid values.

    > df$mean1 <- mean.n(df[1:3], 1)
    > df$mean2 <- mean.n(df[1:3], 2)
    > df$mean3 <- mean.n(df[1:3], 3)
    > df
      q1 q2 q3 mean1 mean2 mean3
    1  1  2  3   2.0   2.0     2
    2  1  2 NA   1.5   1.5    NA
    3  1 NA NA   1.0    NA    NA

That looks good. You could apply this same idea to various other R functions such as sd() or var(). You could also apply it to sum() as SPSS does, but I rarely do that. If you were creating a scale score from a set of survey Likert items measuring agreement and a person replied “strongly agree” (a value of 5), to only half the items but skipped the others, would you want the resulting score to be a neutral value as the sum would imply, or “strongly agree” as the mean would indicate? The mean makes much more sense in most situations. Be careful though as there are standardized tests that require use of the sum.
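As a rough sketch of how the same idea might extend to other functions, here is one possible generalization. The name stat.n() and its FUN argument are my own inventions for illustration, not part of any package, and the approach assumes the chosen function accepts na.rm = TRUE.

```r
# Hypothetical generalization of mean.n(): FUN can be any summary function
# that accepts na.rm = TRUE, such as mean, sd, var or sum.
stat.n <- function(df, n, FUN = mean) {
  m      <- as.matrix(df)
  stats  <- apply(m, 1, FUN, na.rm = TRUE)            # statistic for each row
  nvalid <- apply(m, 1, function(x) sum(!is.na(x)))   # valid values in each row
  ifelse(nvalid >= n, stats, NA)
}

# The same toy data used earlier.
df <- data.frame(q1 = c(1, 1, 1), q2 = c(2, 2, NA), q3 = c(3, NA, NA))

stat.n(df, 2)             # mean, requiring at least 2 valid values
stat.n(df, 2, FUN = sum)  # sum, with the same requirement
stat.n(df, 2, FUN = sd)   # standard deviation
```

Passing the function as an argument avoids writing a separate wrapper for each statistic, at the cost of trusting the caller to supply a function that understands na.rm.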

If you’re an SPSS user looking to learn just enough R to use the two together, you might want to read this, or to learn more you could take one of my workshops. If you really want to dive into the details, you might consider reading my book, R for SAS and SPSS Users.

]]>

Hadley Wickham’s dplyr and tidyr. The dplyr package almost completely replaces his popular plyr package for data manipulation. Most importantly for general R use, it makes it much easier to select variables. For example, if your data included variables for race, gender, pretest, posttest, and four survey items q1 through q4, you could select various sets of variables using:

    library("dplyr")
    select(mydata, race, gender)          # Just those two variables.
    select(mydata, gender:posttest)       # From gender through posttest.
    select(mydata, contains("test"))      # Gets pretest & posttest.
    select(mydata, starts_with("q"))      # Gets all vars starting with "q".
    select(mydata, ends_with("test"))     # All vars ending with "test".
    select(mydata, num_range("q", 1:4))   # q1 thru q4 regardless of location.
    select(mydata, matches("^q"))         # Matches any regular expression.

As I show in my books, these were all possible in R before, but they required much more programming.
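For comparison, here is a sketch of how some of those selections might look in base R. This is just one of several possible approaches and is not taken from the books; the small mydata data frame is hypothetical, built only so the example runs.

```r
# Hypothetical data frame containing the variables named in the text.
mydata <- data.frame(
  race     = c("A", "B"),
  gender   = c("f", "m"),
  pretest  = c(10, 12),
  posttest = c(14, 15),
  q1 = c(1, 2), q2 = c(3, 4), q3 = c(5, 1), q4 = c(2, 2)
)
vars <- names(mydata)

two_vars  <- mydata[c("race", "gender")]       # Just those two variables.
span      <- mydata[match("gender", vars):match("posttest", vars)]  # gender through posttest.
has_test  <- mydata[grep("test", vars)]        # Names containing "test".
starts_q  <- mydata[grep("^q", vars)]          # Names starting with "q".
ends_test <- mydata[grep("test$", vars)]       # Names ending with "test".
q1_to_q4  <- mydata[paste0("q", 1:4)]          # q1 thru q4 regardless of location.

names(span)
```

Each selection requires looking up positions or matching names by hand, which is the extra programming that the dplyr helper functions hide.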

The tidyr package replaces Hadley’s popular reshape and reshape2 packages with a data reshaping approach that is simpler and more focused just on the reshaping process, especially converting from “wide” to “long” form and back.
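A minimal sketch of that wide-to-long round trip follows. It assumes a recent version of tidyr, in which pivot_longer() and pivot_wider() are the recommended functions (the original gather() and spread() functions did the same job). The data frame and column names are hypothetical.

```r
library("tidyr")

# Hypothetical "wide" data: one row per person, one column per survey item.
wide <- data.frame(id = 1:2, q1 = c(5, 3), q2 = c(4, 2))

# Wide to long: one row per person-item combination.
long <- pivot_longer(wide, cols = c(q1, q2),
                     names_to = "item", values_to = "score")

# Long back to wide.
wide2 <- pivot_wider(long, names_from = item, values_from = score)

long
wide2
```

The long form is what many modeling and plotting functions expect, while the wide form is often how survey data arrive.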

I’ve integrated dplyr into my workshop R for SAS, SPSS and Stata Users, and both tidyr and dplyr now play extensive roles in my Managing Data with R workshop. I’m presenting the next Virtual Instructor-led Classroom (webinar) version of those workshops in partnership with Revolution Analytics during the week of October 6, 2014. I’m also available to teach them at your organization’s site in partnership with RStudio.com (contact me at Muenchen.bob@gmail.com to schedule a visit). These workshops will also soon be available 24/7 at Datacamp.com. “You’ll be able to take Bob’s popular workshops using an interactive combination of video and live exercises in the comfort of your own browser” said Jonathan Cornelissen, CEO of Datacamp.com.

]]>

Here is my latest update to *The Popularity of Data Analysis Software*. To save you the trouble of reading all 25 pages of that article, the new section is below. The two most interesting nuggets it contains are:

- As I covered in my talk at the UseR 2014 meeting, it is very likely that during the summer of 2014, R became the most widely used analytics software for scholarly articles, ending a spectacular 16-year run by SPSS.
- Stata has probably passed Statistica in scholarly use, and its rapid rate of growth parallels that of R.

If you’d like to be alerted to future updates on this topic, you can follow me on Twitter, @BobMuenchen.

**Scholarly Articles**

The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a good leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Analytics Articles. Since Google regularly improves its search algorithm, I re-collect the data for all years following the protocol described at http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/.

Figure 2a shows the number of articles found for each software package for all the years that Google Scholar can search. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. Note that the general purpose software MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest. Neither C nor C++ is included here because it’s very difficult to focus the search compared to the search for jobs above, whose job descriptions commonly include a clear target of skills in “C/C++” and “C or C++”.

From RapidMiner on down, the counts appear to be zero. That’s not the case, but relative to the others, it might as well be.

Figure 2b shows the number of articles for the most popular six classic statistics packages from 1995 through 2013 (the last complete year of data when this graph was made). As in Figure 2a, SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.

Since SAS and SPSS dominate the vertical space in Figure 2a by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, in Figure 2c. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. In fact, extending the downward trend of SPSS and the upward trend of R makes it likely that sometime during the summer of 2014 R became the most dominant package for analytics used in scholarly publications. Due to the lag caused by the publication process, getting articles online, indexing them, etc., we won’t be able to verify that this has happened until well into 2015 (correction: this said 2014 when originally posted).

After R, Statistica is in fourth place and growing, but at a much lower rate. Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”

Extrapolating from the trend lines, it is likely that the use of Stata among academics passed that of Statistica fairly early in 2014. The remaining three packages, Minitab, Systat and JMP are all growing but at a much lower rate than either R or Stata.

]]>