*by Bob Muenchen*

Analytics tools take significant effort to master, so once learned, people tend to stick with them for much of their careers. This makes the tools used in academia of particular interest in the study of future trends in market share. I’ve been tracking The Popularity of Data Analysis Software regularly since 2010, and thanks to an astute reader, I now have a greatly improved estimate of Stata’s academic growth. Peter Hedström, Director of the Institute for Analytical Sociology at Linköping University, wrote to me convinced that I was underestimating Stata’s role by a wide margin, and he was right.

Two things make Stata’s popularity difficult to gauge: 1) Stata means “been” in Italian, and 2) it’s a common name for the authors of scholarly papers and those they cite. Peter came up with the simple, but very effective, idea of adding StataCorp’s headquarters, College Station, Texas, to the search. That helped us find far more Stata articles while blocking the irrelevant ones. Here’s the search string we came up with:

("Stata" "College Station") OR "StataCorp" OR "Stata Corp" OR "Stata Journal" OR "Stata Press" OR "Stata command" OR "Stata module"

The blank between Stata and College Station is an implied logical “and”. This string found 20% more articles than my previous one. This success motivated me to try to improve some of my other search strings. R and SAS are both difficult to search for because those letters so often stand for other things. I was able to improve my R search string by 15% using this:

"r-project.org" OR "R development core team" OR "lme4" OR "bioconductor" OR "RColorBrewer" OR "the R software" OR "the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR "knitr" OR "RODBC" OR "stringr" OR "mass package"

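Both strings above follow the same pattern: exact phrases in double quotes joined by OR, with a parenthesized group wherever terms must appear together. For anyone who wants to assemble such strings from a list of terms, here is a minimal Python sketch. The function names `quote` and `build_query` are my own illustration, not part of any Google Scholar tool, and the sketch only builds the text of the query; it does not run a search.

```python
def quote(term):
    """Wrap a term in double quotes so it is searched as an exact phrase."""
    return f'"{term}"'


def build_query(alternatives):
    """Join alternatives with OR.

    A plain string becomes one quoted phrase. A tuple becomes a
    parenthesized group whose phrases are separated by blanks,
    which the search engine treats as an implied AND.
    """
    parts = []
    for alt in alternatives:
        if isinstance(alt, str):
            parts.append(quote(alt))
        else:
            parts.append("(" + " ".join(quote(t) for t in alt) + ")")
    return " OR ".join(parts)


# Rebuild the Stata search string shown earlier in the post.
stata_query = build_query([
    ("Stata", "College Station"),
    "StataCorp",
    "Stata Corp",
    "Stata Journal",
    "Stata Press",
    "Stata command",
    "Stata module",
])
print(stata_query)
```

Keeping the terms in a list makes it easy to add or drop one term at a time when testing which combination finds the most articles.
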
Despite hours of effort, I was unable to improve on the simple SAS search string of “SAS Institute.” Google Scholar’s logic seems to fall apart since “SAS Institute” OR “SAS procedure” finds fewer articles! If anyone can figure that out, please let me know in the comments section below. As usual, the steps I use to document all searches are detailed here.

The improved search strings have affected all the graphs in the Scholarly Articles section of The Popularity of Data Analysis Software. At the request of numerous readers, I’ve also added a log-scale plot there which shows the six most popular classic statistics packages:

If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do training on-site but I’m often booked about 8 weeks out.

I invite you to follow me on this blog and on Twitter.

It is interesting that Stata is gaining such momentum given R’s cost and unmatched power… No bias here…

Hi Onlyrey,

If you’re implying that this is biased, why not prove it with data? It’s very easy to get the data.

Cheers,

Bob

Hi Bob, n=1 says R is better, AND by saying there is ‘no bias here’, I’m implying that I am biased. Therefore, there is no amount of data that can prove me wrong ;) I use and like R, but on the other hand, I am also an Aggie and cheer when College Station shows up anywhere around my biased lenses.

Oh I get it now. I thought you were being sarcastic. I have to agree, R and Stata have so many attributes in common that with R being free, I’m totally surprised by this outcome! Two advantages contribute to this situation I think: Stata has a nice GUI that makes it approachable by occasional users, and it has none of the problems that I describe in Why R is Hard to Learn.

Most users of statistical packages don’t pay for the software; their employer does, and often at a fairly reasonable cost. So, the opportunity cost of using one package vs. another is more about the learning and user experience than the actual price. STATA has substantial penetration in certain disciplines (e.g. economics), has made substantial improvements in documentation and its GUI over the last few years, and the switching costs are simply difficult to overcome. Also, it is unfortunately easy to pirate STATA, so price really has little to do with it.

“CRAN” is usually cited too, as most packages come from there. The combination of CRAN and R is also unique.

Here’s how I work toward finding the optimal search string: choose all the single terms I can think of such as “CRAN”, “R-Project”, etc. and search on them one at a time in the most recent complete year (2014). Then I sort them by how successful they are and start testing combinations. Often I’ll find that the top 4 or 5 overlap so much that only one is needed in the final search string. Google Scholar limits the length of the search string so you can’t just keep adding terms. In fact, even when a term will fit into its limit, for inexplicable reasons, sometimes adding a term with “OR” causes Google Scholar to find fewer papers! That makes no sense at all. I invite you to find a search string that can find more articles than mine and report it here.

Cheers,

Bob
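
The procedure Bob describes (rank single terms by hits, then test combinations and keep only terms that help) can be sketched as a simple greedy loop in Python. This is one possible formalization, not necessarily the exact steps he follows. Google Scholar has no official API, so `count_hits` is a hypothetical callback, and the toy index, article IDs, and `max_len` value below are invented purely for illustration.

```python
def greedy_query(terms, count_hits, max_len=256):
    """Greedily pick terms that raise the hit count.

    terms      -- candidate search terms
    count_hits -- hypothetical callback: list of terms -> number of hits
    max_len    -- assumed cap on query length (the real limit is not stated)
    """
    # Step 1: rank single terms by how many articles each finds alone.
    ranked = sorted(terms, key=lambda t: count_hits([t]), reverse=True)

    chosen, best = [], 0
    for term in ranked:
        candidate = chosen + [term]
        query = " OR ".join(f'"{t}"' for t in candidate)
        if len(query) > max_len:
            continue  # term does not fit within the length limit
        hits = count_hits(candidate)
        # Step 2: keep the term only if the combined count goes up.
        # This also guards against cases where adding a term
        # makes the engine report fewer papers.
        if hits > best:
            chosen, best = candidate, hits
    return chosen, best


# Toy stand-in for a search engine: each term maps to made-up article IDs.
TOY_INDEX = {
    "CRAN": {1, 2, 3, 4, 5},
    "R-Project": {1, 2, 3, 4},  # fully overlaps with CRAN
    "ggplot2": {6, 7},
    "lme4": {5, 8},
}


def toy_count(terms):
    """Count distinct toy articles matched by any of the terms."""
    found = set()
    for t in terms:
        found |= TOY_INDEX[t]
    return len(found)


chosen, hits = greedy_query(TOY_INDEX, toy_count)
print(chosen, hits)
```

In the toy data, “R-Project” is dropped because every article it finds is already found by “CRAN”, which mirrors the observation that the top few terms often overlap so much that only one is needed.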

I know many people who work with Stata; it’s fairly common in economics. Different languages are used by different communities.

Regarding SAS hits, there is another annoyance from the Italian language that could inflate the count: SAS is a type of company in Italy, so some of the SAS hits may simply be Italian companies. http://it.wikipedia.org/wiki/Societ%C3%A0_in_accomandita_semplice

Hi Geuzeke,

That’s an excellent point. If you’re in economics, public health, demography, etc. then you’d do well to learn Stata since it’s very dominant in those areas.

I knew S.A.S. stands for “Inc.” or incorporated, in Spanish, but I didn’t know that it meant that in Italian too. Thanks for pointing that out!

Cheers,

Bob

Thank you for the insight. I use R and Stata. R is very handy for preparing raw data for analysis (80% of my job), and RStudio deserves special mention here because it makes it easy for new users to adopt R (I was surprised to see that RStudio was not explicitly included in the search terms). I was also surprised to see that the data.table package was not included. As a regular StackOverflow user, I can say, without any doubt, that data.table has become standard in R just like dplyr (or plyr) or rcpp or ggplot. I deal with data consisting of millions of observations (an emerging trend), and without data.table, this would not be possible in R. I still use Stata because I need advanced econometric techniques. I think R has packages for these too, but the documentation is not helpful enough for me to switch to R. I think this will change in the near future, though.

Hi David,

Unfortunately data.table would not be a good search term because Google Scholar ignores punctuation. So it would find any article that mentioned the use of a “data table”. RStudio has potential though. I use it myself but I didn’t test it since I figured people would be unlikely to mention what editor (or IDE) they used.

Thanks,

Bob
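
To see why “data.table” makes a poor search term, here is a minimal Python sketch of punctuation-insensitive matching. It assumes that “ignores punctuation” behaves like replacing punctuation with blanks before comparing; Google Scholar’s actual matching rules are not documented here, and the two sample sentences are made up.

```python
import re


def normalize(text):
    """Lowercase and replace each run of non-alphanumeric characters with one blank."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def phrase_matches(phrase, document):
    """Return True if the normalized phrase appears as whole words in the document."""
    return f" {normalize(phrase)} " in f" {normalize(document)} "


docs = [
    "We used the data.table package in R.",
    "Results are summarized in the data table below.",
]

# Under this model both sentences match, although only the first is about the package.
matches = [phrase_matches("data.table", d) for d in docs]
print(matches)

# A term with no internal punctuation, such as RStudio, does not have this problem.
rstudio_matches = [phrase_matches("RStudio", d) for d in docs]
print(rstudio_matches)
```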

I use only R nowadays. I used STATA and MATA previously. Switching to R wasn’t that hard for me because I had used C++ and Fortran in the past. But a question like this:

http://stackoverflow.com/questions/30236487/replicating-stata-probit-with-robust-errors-in-r

on Stack Exchange makes me wonder whether we are over-trusting the results from STATA. I trust R because it is transparent, and if I want, I can change the code easily.

Hi Yun Wen,

I agree that it’s important to be able to see and understand every aspect of what the program is doing. That’s a big advantage of R.

Cheers,

Bob
