How to Search for Data Science Articles

By Robert A. Muenchen

This article describes the technical details of how to search for scholarly articles in the field of data science. The goal is not to optimize the search for a particular article, but rather to find all articles that either use or write about a particular software package. Counts of these articles are then used to estimate the market share of each package. The results are displayed and discussed in The Popularity of Data Analysis Software.

Overview

Here, in brief, are the steps I use to search at Scholar.Google.com:

1. For software with unambiguous names, simply search on their names in quotes. Often the quotes are not needed, but it can be difficult to determine when they are. For example, Tibco’s Spotfire program has a unique name, but Google Scholar considers articles about firefighting that include the separate terms “spot fire” to be equivalent to “spotfire” unless you enclose the name in quotes.

2. Scholarly papers are supposed to cite the software they use in a standard way. For example, a paper that uses SAS should reference the vendor, “SAS Institute”. To verify how well such a citation works, it’s good to search for its opposite. For example, the search string “SPSS” -“SPSS, Inc.” excludes the vendor (the minus sign “-” excludes) but still finds hits relevant to the SPSS Statistics product. In contrast, a similar search for SAS, “SAS” -“SAS Institute”, returns mostly irrelevant hits, including many authors whose first or last name happens to be “Sas”. The abbreviation “S.A.S.” is also the Spanish equivalent of “Inc.”

Searching by vendor name is very helpful, but authors don’t always cite their software. Surprisingly, authors occasionally cite the vendor but not the package. That is the case for Statsoft’s Statistica. Statistica means “statistics” in Italian, which perhaps leads authors to write statements like, “we used the statistics package from Statsoft.” Now that Dell has purchased Statsoft, this will cause trouble, as “Dell” is not a useful search term by itself.

3. Some software has add-on packages that are so well known that the main package may not be mentioned at all. For example, R’s ggplot2 package may be cited without reporting that R itself was used.

4. Some software names, especially Fico, Java, Python, SAS, and Scala, are also common names of people and/or geographic locations. Authors can be excluded from Google Scholar searches with “-author:java”. Unfortunately, this exclusion applies only to the author field, not to the references at the end of a paper. That means the counts for these packages are inflated by cited authors with those names unless the search string excludes them in some other way.

5. General-purpose programming languages are cited most often for tasks that have nothing to do with data science. Adding inclusion terms helps focus the search. Examples include categories such as “machine learning” and specific methods such as “regression analysis.” These search terms were added only in English, so the results underestimate the true data science usage of this type of software.

To make the comparisons among packages most equitable, it would be ideal to include the same set of inclusion terms for all the software studied here. However, that would mean under-counting the use of the special-purpose software, which I prefer to avoid. As a simplified example, I don’t search for “SAS and regression” but I do search for “Java and regression”. The actual, more complex, searches are below.

6. Regarding logic, Google uses a blank between terms to represent logical “AND” (the plus sign is no longer accepted for this purpose). To perform logical “or”, you must type “OR” in capital letters, or Google will search for the word “or”! Parentheses prioritize the order of the logic as usual.

7. Regarding special characters, Google Scholar ignores them. For example, “H2O.ai” is the name of the company supporting the data science software H2O, but Google will ignore the period and find references to water and artificial intelligence that have nothing to do with that company.
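The rules above can be sketched in a few lines of Python. This is only an illustration of how the pieces of a search string fit together; the helper names (phrase, exclude, any_of, all_of) are hypothetical, and the code does nothing more than build text to paste into the Google Scholar search box.

```python
def phrase(term: str) -> str:
    """Wrap a term in double quotes so it is matched exactly (step 1)."""
    return f'"{term}"'


def exclude(term: str, exact: bool = False) -> str:
    """Prefix a term with a minus sign to exclude it (steps 2 and 4)."""
    return "-" + (phrase(term) if exact else term)


def any_of(terms: list[str]) -> str:
    """Join quoted terms with a capitalized OR inside parentheses (step 6)."""
    return "(" + " OR ".join(phrase(t) for t in terms) + ")"


def all_of(*parts: str) -> str:
    """A blank between parts acts as a logical AND (step 6)."""
    return " ".join(parts)


# The vendor check from step 2: SPSS hits that do not cite the vendor.
spss_check = all_of(phrase("SPSS"), exclude("SPSS, Inc.", exact=True))

# A shortened version of a focused search for a general-purpose language.
python_query = all_of(
    "python",
    exclude("author:python"),
    exclude("snake"),
    any_of(["machine learning", "data mining"]),
)

print(spss_check)
print(python_query)
```

Hard-coding the capitalized OR in one helper avoids the trap described in step 6, where a lowercase “or” is searched for as an ordinary word.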

Software Names and Their Search Terms

Here is a list of the actual search terms I used for each piece of software.

Actuate: "Actuate BIRT"

Alpine:  "Alpine Data Labs"

Alteryx: "Alteryx"

Angoss:   "Angoss"

Azure Machine Learning: "Azure Machine Learning"

BMDP:    "BMDP" -marrow 
[removing Bone Marrow Donation Program]

C++ / C#: 
("C++" OR "C#") ("statistical analysis" OR 
"t test" OR "regression analysis" OR 
"quantitative analysis" OR 
"data analytics" OR "machine learning" OR 
"artificial intelligence" OR 
"analysis of variance" OR "anova" OR 
"chi square" OR "data mining")

Enterprise Miner: "Enterprise Miner"

FICO: [Not tracking this as I've found it impossible to
separate the data science from the credit checking even
using inclusion factors.]

H2O: Oxdata H20
[Note: This is a really difficult search to do any other way.
 But Oxdata is changing its name to H2O.ai, which is a VERY
 difficult search due to the prevalence of both "H2O" and "ai" 
 in research papers.]

Hadoop: "Hadoop"

Infocentricity: "Infocentricity"
[They were purchased by Fico, so I'm dropping coverage]
[Note: do not add "OR Xeno", which is useful when 
 searching for jobs. It adds a great deal of ambiguity 
 to this type of search.]

Java: 
java -author:java -weka -"Practical Machine Learning" 
-indonesia ("statistical analysis" OR "t test" OR 
 "regression analysis" OR "quantitative analysis" OR 
 "data analytics" OR "machine learning" OR 
 "artificial intelligence" OR "analysis of variance" OR 
 "anova" OR "chi square" OR "data mining")

JMP: "JMP" "SAS Institute"

Julia: 
"Julia: A Fast Dynamic Language for Technical Computing"

KNIME: KNIME

KXEN:  KXEN

Lavastorm: "Lavastorm"

MATLAB:  
"MATLAB" ("statistical analysis" OR "t test" OR 
"regression analysis" OR "quantitative analysis" OR 
"data analytics" OR "machine learning" OR 
"artificial intelligence" OR "analysis of variance" OR 
"anova" OR "chi square" OR "data mining")

Megaputer: "Megaputer" OR "Polyanalyst"

Minitab: "Minitab" 

NCSS: "Number Cruncher Statistical System"
[Cannot use "NCSS" for this as it stands for over 
 15 organizations]

Pentaho: "Pentaho" 

PolyAnalyst: "PolyAnalyst" 

Prognoz: "Prognoz Platform"
[Note: Prognoz means forecast in Polish.]

Python: 
python -author:python -snake 
("statistical analysis" OR "t test" OR 
 "regression analysis" OR 
 "quantitative analysis" OR "data analytics" OR 
 "machine learning" OR "artificial intelligence" OR 
 "analysis of variance" OR "anova" OR "chi square" OR 
 "data mining") 

R: 
"r-project.org" OR "R development core team" OR "lme4" OR 
"bioconductor" OR "RColorBrewer" OR "the R software" OR 
"the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR 
"knitr" OR "RODBC" OR "stringr" OR "mass package"

RapidMiner: "RapidMiner" 

Revolution Analytics: "Revolution Analytics"
[Note: Merged with Microsoft so keywords are uncertain.]

Salford Systems: "Salford Systems" 

SAP: "SAP" "KXEN"

SAS: 
"SAS Institute" -JMP -"Enterprise Miner"
[Note: This undercounts SAS slightly, but I haven't found
 a way around the problem given that "Sas" is a popular 
 first and last name for authors. Also, in Spanish, 
 "S.A.S." is the equivalent of "Inc." in English 
 (Sociedad por acciones simplificadas).]

SAS Enterprise Miner: "Enterprise Miner"

Scala: "Scala language" OR "language Scala"
[This cuts Scala a lot of slack compared to Java or Python.
 Like the Julia search, it counts all uses of the language,
 not just data science. If it gets more popular in the
 future, I'll add data science terms to the search.]

Spark: "Apache Spark"

Spotfire: "Spotfire" -fire -burn
[I've stopped collecting this data.]

SPSS: SPSS -"SPSS Modeler" -"Amos"
[The letters "SPSS" stand for only a few other rare topics
 that I estimate results in over-counting by only 0.28%.]

SPSS Modeler: "SPSS Modeler" 

Stata: 
("stata" "college station") OR "StataCorp" OR "Stata Corp" OR 
"Stata Journal" OR "Stata Press" OR "stata command" OR 
"stata module" 
[Note: "stata" means "was" in Italian, so if you search
 for it by itself, you'll get a greatly inflated number of hits!]

Statgraphics: "Statgraphics" 

Statistica: "Statsoft" 

Systat: "Systat" 

Tableau: 
"Tableau Software" OR "Tableau Desktop" OR 
"Tableau Online" OR "Tableau Server" 
[Don't include "Tableau Public", it's a common French term.] 

Tibco: "Tibco Spotfire" OR "Tibco TERR" OR "Tibco Enterprise"

WEKA: 
WEKA ("machine learning" OR "data mining")
[Note: The following search string, used in previous years
 (before March 2015), under-counted:
 "WEKA Data Mining" OR 
 "Waikato Environment for Knowledge Analysis"]

Inclusion Terms

While many of the packages are clearly focused on data science, the more general-purpose ones (C++, C#, Java and Python) are not. So to determine the best way to focus the searches, I compiled a list of relevant terms commonly used in scholarly papers, then searched for documents that included them, one term at a time. I counted the number of documents for each term and tracked how likely it was to result in an accurate hit. The latter was done using the time-honored “I know it when I see it” approach. (I’m quite familiar with advanced text analytics, but I don’t have time to extract all the data and do it.) The items marked with a “*” below show the terms used. These counts were collected on 5/11/2014, but given that I was searching across all years, the prevalence of the various terms is likely to shift slowly as time passes.
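To make the focusing step concrete, here is a minimal sketch of how one shared inclusion clause can be appended to each general-purpose language's base search. The term list and base searches are copied from the C++ / C#, Java, MATLAB and Python entries above; the names INCLUSION_TERMS, inclusion_clause and focused_query are my own hypothetical helpers, and the code only assembles text.

```python
# Inclusion terms, in the order they appear in the searches listed earlier.
INCLUSION_TERMS = [
    "statistical analysis", "t test", "regression analysis",
    "quantitative analysis", "data analytics", "machine learning",
    "artificial intelligence", "analysis of variance", "anova",
    "chi square", "data mining",
]


def inclusion_clause(terms: list[str] = INCLUSION_TERMS) -> str:
    """Build ("term 1" OR "term 2" OR ...) from a list of phrases."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def focused_query(base: str) -> str:
    """Append the shared inclusion clause to a language's base search."""
    return f"{base} {inclusion_clause()}"


# Base searches taken from the list of search terms above.
BASE_SEARCHES = {
    "C++ / C#": '("C++" OR "C#")',
    "Java": 'java -author:java -weka -"Practical Machine Learning" -indonesia',
    "MATLAB": '"MATLAB"',
    "Python": "python -author:python -snake",
}

queries = {name: focused_query(base) for name, base in BASE_SEARCHES.items()}

for name, query in queries.items():
    print(f"{name}:\n{query}\n")
```

Keeping the terms in a single list means every general-purpose language is filtered by exactly the same clause, which keeps those comparisons consistent with one another.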

Search Terms                     Number of Articles
Survey (not well focused)        5,300,000 
Statistical (not well focused)   4,860,000 
Statistics (not well focused)    4,770,000 
Statistical analysis *           3,670,000 
t test *                         3,480,000
Regression analysis *            2,920,000 
Linear regression                2,650,000 
Quantitative analysis *          2,570,000 
Data analytics *                 2,380,000 
Machine learning *               1,740,000 
Artificial intelligence *        1,720,000 
Analysis of variance             1,570,000
Chi square                       1,490,000
ANOVA                            1,340,000 
Survey research                  1,230,000 
Data mining *                    1,210,000 
Statistical software *           1,120,000 
logistic regression              1,080,000 
nonparametric                      800,000 
Analytics (not well focused)       519,000 
Statistical package                347,000 
Decision trees                     169,000 
Business intelligence              146,000 
Statistical modeling *             145,000 
Analyze data (not well focused)    125,000 
Big data *                          51,700 
Predictive modeling *               39,400 
Predictive analytics *               9,540 
Business analytics *                 7,660 
Advanced analytics                   3,700
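For anyone who wants to work with these counts programmatically, the following is a rough sketch of one way to read rows in the layout used above. The regular expression and the parse_row helper are assumptions of mine, not part of the original method, and the sample contains only four rows copied from the table.

```python
import re

# One row: term, optional "(not well focused)", optional "*", then the count.
ROW = re.compile(
    r"^(?P<term>.+?)\s*"
    r"(?P<unfocused>\(not well focused\))?\s*"
    r"(?P<used>\*)?\s+"
    r"(?P<count>[\d,]+)\s*$"
)


def parse_row(line: str) -> dict:
    """Split one table row into its term, flags, and integer count."""
    match = ROW.match(line)
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return {
        "term": match["term"],
        "unfocused": match["unfocused"] is not None,
        "used": match["used"] is not None,
        "count": int(match["count"].replace(",", "")),
    }


# Four rows copied from the table above.
SAMPLE = """\
Survey (not well focused)        5,300,000
Statistical analysis *           3,670,000
Linear regression                2,650,000
Big data *                          51,700
"""

rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
used_terms = [row["term"] for row in rows if row["used"]]
print(used_terms)
```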

I’m very interested in improving this methodology, so if you have ideas, please comment below or send me an email at muenchen.bob@gmail.com.