How to Search for Data Science Articles

By Robert A. Muenchen, updated 12/30/2023

This article describes the technical details of how to search for scholarly articles in the field of data science. The goal here is not to optimize the search for a particular article, but rather to find all articles that either use or write about a particular software package. Counts of these articles are then used to estimate the market share of each package. Note that unlike job advertisements, scholarly articles nearly all contain data science terms. Therefore, adding such terms to the search strings doesn’t help focus the search. The results are displayed and discussed in The Popularity of Data Science Software.
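As a rough illustration of how counts can be turned into shares, the sketch below divides each package's article count by the total. This is one simple reading of "market share", not necessarily the calculation behind the published results, and the package names and counts are hypothetical placeholders.

```python
# Hypothetical article counts; the real numbers come from the searches
# described in this article and are not reproduced here.
counts = {"Package A": 6000, "Package B": 3000, "Package C": 1000}


def shares(article_counts):
    """Return each package's fraction of the total article count."""
    total = sum(article_counts.values())
    if total == 0:
        raise ValueError("no articles counted")
    return {name: n / total for name, n in article_counts.items()}


# Print the packages from largest to smallest share.
for name, share in sorted(shares(counts).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```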

Overview

Here are the steps I use to search at Scholar.Google.com in brief:

1. Regarding logic, use a minus sign immediately before a search term, with no blank between them, to exclude it, such as -“SPSS Modeler” when counting only SPSS itself. Google uses a blank or the capitalized “AND” to represent logical “AND”. To perform logical “or,” you must type “OR” in capital letters, or Google will search for the word “or”! Parentheses prioritize the order of the logic as usual.

2. Regarding special characters, Google Scholar ignores them. For example, “H2O.ai” is the name of the company that supports the data science software H2O. But Google will ignore the period and find references to water and artificial intelligence that have nothing to do with that company.

3. For software with non-ambiguous names, simply search on their names in quotes. Often the quotes are not needed, but it can be difficult to determine when they are. For example, Tibco’s Spotfire program has a unique name, but Google Scholar considers articles about firefighting that include the separate terms “spot fire” to be equivalent to “spotfire” unless you enclose the name in quotes.

4. Scholarly papers are supposed to cite the software they use in a standard way. For example, the use of SAS should reference the vendor, “SAS Institute”. To verify how well such a citation works, it’s good to search for its opposite. For example, the search string “SPSS” -“SPSS, Inc.” excludes the vendor but still finds hits relevant to the SPSS Statistics product. In contrast, the results of a similar search for SAS, “SAS” -“SAS Institute”, consist mainly of irrelevant hits, including many authors whose first or last name happens to be “Sas”. The initials “SAS” are also the equivalent of “Inc.” in Spanish.

Searching by vendor name is often helpful, but authors don’t always cite their software. Surprisingly, authors occasionally cite the vendor but not the package. That is the case for Statsoft’s Statistica. Statistica means “statistics” in Italian, perhaps leading authors to use statements like, “we used the statistics package from Statsoft.” That software was sold to Dell, then Tibco. Those companies do far more than sell Statistica, making a search on their names alone irrelevant.

5. Some software has add-on packages that are so well known that the main package may not be mentioned at all. For example, R’s ggplot2 package may be cited without reporting that R itself was used.

6. Some software names, especially FICO, Java, Python, SAS, and Scala, are also common names for people, animals, geographic locations, or various other things. Authors can be excluded from Google Scholar searches with “-author:java”. Unfortunately, this exclusion applies only to the author field, not to the references at the end of a paper. That means the counts for these packages are inflated unless the search string explicitly excludes that possibility.

7. General-purpose programming languages are cited most often for tasks unrelated to data science. Adding inclusion terms helps focus the search. Examples include categories such as “machine learning” and specific methods such as “regression analysis.” These search terms were added only in English, so the results underestimate the true data science usage of this type of software.

To make the comparisons among packages most equitable, it would be ideal to include the same set of inclusion terms for all the software studied here. However, that would mean undercounting the use of the special-purpose software, which I prefer to avoid. As a simplified example, I don’t search for “SAS and regression” but I do search for “Java and regression”. The actual, more complex searches are below.
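The rules above can be sketched as a small Python helper that assembles a search string. This is only an illustration, not the tooling used for this article: the function names `build_query` and `opposite_query` and their parameters are made up for the example, and the resulting strings still have to be pasted into Google Scholar by hand.

```python
def quote(term):
    """Wrap a term in double quotes so it is searched as an exact phrase."""
    return f'"{term}"'


def build_query(name, any_of=(), exclude=(), exclude_authors=(), quote_name=True):
    """Assemble a search string from a name plus optional filters (hypothetical helper)."""
    parts = [quote(name) if quote_name else name]
    # A minus sign directly before a term, with no blank, excludes it.
    parts += [f"-{quote(t)}" for t in exclude]
    # Keep people with that name out of the author field.
    parts += [f"-author:{a}" for a in exclude_authors]
    if any_of:
        # OR must be in capital letters, or Google searches for the word "or".
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    # A blank between the parts acts as logical AND.
    return " ".join(parts)


def opposite_query(name, citation):
    """Find hits that mention the name without the expected vendor citation."""
    return f"{quote(name)} -{quote(citation)}"


print(build_query("SPSS", exclude=["SPSS Modeler", "Amos"]))
print(build_query("java", any_of=["machine learning", "regression analysis"],
                  exclude_authors=["java"], quote_name=False))
print(opposite_query("SAS", "SAS Institute"))
```

The inclusion terms are passed only for general-purpose languages such as Java. For special-purpose packages the `any_of` argument is simply left empty, which mirrors the choice described above.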

Software Names and Their Search Terms

Here is a list of the search terms I used for each piece of software.

Actuate: "Actuate BIRT"
[Purchased by OpenText in 2015, dropped coverage]

Alpine: "Alpine Data Labs"
[Acquired by Tibco in 2017; dropped coverage]

Altair: "Altair Knowledge Seeker" OR "Altair Knowledge Works"

Alteryx: "Alteryx"

Anaconda: Anaconda AND ("statistical analysis" OR "t test" OR "regression analysis" OR "quantitative analysis" OR "data analytics" OR "machine learning" OR "artificial intelligence" OR "analysis of variance" OR "anova" OR "chi square" OR "data mining")

Angoss: "Angoss"
[Acquired by Datawatch in 2018, now called Altair Knowledge Works]

Amazon:
"Amazon Machine Learning" OR "Amazon ML" OR "Amazon SageMaker" OR "AWS Deep Learning"

Apache Hadoop: "Hadoop"

Apache Mahout: Mahout -elephant -elefantes

Apache MXNet: "MXNet"
[adding terms like "deep learning" finds *very* few irrelevant articles]

Apache Pig: "Apache Pig"

Apache Spark: "Apache Spark"

"Bayesialab"

"BigML"

"BlueSky Statistics"

BMDP:
"BMDP" "Statistical Software"

CAFFE:
"Caffe" AND ("machine learning" OR "artificial intelligence" OR "data mining" OR "neural networks" OR "deep learning")
[Unlike MXNet, caffe is a common name as you can see with:
"caffe" -"machine learning" -"artificial intelligence" -"data mining" -"neural networks" -"deep learning"]

C++ / C#:
("C++" OR "C#")+("statistical analysis" OR
"t test" OR "regression analysis" OR
"quantitative analysis" OR
"data analytics" OR "machine learning" OR
"artificial intelligence" OR
"analysis of variance" OR "anova" OR
"chi square" OR "data mining")

"Civis Analytics"

"Cloudera"

"Databricks"

"Dataiku"

"DataRobot"

"DataScience.com"

"Datawatch" renamed Altair Knowledge Works in 2019

Dato or GraphLab
(Dropped due to acquisition by Apple)

Deducer GUI for R by Ian Fellows:
"Deducer" + "Fellows"
[Without quotes, "fellowship" will count]

Deeplearning4j: "Deeplearning4j"

Domino Data Lab:
"Domino Data Lab" OR "dominodatalab.com"

FICO: [Not tracking this as I've found it impossible to
separate the data science from the credit checking even
using inclusion factors.]

FORTRAN:
FORTRAN ("statistical analysis" OR "t test" OR "regression analysis"
OR "quantitative analysis" OR "data analytics" OR "machine learning"
OR "artificial intelligence" OR "analysis of variance")

GraphPad Prism: GraphPad

GNU Octave:
"GNU Octave" ("statistical analysis" OR "t test" OR "regression analysis"
OR "quantitative analysis" OR "data analytics" OR "machine learning"
OR "artificial intelligence" OR "analysis of variance" OR "anova"
OR "chi square" OR "data mining")

H2O:
"H2O.ai" OR "Oxdata H2O" OR "H2O deep learning" OR "H2O machine learning"

Hadoop: "Hadoop"

Google:
"Google Cloud Machine Learning" OR "Google Cloud AutoML" OR "Cloud Dataproc" OR "Cloud Datalab"
[Compare to job search terms, which are more broad;
consider updating.]

IBM SPSS:
SPSS -"SPSS Modeler" -"Amos"
[The letters "SPSS" also stand for a few other rare topics, which I estimate inflate the count by only 0.28%.]

IBM SPSS Modeler:
"SPSS Modeler"

"IBM Watson"

Infocentricity: "Infocentricity"
[They were purchased by Fico, so I'm dropping coverage]
[Note: do not add "OR Xeno", which is useful when
searching for jobs. It adds a great deal of ambiguity
to this type of search.]

jamovi: [minus signs remove languages where jamovi means jam]
"jamovi" -"jamova" -"za"

This approach will still undercount due to many non-English language papers:
"jamovi project" OR "jamovi program" OR "jamovi package" OR "jamovi software" OR "jamovi version" OR "jamovi.org" OR "jamovi stat" OR "stats with jamovi" OR "jamovi 0." OR "jamovi 2" OR "jamovi v" OR "software jamovi" OR "package jamovi" OR "statistikprogrammet jamovi" OR "statistikprogramm jamovi"

JASP:
"jasp-stats.org" OR "jasp team" OR "JASP version"

Java:
java -author:java -weka -"Practical Machine Learning" -indonesian
("statistical analysis" OR "t test" OR "regression analysis"
OR "quantitative analysis" OR "data analytics" OR "machine learning"
OR "artificial intelligence" OR "analysis of variance")

JMP: "JMP" AND "SAS Institute"
[If you leave "SAS Institute" off, you'll add over 10,000 hits for Unicef's Joint Monitoring Programme (JMP), the JMP-134 bacterium, JMP The Label swimwear, JMP amplifiers, etc.]

Julia: [I had forgotten to add the OR's to this in earlier years, making it incomparable to MATLAB, etc.]
("Julia: A Fast Dynamic Language for Technical Computing" OR "julialang")
+ ("statistical analysis" OR "t test" OR
"regression analysis" OR "quantitative analysis" OR
"data analytics" OR "machine learning" OR
"artificial intelligence" OR "analysis of variance" OR
"anova" OR "chi square" OR "data mining")


Keras:
"keras" + ("machine learning" OR "artificial intelligence" OR "data mining" OR "neural networks" OR "deep learning")

KNIME: "KNIME"

KXEN: "KXEN"
(Bought by SAP, no longer tracking)

Lasagne:
Lasagne + ("machine learning" OR "artificial intelligence" OR "data mining" OR "neural networks" OR "deep learning")

Lavastorm: "Lavastorm"

Mathematica:
"Mathematica" ("statistical analysis" OR "t test" OR "regression analysis"
OR "quantitative analysis" OR "data analytics" OR "machine learning"
OR "artificial intelligence" OR "analysis of variance" OR "anova"
OR "chi square" OR "data mining")

MATLAB:
"MATLAB" ("statistical analysis" OR "t test" OR
"regression analysis" OR "quantitative analysis" OR
"data analytics" OR "machine learning" OR
"artificial intelligence" OR "analysis of variance" OR
"anova" OR "chi square" OR "data mining")

Megaputer: "Megaputer" OR "Polyanalyst"

Microsoft Cognitive Toolkit:
Microsoft + ("CNTK" OR "Cognitive Toolkit")
[Compare to job search terms, which are more broad;
consider updating.]

Microsoft Azure Machine Learning:
"Azure Machine Learning" OR "Azure ML"
[Compare to job search terms, which are more broad;
consider updating.]

Minitab: "Minitab"

MLlib: "MLlib"

NCSS: "Number Cruncher Statistical System"
[Cannot use "NCSS" for this as it stands for over
15 organizations]

OpenText (has many products unrelated to data science)
"OpenText" ("statistical analysis" OR "t test" OR "regression analysis" OR "quantitative analysis" OR "data analytics" OR "machine learning" OR "artificial intelligence" OR "analysis of variance" OR "anova" OR "chi square" OR "data mining")

Origin Pro:
"originlab" ("statistical analysis" OR "t test" OR "regression analysis"
OR "quantitative analysis" OR "data analytics" OR "machine learning"
OR "artificial intelligence" OR "analysis of variance" OR "anova"
OR "chi square" OR "data mining")

PAST: (Added 12/30/2023 so no data yet)
"PAST paleontological statistics software"

Pentaho: "Pentaho"
[Not tracking for now. Too many pieces that are not data science]

Prognoz: "Prognoz Platform"
[Note: Prognoz means forecast in Polish.]
[Dropped this in 2018 when count=2]

Python:
python -author:python -snake
("statistical analysis" OR "t test" OR
"regression analysis" OR
"quantitative analysis" OR "data analytics" OR
"machine learning" OR "artificial intelligence" OR
"analysis of variance" OR "anova" OR "chi square" OR
"data mining")

"PyTorch"

R:
"the R software" OR "the R project" OR "r-project.org" OR
"R development core" OR "bioconductor" OR "lme4" OR "nlme" OR
"lmeR function" OR "ggplot2" OR "Hmisc" OR "r function" OR
"r package" OR "mass package" OR "plyr package" OR "hmisc" OR "mvtnorm"
[Note: replacing plyr package with dplyr or tidyverse gets fewer hits]

R AnalyticFlow: "R AnalyticFlow"

R Commander GUI for R: "R Commander"

R-Instat: "R-Instat"

RapidMiner: "RapidMiner"

Rattle:
"Rattle: A Data Mining GUI" OR "Rattle GUI" OR "Rattle package"

Revolution Analytics: "Revolution Analytics"
[Note: Merged with Microsoft so keywords are uncertain.]

"RKWard"

Salford Systems: "Salford Systems"
(Bought by Minitab in March, 2017)

SAP:
"SAP Predictive Analytics" OR "SAP Automated Modeler" OR "SAP Leonardo Machine Learning" OR "SAP Hana"

SAS: "SAS Institute" -JMP -"Enterprise Miner"
[Note: This undercounts SAS slightly, but I haven't found a way around
the problem given that "Sas" is a popular first and last name for authors.
Also, in Spanish, "S.A.S." is the equivalent of "Inc." in English
(Sociedad por acciones simplificadas).]

SAS Enterprise Miner: "Enterprise Miner"

Scala: "Scala language" OR "language Scala" + spark

"Scikit Learn"

"Splunk"

SQL: SQL ("statistical analysis" OR "t test" OR "regression analysis" OR "quantitative analysis" OR "data analytics" OR "machine learning" OR "artificial intelligence" OR "analysis of variance" OR "anova" OR "chi square" OR "data mining")

Stata:
("stata" "college station") OR "StataCorp" OR "Stata Corp"
OR "Stata Journal" OR "Stata Press" OR "stata command"
OR "stata module" 
[Note: "stata" is a common word in Italian (as in "è stata", "was"), so by
itself, you'll get a greatly inflated number of hits!]

Statgraphics: "Statgraphics"

Statistica:
"Statistica" AND (Statsoft OR Dell OR Tibco OR "Quest Software" OR "Francisco Partners")
(This software has changed vendors a LOT lately!)

Systat: "Systat"

Tableau: (not tracking at the moment; too light on advanced analytics)
"Tableau Software" OR "Tableau Desktop" OR "Tableau Online"
OR "Tableau Server"
[Don't include "Tableau Public", it's a common French term.]

Tensorflow: "Tensorflow"

Theano:
"theano" + ("machine learning" OR "artificial intelligence" OR "data mining" OR "neural networks" OR "deep learning")

Tibco Spotfire:
Tibco + "Spotfire" -fire -burn
(not tracking recently; lacks advanced analytics)

Vowpal Wabbit: "Vowpal Wabbit"

WEKA:
WEKA ("machine learning" OR "data mining" OR "artificial intelligence")
[Note: The following search string used in previous
years (before March, 2015) under-counted.]

"WEKA Data Mining" OR "Waikato Environment for Knowledge Analysis"

World Programming: "WPS Analytics" ---not tested yet!
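Search strings as long as some of those above are easy to mistype, and Google Scholar will not complain about a stray quote or a lower-case "or". The sketch below is a hypothetical pre-flight check, not part of the original workflow. It looks only for the mistakes described in the Overview: unmatched quotes, unbalanced parentheses, a lower-case "or" outside quotes, and a blank after a minus sign.

```python
import re


def check_query(query):
    """Return a list of problems found in a search string (empty if none)."""
    if query.count('"') % 2:
        # With an unmatched quote the remaining checks are unreliable.
        return ["odd number of double quotes"]
    problems = []
    # Drop quoted phrases so their contents are not mistaken for operators.
    bare = re.sub(r'"[^"]*"', " ", query)
    depth = 0
    for ch in bare:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break  # a closing parenthesis with no opener
    if depth != 0:
        problems.append("unbalanced parentheses")
    if re.search(r"\b[oO]r\b", bare):
        problems.append('lower-case "or" is searched as a word; use OR')
    if re.search(r"(^|\s)-\s", bare):
        problems.append("blank after a minus sign")
    return problems


print(check_query('"Megaputer" OR "Polyanalyst"'))
print(check_query('WEKA ("machine learning" OR "data mining"'))
```

An empty list means none of these particular mistakes were found. It says nothing about whether the search is well focused.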

Inclusion Terms

While many of the packages are focused on data science, the more general-purpose ones (C++, C#, Java, and Python) are not. To determine the best way to focus those searches, I compiled a list of relevant terms commonly used in scholarly papers, then searched for documents that included them, one term at a time. I counted the number of documents for each term and tracked how likely each term was to yield a relevant hit. The latter was done using the time-honored “I know it when I see it” approach. (I’m pretty familiar with advanced text analytics, but I don’t have time to extract all the data and do it.) The items marked with a “*” below show the terms used. These counts were collected on 5/11/2014, but given that I was searching across all years, the prevalence of the various terms is likely to shift slowly as time passes.

Search Terms            Number of Articles
Survey (not well focused)        5,300,000
Statistical (not well focused)   4,860,000
Statistics (not well focused)    4,770,000
Statistical analysis *           3,670,000
t test *                         3,480,000
Regression analysis *            2,920,000
Linear regression                2,650,000
Quantitative analysis *          2,570,000
Data analytics *                 2,380,000
Machine learning *               1,740,000
Artificial intelligence *        1,720,000
Analysis of variance             1,570,000
Chi square                       1,490,000
ANOVA                            1,340,000
Survey research                  1,230,000
Data mining *                    1,210,000
Statistical software *           1,120,000
Logistic regression              1,080,000
Nonparametric                      800,000
Analytics (not well focused)       519,000
Statistical package                347,000
Decision trees                     169,000
Business intelligence              146,000
Statistical modeling *             145,000
Analyze data (not well focused)    125,000
Big data *                          51,700
Predictive modeling *               39,400
Predictive analytics *               9,540
Business analytics *                 7,660
Advanced analytics                   3,700
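To show how a table like this feeds the search strings, here is a minimal sketch that keeps only the terms flagged as used and joins them into a parenthesized OR clause. The `rows` list copies just a few rows from the table for illustration, and the function name `inclusion_clause` is an assumption for this example. The relevance judgment itself was made by hand and is represented here by a simple True/False flag.

```python
# (term, article count, used as an inclusion term?) for a few rows of the table.
rows = [
    ("Survey", 5_300_000, False),              # not well focused
    ("Statistical analysis", 3_670_000, True),
    ("t test", 3_480_000, True),
    ("Regression analysis", 2_920_000, True),
    ("Linear regression", 2_650_000, False),   # not marked with "*"
]


def inclusion_clause(candidates):
    """Join the terms flagged as used into one quoted OR group."""
    chosen = [term.lower() for term, _count, used in candidates if used]
    return "(" + " OR ".join(f'"{t}"' for t in chosen) + ")"


print(inclusion_clause(rows))
# ("statistical analysis" OR "t test" OR "regression analysis")
```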

I’m very interested in improving this methodology, so if you have ideas, please comment below or write me at muenchen.bob@gmail.com.

20 thoughts on “How to Search for Data Science Articles”

  1. I’m not sure how useful this may be but I recently discovered the Open Knowledge Maps project and their code is all open source – https://openknowledgemaps.org/ One thing I had wanted to look at is identifying programming language or tool prevalence within topic networks. Anyways, just throwing that out there in case it is helpful! Your articles have been so insightful!

  2. Hello Robert,

    Muchas Gracias Mi Amigo! You make learning so effortless. Anyone can follow you and I would not mind following you to the moon coz I know you are like my north star.

    As I need to apply color in Bar graph based on below mentioned condition,
    bar BOP = blue
    – bar In = red
    – bar Out = green
    – bar EOP = Green if EOPBOP
    Note : BOP, EOP, IN and OUT are values of one ‘Trend’ column.
    The Spotfire Server is the primary part of the Spotfire environment, to which all Spotfire clients join. Multiple connections are installed and attached to Spotfire Server. The Spotfire Web Player service and Spotfire Automation Services are installed on nodes to enable the usage of Spotfire web clients and the running of Spotfire Automation Services jobs.
    On X Axis we need to take ‘Trend’ column and on Y Axis we need to take uniquecount(Customer Number) column.
    But great job man, do keep posted with the new updates.

    Regards,
    Kevin

    1. Hi Kevin Lee,

      I’m glad you found my site useful. I get piles of email and don’t have time to do problem solving, so I recommend posting your question on Q&A sites like StackOverflow.com.

      Cheers,
      Bob
