by Robert A. Muenchen
This article describes the technical details of how to search for jobs in the field of data science. The results of the searches are displayed and discussed in The Popularity of Data Science Software. The protocols were implemented 2/27/2017 and they are significantly different from the previous set posted on 2/20/2014.
Data Science Terms
Some software used for data science is also used for a wide range of other tasks. Let’s consider a few examples. General-purpose languages, such as C, Java, or Python, are used heavily for some data science tasks, but if you do a job search on just their names, the great majority of jobs found will not be for data science. Other software, such as Cognos, SAS, and Tableau, is very popular for simple report writing as well as for data science jobs, so simple searches will find a blend of both types of jobs. Finally, some software, such as Apache Spark, SPSS, or Stata, is very specific to data science. With such a mix of software, the challenge is to use search terms that will yield values that are comparable across all types of software.
To compile a list of search terms that are specific to data science jobs, I started out searching for jobs that required software that is used specifically for data science. I then looked for terms that often appeared in those job descriptions. Next, I searched for jobs that featured only those terms, one at a time. Some, such as “analytics”, resulted in searches that were not well focused; jobs that had nothing to do with data science would appear. Others, such as “econometrics”, did indeed focus on data science jobs, but only in the field of economics. As I worked my way through these searches, I found more search terms to test. The results are shown in Table 1.
|Search Terms|Jobs Found|
|Analytics (not well focused)|123,895|
|Survey (not well focused)|72,323|
|Statistics (not well focused)|66,201|
|Statistical (not well focused)|55,998|
|Analyze data (not well focused)|20,068|
|Business intelligence (too much reporting)|19,709|
|Business analytics *|4,043|
|Research associate (too vague)|3,794|
|Econometrics (too focused)|1,860|
Table 1. Terms used in data science job descriptions
Ideally, one could include all the focused terms in a search, but Indeed.com’s search feature limits the size of the search string. To determine the maximum string size, I entered the longest software search string and then added the data science terms. The search truncated the data science terms, which showed where the limit fell. Table 2 shows the resulting set of search terms that I appended to each software title. For example, when searching for Java, I would enter: Java and (“big data” or “data analytics” or … “statistician”).
and ("big data" or "data analytics" or "machine learning" or "statistical analysis" or "data mining" or "data science" or "quantitative analysis" or "business analytics" or "advanced analytics" or "data scientist" or "statistical software" or "predictive analytics" or "artificial intelligence" or "predictive modeling" or "statistical modeling" or "quantitative research" or "research analyst" or "statistical tools" or "statistician")
Table 2. The data science terms and logic that are appended to every job search.
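The way the Table 2 clause is appended to a software name can be sketched in a few lines of Python. This is only an illustration of the string construction described above; the function names `terms_clause` and `build_query` are hypothetical, and the searches themselves were entered by hand on Indeed.com.

```python
# Data science terms from Table 2, in their original order.
DATA_SCIENCE_TERMS = [
    "big data", "data analytics", "machine learning", "statistical analysis",
    "data mining", "data science", "quantitative analysis",
    "business analytics", "advanced analytics", "data scientist",
    "statistical software", "predictive analytics", "artificial intelligence",
    "predictive modeling", "statistical modeling", "quantitative research",
    "research analyst", "statistical tools", "statistician",
]


def terms_clause(terms):
    """Quote each term and join them with 'or' inside parentheses."""
    return "(" + " or ".join(f'"{t}"' for t in terms) + ")"


def build_query(software, terms=DATA_SCIENCE_TERMS):
    """Append the data science clause to a software search expression."""
    return f"{software} and {terms_clause(terms)}"


# Example: the Java search described in the text.
print(build_query("Java"))
```

Keeping the terms in a list rather than in one long string makes it easy to see which terms are in play and to reuse the same clause for every software title.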
Some software offered additional challenges. Those with single-letter names, C and R, were found using spaces before and after their names, such as (" R " or " R,"). This isn’t a perfect solution, since it would count an advertisement for a data scientist skilled in SAS at the “Toys R Us” company as a job for someone with R skills. Conversely, an advertisement for an R programmer at the SAS Shoe company would also be counted as one for a SAS programmer. Many of these searches have flaws like that, but the limit on search string length restricts how precise they can be. However, if you look through the resulting job advertisements, you’ll see that errors with this search approach are rare.
When advertisements list the C language, it’s most often in the form of “C, C++, or C#”, so no attempt was made to differentiate those variants. However, Objective C was usually advertised for iPad or iPhone application development, so it was excluded.
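The false positive described above is easy to reproduce with plain substring matching. The sketch below is an approximation for illustration only: Indeed.com's actual matching rules are not documented here, and the function `mentions_r` and the sample advertisements are made up.

```python
# The two patterns used for the R search: " R " and " R,".
R_PATTERNS = (" R ", " R,")


def mentions_r(ad_text):
    """Return True if the text contains either R pattern as a substring."""
    return any(pattern in ad_text for pattern in R_PATTERNS)


# Hypothetical advertisement snippets, invented for this example.
ads = [
    "Statistician with experience in R, SAS, and SQL",
    "Data scientist skilled in SAS wanted at Toys R Us headquarters",
    "Report writer with Cognos and Tableau experience",
]

for ad in ads:
    print(mentions_r(ad), "-", ad)
```

The second snippet matches even though the job does not call for R skills, which is the kind of error the search cannot rule out.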
Microsoft presented another challenge. Just its name combined with the data science terms yielded results that were heavily biased by the inclusion of general-purpose tools such as Microsoft SQL Server. Focusing the search with (“Azure Machine” or “Azure Stream” or “Microsoft R” or “Cortana Intelligence” or “Microsoft Cognitive” or CNTK) used up so much space that two of the data science terms had to be dropped: “statistical tools” and “statistician”.
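The trade-off in the Microsoft search, where a longer software clause forces terms off the end of the list, can be sketched as follows. The real length limit of Indeed.com's search box is not stated in this article, so `HYPOTHETICAL_LIMIT` is a made-up value and `fit_query` is a hypothetical helper, not a description of how the site truncates input.

```python
# Data science terms from Table 2, in their original order.
DATA_SCIENCE_TERMS = [
    "big data", "data analytics", "machine learning", "statistical analysis",
    "data mining", "data science", "quantitative analysis",
    "business analytics", "advanced analytics", "data scientist",
    "statistical software", "predictive analytics", "artificial intelligence",
    "predictive modeling", "statistical modeling", "quantitative research",
    "research analyst", "statistical tools", "statistician",
]

# The software clause used for the Microsoft search.
MICROSOFT_CLAUSE = (
    '("Azure Machine" or "Azure Stream" or "Microsoft R" or '
    '"Cortana Intelligence" or "Microsoft Cognitive" or CNTK)'
)


def fit_query(software, terms, max_chars):
    """Drop terms from the end of the list until the query fits max_chars.

    Returns a tuple of (query, dropped_terms).
    """
    terms = list(terms)
    for keep in range(len(terms), 0, -1):
        clause = " or ".join(f'"{t}"' for t in terms[:keep])
        query = f"{software} and ({clause})"
        if len(query) <= max_chars:
            return query, terms[keep:]
    raise ValueError("no data science terms fit within the length limit")


# Assumption for illustration only; the actual limit is not given in the text.
HYPOTHETICAL_LIMIT = 550

query, dropped = fit_query(MICROSOFT_CLAUSE, DATA_SCIENCE_TERMS, HYPOTHETICAL_LIMIT)
print(len(query), "characters; dropped terms:", dropped)
```

Dropping from the end of the list mirrors what happened in practice: the last terms in Table 2 were the ones sacrificed.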
Another challenging search was for Domino Data Labs’ Data Science Platform. The search (Domino and “Data Science Platform”) found no jobs, not even those from the company itself! Just the term “Domino” along with the data science terms found mostly job descriptions that mentioned Lotus Domino. For the 2017 search, I simply culled the small number of results by hand.
Similarly, the search for Alpine and the data science terms yielded hits that were mostly irrelevant, so I culled them manually.
The Search Terms
Table 3 shows the search terms used for each software package. See Table 2 for the data science terms that were appended to every search except Microsoft, whose complete search is shown below.
Alpine and data sci terms (then read them, most are invalid)
Alteryx and data sci terms
"Amazon Machine Learning" and data sci terms
Angoss and data sci terms
Apache Flink: Flink and data sci terms (drop Flink next year; not general enough)
Apache Hadoop: Hadoop and data sci terms
Apache Pig: Pig and data sci terms
Apache Spark: Spark and data sci terms
"Azure Machine Learning" and data sci terms
BMDP and data sci terms
("C programmer" or "C programming" or "C developer" or "C++" or "C#") and !("objective c") and data sci terms
Caffe and data sci terms
Dataiku and data sci terms
Domino Data Labs: "domino data" and data sci terms
"Enterprise Miner" and data sci terms
FICO and data sci terms
(Infocentricity or Xeno) and data sci terms
H2O: "H2O" and data science terms
JMP and data sci terms
Julia and data sci terms
KNIME and data sci terms
Lavastorm and data sci terms
MATLAB and data sci terms
(Megaputer or Polyanalyst) and data sci terms
Minitab and data sci terms
Microsoft: ("Azure Machine" or "Azure Stream" or "Microsoft R" or "Cortana Intelligence" or "Microsoft Cognitive" or CNTK) and and ("big data" or "data analytics" or "machine learning" or "statistical analysis" or "data mining" or "data science" or "quantitative analysis" or "business analytics" or "advanced analytics" or "data scientist" or "statistical software" or "predictive analytics" or "artificial intelligence" or "predictive modeling" or "statistical modeling" or "quantiative research" or "research analyst") so it's missing: or "statistical tools" or "statistician"
NCSS and data sci terms
Salford and (SPM or CART or MARS or TreeNet or RandomForests or GPS or RuleLearner or ISLE) and data sci terms (leaving off the latter didn't help at all in 2017)
R: (" R " or " R,") and data sci terms
RapidMiner + data sci terms
SAP and data sci terms
SAS !"Enterprise Miner" and data sci terms
Scala: "Scala" and data sci terms
"Splunk" Spotfire and data sci terms
SPSS: SPSS and !"SPSS Modeler" and data sci terms
"SPSS Modeler" and data sci terms
Stata and data sci terms
Statgraphics and data sci terms
Statistica and data sci terms
Systat and data sci terms
Tableau and data sci terms
Tensorflow and data sci terms
Tibco: "Tibco Spotfire" OR "Tibco TERR" OR "Tibco Enterprise" and data sci terms
(WEKA or Pentaho) and data sci terms
Table 3. Search terms used for each software package (see Table 2 for data science terms).
Searching for Trends
Indeed.com has a Job Trends tool that lets you see how jobs have changed over the last several years. You can enter one or more searches from the examples above to see the trends. Unfortunately, the search for trends must be much simpler than Indeed.com’s main job search. The best pair of queries I could get to compare R and SAS is:
R and ("big data" or "data analytics" or "machine learning" or "statistical analysis" or "data mining" or "data science")
SAS and ("big data" or "data analytics" or "machine learning" or "statistical analysis" or "data mining" or "data science")
Now that you’ve got the details, check out the results here. I’m very interested in improving this methodology so if you have ideas, please comment below or send me email at firstname.lastname@example.org.