I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…
Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution R Enterprise, or SAS vs. the World Programming System (WPS) or Carolina, except when it is particularly easy such as comparing the company PageRank figures.
These packages are all included in the first section on jobs, but later sections are older (each contains a date) and do not cover as extensive a set of software. I’ll add those as I can and announce the changes on Twitter where you can follow me as @BobMuenchen.
When choosing a tool for data analysis, now more broadly referred to as analytics, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste into a word processor vs. LaTeX integration)? Does it handle large enough data sets? Do your colleagues use it so you can easily share data and programs? Can you afford it?
There are many ways to measure popularity or market share and each has its advantages and disadvantages. Here they are, in approximate order of usefulness:
- Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
- Published Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
- Books – the number of books that include a software package’s name in their titles is particularly useful information, since writing a book requires significant effort and publishers do their own study of market share before taking the risk of publishing. However, for general-purpose languages it can be difficult to search for only those books that focus on analytics.
- Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. What makes this measure particularly easy to gather is that consolidators like Tal Galili have created blog consolidation sites like R-Bloggers.com which make it easy to count the blogs. Previously that had been a difficult task.
- Web Site Popularity – how does Google provide the most popular search results at the top of its response to your queries? A major component of that answer comes from the total number of web pages that point to any given web site. That’s known as a site’s PageRank. This is objective data, and for sites that clearly focus on analytics, it’s unbiased. However, for general-purpose software like Java, many sites that discuss programming point to http://www.java.com, and probably fewer that discuss analytics point to it as well. But it may be impractical to tell which is which.
- Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling” in which the survey taker tries to widely distribute the link and then vendors vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the information content is often low, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires but a few mouse clicks, rather than the commitment required to place an advertisement or publish an article.
- Programming Activity – some software development is focused into repositories such as GitHub. That allows people to count the number of lines of programming code written for each project in a given time period. This is an excellent measure of popularity since writing programs or changing them requires substantial commitment.
- Discussion Forums – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others.
- Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
- IT Research Firms – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software is now and where they’re headed.
- Sales or Download Measures – the commercial analytics field has undergone a major merger and acquisition phase so that now it is hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role and even the few packages that offer download figures are dicey at best.
- Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t track this measure and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.
One of the best ways to measure the popularity or market share of software for analytics is to count the number of job advertisements for each. Indeed.com is the biggest job site in the U.S., making its sample the most representative of the current job market. As their CEO and co-founder Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” To demonstrate just how dominant its lead is, a search for SPSS (on 2/19/14) showed more than ten times as many jobs on Indeed.com as on its well-known competitor, Monster.com. Indeed.com also has superb search capabilities and it even includes a tool for tracking long-term trends.
Searching for analytics jobs using Indeed.com can be easy, but it can also be very tricky. For many of the analytics packages, it required only a simple search on the name. However, for software that’s hard to locate (e.g. R) or that is general purpose (e.g. Java), it required complex searches and/or some rather tricky calculations, which are described here. All of the graphs in this section use those procedures to make the required queries.
Figure 1a shows that Java and SAS are in a league of their own, with around 50% more analytics jobs than Python or C/C++/C# and twice as many as R. (The three aforementioned C variants are combined in a single search since job advertisements usually seek any of them.) Python and C/C++/C# come next at an almost identical level of popularity. That’s not too surprising as many advertisements for analytics jobs that use programming mention both together.
R resides in an interestingly large gap between the other domain-specific languages, SAS and SPSS. This is the first estimate I’ve done that shows that the job market for R has not only caught up with SPSS, but surpassed it by close to double the number of job postings. I knew my previous estimates for R jobs were low, but I had not yet thought of a better way to estimate the total. From SPSS on down, there’s a smooth decline. Enterprise Miner is the only data-mining-specific software to make the cutoff of at least 100 jobs. If I plotted all the software below that point, they would all pile up on the y-axis, appearing to have almost no jobs. Relatively speaking, they don’t!
Software packages that did not make that cut and are not displayed on the graph are: Alteryx (68), Statistica (67), RapidMiner (38), SPSS Modeler (36), KXEN (28), KNIME (26), Julia (15), Statgraphics (11), Systat (10), BMDP (8), Angoss (6), Lavastorm (5), NCSS (4), Salford SPM etc. (3), Teradata Miner (2) and Oracle Data Mining (2).
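To make the cutoff concrete, here is a small Python sketch that tallies the counts listed above. The numbers are the ones given in this section; the names `BELOW_CUTOFF`, `CUTOFF` and `share_of_cutoff` are my own and purely illustrative.

```python
# Job counts for the packages that fell below the 100-job cutoff,
# copied from the list above.
BELOW_CUTOFF = {
    "Alteryx": 68, "Statistica": 67, "RapidMiner": 38, "SPSS Modeler": 36,
    "KXEN": 28, "KNIME": 26, "Julia": 15, "Statgraphics": 11, "Systat": 10,
    "BMDP": 8, "Angoss": 6, "Lavastorm": 5, "NCSS": 4, "Salford SPM etc.": 3,
    "Teradata Miner": 2, "Oracle Data Mining": 2,
}
CUTOFF = 100  # minimum number of jobs needed to appear on the graph


def share_of_cutoff(count, cutoff=CUTOFF):
    """Return a job count as a fraction of the plotting cutoff."""
    return count / cutoff


total_hidden = sum(BELOW_CUTOFF.values())

# Print the packages from most to fewest jobs.
for name, count in sorted(BELOW_CUTOFF.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {count:3d}  ({share_of_cutoff(count):.0%} of cutoff)")
print(f"{len(BELOW_CUTOFF)} packages, {total_hidden} jobs in total")
```

Even the largest of these, Alteryx, reaches only about two thirds of the cutoff, which is why plotting them next to the leaders would add little.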
It’s important to note that the values shown in Figure 1a are single points in time. The number of jobs for the more popular software does not change much from day to day, but each software has an overall trend that shows how the demand for jobs changes across the years. You can plot such trends using Indeed.com’s Job Trends tool. However, as before, focusing just on analytics jobs requires carefully constructed queries, and comparing two trends at a time means they both have to fit in the same query, within the length limit allowed by Indeed.com. Those details are described here.
I’m particularly interested in trends involving R, so let’s look at a couple of comparisons. Figure 1b compares the number of analytics jobs available for R and SPSS across time. Analytics jobs for SPSS have not changed much over the years, while those for R have been steadily increasing. The jobs for R finally crossed over and exceeded those for SPSS toward the middle of 2012.
We know from Figure 1a that SAS is still far ahead of R in analytics job postings. How far does R have to go to catch up with SAS? Figure 1c provides one perspective. It would be nice to have the data to forecast when R’s growth curve will catch up with SAS’s, but Indeed.com does not provide the raw data. However, we can use the approximate slope of each line to get a rough estimate. If jobs for SAS stay level and those for R continue to grow linearly as they have since January 2010, then R will catch up in 3.35 years. If instead the decline in demand for SAS jobs that started in January of 2012 continues, then R will catch up in 1.87 years.
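The arithmetic behind a slope-based estimate like this is simple: divide the current gap between the two lines by the rate at which the gap is closing. Here is a minimal Python sketch of that idea. The function name and every number in it are hypothetical, chosen only to keep the arithmetic easy to follow. They are not read from the Indeed.com graphs, so they will not reproduce the 3.35 and 1.87 figures above.

```python
def years_to_catch_up(gap, follower_slope, leader_slope=0.0):
    """Years until two straight trend lines meet.

    gap            -- how far the leader is ahead today
    follower_slope -- yearly change of the trailing line
    leader_slope   -- yearly change of the leading line (0.0 means flat)
    """
    closing_rate = follower_slope - leader_slope
    if closing_rate <= 0:
        raise ValueError("the follower is not gaining on the leader")
    return gap / closing_rate


# Hypothetical values, in arbitrary units of "share of job postings".
GAP = 3.0            # the leader's current advantage
R_SLOPE = 0.5        # the follower grows by 0.5 units per year
SAS_DECLINE = -0.25  # the leader shrinks by 0.25 units per year

flat = years_to_catch_up(GAP, R_SLOPE)                    # leader stays level
declining = years_to_catch_up(GAP, R_SLOPE, SAS_DECLINE)  # leader declines

print(f"Leader flat:      {flat:.2f} years")
print(f"Leader declining: {declining:.2f} years")
```

With these made-up inputs the estimate falls from 6 years to 4 years once the leader’s decline is included, the same direction of change as in the two scenarios above.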
A debate has been taking place on the Internet regarding the relative place of Python and R. Ironically, this debate about software to do data analytics has involved very little actual data. However, it is now possible to at least study the job trends. Figure 1a showed us that Python is well out in front of R, at least on that single day the searches were run. What has the data looked like over time? The answer is in Figure 1d.
Note that in this graph, Python appears to have a relatively slight advantage while in Figure 1a it had a huge one. The final point on the trend graph was recorded only two days after the queries used in Figure 1a were run, and the data changed very little in the meantime. The difference is due to the fact that Indeed.com has a limit on query length. Here is the query used for Figure 1d; it contains fewer analytic terms than the one used for Figure 1a.
R and ("big data" or "statistical analysis" or "data mining" or "data analytics" or "machine learning" or "quantitative analysis" or "business analytics" or "statistical software" or "predictive modeling") !"R D" !"A R" !"H R" !"R N" !toys !kids !" R Walgreen" !walmart !"HVAC R" !"R Bard" , python and ("big data" or "statistical analysis" or "data mining" or "data analytics" or "machine learning" or "quantitative analysis" or "business analytics" or "statistical software" or "predictive modeling")
The detailed description of how all the queries used in Figures 1a through 1d were constructed is located here.
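The query shown above follows a regular pattern: a software name, a parenthesized list of analytics phrases joined by “or”, then exclusions prefixed with “!”, with a comma separating the two trends being compared. As an illustration only, here is a Python sketch that assembles the same string. The function and variable names are my own inventions, and this is not necessarily how the original queries were put together.

```python
# Analytics phrases and R-specific exclusions, copied from the query above.
ANALYTICS_PHRASES = [
    "big data", "statistical analysis", "data mining", "data analytics",
    "machine learning", "quantitative analysis", "business analytics",
    "statistical software", "predictive modeling",
]
R_EXCLUSIONS = [
    '"R D"', '"A R"', '"H R"', '"R N"', "toys", "kids",
    '" R Walgreen"', "walmart", '"HVAC R"', '"R Bard"',
]


def analytics_clause(phrases):
    """Join quoted phrases with 'or' inside one pair of parentheses."""
    return "(" + " or ".join(f'"{p}"' for p in phrases) + ")"


def build_query(term, phrases, exclusions=()):
    """Build one trend query: the term, the analytics clause, then !exclusions."""
    parts = [f"{term} and {analytics_clause(phrases)}"]
    parts += [f"!{e}" for e in exclusions]
    return " ".join(parts)


def build_comparison(*queries):
    """Separate several trend queries with commas, as in the query above."""
    return " , ".join(queries)


query = build_comparison(
    build_query("R", ANALYTICS_PHRASES, R_EXCLUSIONS),
    build_query("python", ANALYTICS_PHRASES),
)
print(query)
print(len(query), "characters")
```

Printing the length makes it easy to see how much room each added phrase or exclusion uses up. The exact limit Indeed.com enforces is not stated here, so the sketch does not check against one.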
At this point, the rest of The Popularity of Data Analysis Software will continue, offering many additional perspectives on measuring analytics market share. However, until I update those sections in the coming months, they will not cover as broad a range of software. Stay tuned on Twitter, by following @BobMuenchen.
If you know SAS, SPSS or Stata and have not yet learned R, you can join me for this web-based workshop aimed at translating your knowledge into R. The next workshop begins on April 21. If you do know R and would like to learn more, you might enjoy taking Managing Data with R. The next time I’m offering that is on April 25.
19 thoughts on “Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!”
No doubt knowing R will help maximize your chances of getting a job, but there are far more jobs for SAS than R in the U.S. at the moment.
What about SQL?
You Wharton folks always ask interesting questions! I probably should have commented on this in the main article. Many job descriptions mention SQL directly, while others mention instead the need to be able to work with Oracle, MySQL, etc. So I faced the question of under-counting with just “SQL” as a search string, or perhaps over-counting if a mention of a database (but no SQL) actually referred to some other tool, such as scoring models using PMML. I decided to take it as a given that all analytics jobs require at least basic SQL skills. Excel is in a similar situation. Some surveys offer Excel as a choice for a “data mining” tool, which seems a bit of a stretch to me. However, probably all but the hardest-core open source folks in analytics jobs know basic Excel. So both are out at the moment, but I’m definitely open to rethinking those positions.
I just looked over the February 2014 Gartner report on Business Intelligence and Analytics Platforms. They look at 27 “platforms” (e.g. they view Cognos and all SPSS tools as simply “IBM”) and I look at 34 tools and we only overlap on 6! I was looking more for predictive or modeling analytics, while BI tends to be more reporting, but increasingly BI tools claim to include more advanced analytics. Gartner’s report on Advanced Analytics covers 16 major platforms, 11 of which overlap with mine. I’ll be looking at the ones they consider to be in the area to see if I should add them.
Great posting, and appreciate the detailed response.
From a quick sampling of predictive analytics postings out on Indeed, it seems to me that employers generally value experience with ANY package, and rarely specify one and only one language as necessary.
In other words, employers seem more concerned about domain experience and quant skills than specific language skills.
Yes, that’s what allowed me to get a lower-bound estimate on R jobs by searching for “R or SAS”, “SAS or R” etc. in 160 combinations (described here: http://r4stats.com/articles/how-to-search-for-analytics-jobs/). However, if I applied that type of search to every piece of software, I’ll bet the graph would come out much the same. The total counts would be lower since that approach would miss those jobs that want, for example, just a SAS programmer and they don’t want to retrain an SPSS programmer to fill that role.
With regard to the SQL question, and R sui generis (no idea whether either is possible):
– consider the use of R, at least, for developing within the database (PL/R, for example) as opposed to standard approach of RStudio, etc. accessing the database from an R session. PL/R is specific to PG (and should work with its descendants), and SAP, Oracle and Netezza provide similar support (could even be re-badged PL/R, for all I know). DB2 and SQL Server, not yet that I know of. Nor do I know whether any of the other stat packs have similar functionality.
– with regard to R, its use as stat command language and programming language is not so distinct compared to SAS/SPSS/etc. which have different syntaxes/languages for those functions, and tends to be viewed as a DSL by many. From that perspective, Python is the better comparator, likely. How much work gets done in each of these applications, between command and programming, will give us a clue as to trend. If command is dominant and growing, then displacing SAS/SPSS is tougher. If programming is ditto, then R has the advantage.
The SQL consideration will clearly be far fewer adverts, but tracking them, if possible, would spot an important trend (if increasing, of course). With multi-processor, SSD, behemoth memory machines possible for less than $10K, integrating (standard) R into database processing is no longer a pipe dream. That’s less than a month’s pay for a journeyman data scientist/analyst these days.
The syntax issue is more directly germane: positions needing data analysis with existing support from the stat pack are much different from “programming stats” jobs. Not to say that one is better than the other, but they’re quite different skill sets.
The “C / C++ / C#” category is meaningless 🙁
On a side-note, “C/C++” would be meaningless, too: http://www.stroustrup.com/bs_faq.html#C-slash
It’s comparable to listing “SAS” and “S-PLUS” together because they both start with an “S”… except maybe that at least “SAS” and “S-PLUS” have some things in common 😉
Your SAS / S-PLUS comparison cracks me up! C++ and C# certainly have important distinctions, but at least both are built on a foundation of C. SAS and S-PLUS don’t share such a common core of commands. However, I did wonder about putting them all into one category. I combined them because I hardly ever saw a job posting that asked for just one of those three without another right next to it. For analytics jobs, they’re often advertised as asking for C/C++ or C/C# or even C/C++/C# skills. To see this, go to Indeed.com and enter the following query:
“C++” (“big data” or “statistical analysis” or “data mining” or “data analytics” or “machine learning” or “quantitative analysis” or “business analytics” or “statistical software” or “predictive modeling”)
Then pick a few ads and search for “C++” or “C#” in the body of the ad (not just titles). That will show you what I mean; let me know what you think. I’m still open to the idea of listing them separately, but I’d be more likely to either pick one or drop them all as breaking them into three starts me down the slippery slope of trying to cover all languages, which I don’t have time to do.
Thanks for writing, you’ve gotten me rethinking this issue.
Yeah, I can see the point on the measurements count, but I still think the current combination is inappropriate.
The main point is that C# (despite its name) actually doesn’t have much in common with C (not even historically, other than the initial inspiration on a very superficial level, like curly braces (but by this criterion Java would be indistinguishable) — and not even in any “general” sense, there simply isn’t any useful common denominator here — at least not beyond what, say, SAS or S-PLUS have in “common” — as in, e.g., (not) using curly braces).
If anything, C and C++ have a *bit* (but only a little bit) more in common (both being often classified as “native” programming languages, i.e., ones for which all existing mainstream implementations compile to the native machine code by default, as opposed to a virtual machine bytecode of C# or Java). At least before the late 1980s there still was a subset of C that would be possible to compile with a C++ compiler (nowadays, of course, both languages have diverged significantly enough that we no longer even have a proper subset-superset relationship: even the built-in language constructs won’t be mutually compatible anymore: think, e.g., of the `auto` keyword; so even this would be quite a stretch perhaps).
Note that idiomatic/common/recommended programming practices are wildly different, too (with obvious implications for the average programmers’ productivity): in C you only have manual memory management, similarly in late-1980s/early-1990s C-with-classes (C++ written as C by programmers coming from C, although there already was some form of automatic resource management, particularly in the standard library — and the RAII achieved via constructors/destructors), Java or C# will give one some limited resource management support (limited, since usually only addressing automatic memory management — so-called “garbage collection” — but not addressing files/locks/threads/databases/network connections, all rather important for a data scientist), with modern C++11/C++14 turning in the direction of full automatic resource management (not just memory, although that’s pretty well supported, primarily via as-fast-as-doing-it-manually-but-automatic std::unique_ptr, and then std::shared_ptr/std::weak_ptr as the last resort(s)) and shying away from the manual (with the standard library being quite well-known and adopted by now).
If anything, it could perhaps make *slightly* more sense to keep “C” and “C++” together (“native programming languages” category?) — while instead making “C#” and “Java” combined (whether CLR or JVM, all existing mainstream implementations of both are generating virtual machine bytecode by default). But it would still be disputable whether that would be any more methodologically honest than the SAS/S-PLUS example — while one could say that “strong statistical foundation” would be a thing they have in common, I don’t think that makes it a good idea to combine them, since we’d lose a fair amount of information that way.
(Note, however, that spreading the “C/C++” term may cause a potential career disservice to the readers, see the link in my previous comment.)
Hope that helps!
I found your chart about the most popular analytics software very interesting, especially seeing Python in third place. Around the web most of the buzz seems to be around R, but in my experience Python has proven a more useful tool for analytical applications in business. I think also of note for Python is the fact that some of the world’s top data scientists, over at Kaggle, use it to win competitions. Even better, they publish their code for everyone to learn from.
I cover the topic of marketing analytics on the Vault Analytics website and am currently writing a series (http://vaultanalytics.com/marketinganalytics/) on how to get a career in analytics, in which I recommend learning Python. It’s nice to see your data backing up the suggestion.
We certainly have some great free choices available. Life is good!