Below is the latest update to The Popularity of Data Analysis Software.
Books
The number of books published on each software package or language reflects its relative popularity. Amazon.com offers an advanced search method which works well for all the software except R and the general-purpose languages such as Java, C, and MATLAB. I did not find a way to easily search for books on analytics that used such general-purpose languages, so I’ve excluded them from this section.
The Amazon.com advanced search configuration that I used was (using SAS as an example):
Title: SAS -excerpt -chapter -changes -articles
Subject: Computers & Technology
Condition: New
Format: All formats
Publication Date: After January, 2000
The “title” parameter allowed me to focus the search on books that included the software names in their titles. Other books may use a particular software in their examples, but they’re impossible to search for easily. SAS has many manuals for sale as individual chapters or excerpts. They contain “chapter” or “excerpt” in their title so I excluded them using the minus sign, e.g. “-excerpt”. SAS also has short “changes and enhancements” booklets that the developers of other packages release only in the form of flyers and/or web pages, so I excluded “changes” as well. Some software listed brief “articles” which I also excluded. I did the search on June 1, 2015, and I excluded excerpts, chapters, changes, and articles from all searches.
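As a rough illustration, the text typed into the Title field can be assembled the same way for every package. This is only a sketch: the function name `build_title_query` is hypothetical, and the code merely builds the search string; it does not contact Amazon.

```python
# Terms excluded from every search, as described in the text above.
EXCLUDED_TERMS = ["excerpt", "chapter", "changes", "articles"]

def build_title_query(software_name, excluded_terms=EXCLUDED_TERMS):
    """Return the software name followed by each excluded term prefixed with a minus sign."""
    exclusions = " ".join(f"-{term}" for term in excluded_terms)
    return f"{software_name} {exclusions}"

sas_query = build_title_query("SAS")
print(sas_query)  # SAS -excerpt -chapter -changes -articles
```

Keeping the exclusions in one list makes it easy to apply the identical filter to each package, which is what the comparison depends on.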
“R” is a difficult term to search for since it’s used in book titles to indicate Registered Trademark as in “SAS(R)”. Therefore I verified all the R books manually.
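A first-pass filter for the trademark problem might look like the sketch below. It is not the method used for the counts here (those were verified by hand); the pattern, the function name `might_be_r_book`, and the sample titles are all assumptions made for illustration, and such a filter would still let through false positives that need manual review.

```python
import re

# A capital "R" standing alone: not part of a longer word and not
# wrapped in parentheses as in the trademark marker "SAS(R)".
STANDALONE_R = re.compile(r"(?<![\w(])R(?![\w)])")

def might_be_r_book(title):
    """Return True if the title contains a standalone capital R."""
    return STANDALONE_R.search(title) is not None

# Hypothetical titles, for illustration only.
titles = [
    "SAS(R) Programming Basics",
    "Data Analysis with R",
    "R for Beginners",
]
candidates = [t for t in titles if might_be_r_book(t)]
print(candidates)
```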
The results are shown in Table 1, where it’s clear that a very small number of analytics software packages dominate the world of book publishing. SAS has a huge lead with 576 titles, followed by SPSS with 339 and R with 240. SAS and SPSS both have many versions of the same book or manual still for sale, so their numbers are both inflated as a result. JMP and Hadoop both had fewer than half of R’s count, and then Minitab and Enterprise Miner had fewer than half again as many. Although I obtained counts on all 27 of the domain-specific (i.e. not general-purpose) analytics software packages or languages shown in Figure 2a, I cut the table off at software that had 8 or fewer books to save space.
Software            Number of Books
SAS                 576
SPSS Statistics     339
R                   240 [Corrected from: 172]
JMP                 97
Hadoop              89
Stata               62
Minitab             33
Enterprise Miner    32
Table 1. The number of books whose titles contain the name of each software package.
[Correction: Thanks to encouragement from Bernhard Lehnert (see comments below) the count for R has been corrected from 172 to the more accurate 240.]
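The "fewer than half" comparisons in the text can be checked directly against the counts in Table 1. The sketch below uses only the published counts; the variable names are arbitrary, and "half again as many" is read here as less than half of the smaller of the JMP and Hadoop counts, which is an assumption about the intended meaning.

```python
# Counts copied from Table 1.
books = {
    "SAS": 576,
    "SPSS Statistics": 339,
    "R": 240,
    "JMP": 97,
    "Hadoop": 89,
    "Stata": 62,
    "Minitab": 33,
    "Enterprise Miner": 32,
}

# JMP and Hadoop each have fewer than half of R's count.
below_half_of_r = all(books[name] < books["R"] / 2 for name in ("JMP", "Hadoop"))

# Minitab and Enterprise Miner each have fewer than half of the
# smaller of the JMP and Hadoop counts.
reference = min(books["JMP"], books["Hadoop"])
below_half_again = all(
    books[name] < reference / 2 for name in ("Minitab", "Enterprise Miner")
)

print(below_half_of_r, below_half_again)
```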
Bob,
Since books in this category tend to stay in print for years, I think this method tends to overstate SAS’ share.
I tried to reproduce your results. Using the same search filter for SAS, I got a count of 698 items after January 2000. Repeating the search for each subsequent year, you get the number of titles published each year. Of course, this does not count titles published each year but no longer available on Amazon. Figuring that titles stay in print for at least three years, let’s just look at 2012-2014.
2000 698
2001 668
2002 657
2003 637
2004 614
2005 562
2006 529
2007 496
2008 459
2009 397
2010 324
2011 271
2012 221 (50)
2013 159 (62)
2014 96 (63)
Here’s R-project:
2011 97
2012 83 (14)
2013 62 (21)
2014 37 (25)
SPSS:
2011 122
2012 106 (16)
2013 80 (26)
2014 47 (33)
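The figures in parentheses in the lists above appear to be the drop from the previous row, that is, the count for titles published after one January minus the count for titles published after the next. That reading is an assumption, but it reproduces every parenthesized value. A minimal sketch (the function name is hypothetical):

```python
def drop_from_previous_year(counts):
    """Map each year to the previous year's count minus its own count."""
    years = sorted(counts)
    return {
        later: counts[earlier] - counts[later]
        for earlier, later in zip(years, years[1:])
    }

# "Published after January of year X" counts copied from the comment.
sas = {2011: 271, 2012: 221, 2013: 159, 2014: 96}
r_project = {2011: 97, 2012: 83, 2013: 62, 2014: 37}
spss = {2011: 122, 2012: 106, 2013: 80, 2014: 47}

print(drop_from_previous_year(sas))        # {2012: 50, 2013: 62, 2014: 63}
print(drop_from_previous_year(r_project))  # {2012: 14, 2013: 21, 2014: 25}
print(drop_from_previous_year(spss))       # {2012: 16, 2013: 26, 2014: 33}
```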
An alternative way to approach the problem is to filter for a topic, then review the titles. For example, if we search for “Data Science” and filter for books published in the last 30 days, we get this:
Agnostic – 54
R – 1
Google Cloud – 1
SPSS – 1
Neo4j – 1
Oracle – 1
Apache Solr – 1
MongoDB
🙂
Regards,
Thomas
Hi Thomas,
Nice work! I just repeated my search on [Title: SAS -excerpt -chapter -changes -articles] and got 584 instead of the 576 on June 1. Since you got 698, there must be some difference in how we’re searching.
I really like your idea of searching for topics like “Data Science” but it would have to accept a lot of “OR” conditions to cover the gamut. The similar search I do on Google Scholar to focus the results for general-purpose software on analytics is:
“statistical analysis” OR “t test” OR
“regression analysis” OR “quantitative analysis” OR
“data analytics” OR “machine learning” OR
“artificial intelligence” OR “analysis of variance” OR
“anova” OR “chi square” OR “data mining”
Those happen to be the phrases that got the top counts when searched one at a time. Books could benefit from a similar list, but so many packages had one or zero books that I didn’t bother to dig that far into it. That’s too bad, as I think the number of books (though inflated for some) provides much more reliable data than other sources such as surveys.
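For anyone wanting to reuse that list, a long OR condition like the one above can be generated from a plain list of phrases. This is only a sketch: the function name `build_or_query` is hypothetical, the output uses straight quotation marks, and the code just builds the text of the query without contacting any search service.

```python
# Phrases copied from the Google Scholar search shown above.
PHRASES = [
    "statistical analysis", "t test",
    "regression analysis", "quantitative analysis",
    "data analytics", "machine learning",
    "artificial intelligence", "analysis of variance",
    "anova", "chi square", "data mining",
]

def build_or_query(phrases):
    """Quote each phrase and join the results with OR."""
    return " OR ".join(f'"{phrase}"' for phrase in phrases)

query = build_or_query(PHRASES)
print(query)
```

Keeping the phrases in a list makes it simple to add or drop a term and rerun the same search for each package.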
Cheers,
Bob
“The number of books published on each software package or language reflects its relative popularity”
Do you think this is a valid premise? I wonder if there is some bias for people who use free and open source languages (namely, R in this case) to prefer learning from web searches rather than paying for a book.
Another interesting analysis might be to look at the count of unique authors instead of unique titles. If books are published by the same company that creates the software, the economics and motivation for creating works are probably very different than for an individual writing a book about a language that they didn’t create, but find very useful.
Hi Seth,
In the main article I list the various types of data that I collect and the strengths and weaknesses of each:
Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
Books – the number of books that include a software’s name in their titles is particularly useful information since writing a book requires significant effort and publishers do their own study of market share before taking the risk of publishing. However, it can be difficult to search for books that use general-purpose languages and also focus only on analytics.
Website Popularity – the PageRank measure is objective data, and for sites that clearly focus on analytics, it’s unbiased and especially useful for weeding out the weaker software. However, so much market consolidation has occurred that now focused analytic tools like SPSS are listed under corporations with much broader interests (IBM in that case). In addition, for general-purpose software like Java, many sites that discuss programming point to http://www.java.com for reasons that have nothing to do with its use for analytics.
Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. Unfortunately, this measure is very hard to collect except where sites exist to maintain such lists.
Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling” in which the survey provider tries to widely distribute the link and then vendors vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the information is often limited, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires just a few mouse clicks, rather than the commitment required to place a job advertisement or publish a scholarly article, book or blog post. As a result, it’s not unusual to see market share jump 100% or drop 50% in a single year, which is very unlikely to reflect changes in actual use.
Discussion Forum Activity – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others. While talk may be cheap, it’s still a good indicator of popularity.
Programming Activity – some software development is focused into repositories such as GitHub. That allows people to count the number of lines of programming code written for each project in a given time period. This is an excellent measure of popularity since writing programs or changing them requires substantial commitment. However, very popular commercial software may not have much user development activity.
Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
IT Research Firm Reports – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software is now and where they’re headed. While I find the reports very interesting reading, they often focus on the company level so it’s harder to get package-level information from them.
Sales or Download Measures – the commercial analytics field has undergone a major merger and acquisition phase so that now it is hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role, and the download figures that a few packages offer are dicey at best.
Competition Use – organizations that sponsor analytic competitions occasionally report what the winners tend to use. Unfortunately this information is only sporadically available.
Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t provide this information and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.
I totally agree that more can be done with books, but maintaining all that data takes quite a lot of time! What makes it tough to compare cost of books vs. free web searches is that all books are now available for free via countless pirate sites.
Cheers,
Bob
Hi!
Thank you for the work to try to get valid data on this interesting topic. However, I see no point in comparing an actual search engine at Amazon with a handwritten and annotated list on http://www.r-project.org. Andy Field has written an enormously successful book on statistics using SPSS and then switched to R. His book is a classic for students but it is not listed on http://www.r-project.org/doc/bib/R-books.html . Kruschke’s well-known book on Bayesian statistics uses R but is not on that list. On the other hand, Chambers’s book “Software for Data Analysis” is obviously a book on R and is on the list but you’d never guess that from the title. Burns’s “R Inferno” is a very worthwhile book that every experienced R user should read at some point, but it is not on the list.
So there seems to be a considerable bias in the different sources for R books as opposed to the other books.
Cheers,
Bernhard
Hi Bernhard,
Thanks for encouraging me to do what I should have done in the first place: verify the search manually. I went through the first 60 pages of results and found 240 books on R. They were getting very sparse by that point so I hope I only missed a small number. Please check out how I reworded the post and please verify that I spelled your name correctly!
Thanks,
Bob
Hi Bob,
That was fast! And a lot of work. Being an R enthusiast, I am inclined to point out that not only books with “R”, the name of the implementation, but also those with “S”, the name of the language, should be included. MASS ( http://www.stats.ox.ac.uk/pub/MASS4/ ) would be an example which has some impact in the R community, but I can see that this is another one-letter name which will bring up a lot of false hits, and I guess the number of additional books added by this may be negligible (a quick look at my bookshelf shows MASS as the only specimen of that kind in my home). So you can probably leave it at that.
Again, thanks for the work on the topic!
Cheers,
Bernhard
Hi Bernhard,
Yeah, I thought about that afterwards. I think I mentioned previously that there are 7 books on S that are still considered core references for R. I’m not sure MASS was one of those. At least the count is much closer now!
Cheers,
Bob