Analytics | r4stats.com

Business Intelligence and Data Science Groups in East Tennessee

The Knoxville area has four groups that help people learn about business intelligence and data science.

The Knoxville R Users Group (KRUG) focuses on the free and open source R language. Each meeting begins with a bit of socializing followed by a series of talks given by its members or guests. The talks range from brief five-minute demos of an R function to 45-minute in-depth coverage of some method of analysis. Beginning tutorials on R are occasionally offered as well. Membership is free of charge, but donations are accepted to defray the cost of snacks and web site maintenance. You can join at the KRUG web site.

Data Science KNX is a group of people interested in the broad field of data science. Members range from beginners to experts. As their web site states, their “…aim is to maintain a forum for connecting people around data science specific topics such as tutorials and their applications, local success stories, discussions of new technologies, and best practices. All are welcome to attend, network, and present!” You can go here to join at the Data Science KNX web site. Membership is free, though the group gladly accepts donations to help defray the costs of the pizza and beer provided at their meetings.

The East Tennessee Business Intelligence Users Group is “committed to learning, sharing, and advancing the field of Business Intelligence in the East Tennessee region.” They meet several times each year featuring speakers who demonstrate business intelligence software such as IBM’s Watson and Microsoft’s PowerBI. Meetings are at lunch and a meal is provided by sponsoring companies. Membership is free, and so is the lunch! You can join the group at their web site.

Each spring and fall, The University of Tennessee’s Department of Business Analytics and Statistics offers a Business Analytics Forum that features speakers from both industry and academia. The group consists of non-competing companies for whom business analytics is an important part of their operation. Forum members work together to share best practices and to develop more effective strategies. The forum is open to paid members only and you can join on their registration page.

Using Discussion Forum Activity to Estimate Analytics Software Market Share

I’m finally getting around to overhauling the Discussion Forum Activity section of The Popularity of Data Analysis Software. To save you the trouble of reading all 43 pages, I’m posting just this section below.

Discussion Forum Activity

Another way to measure software popularity is to see how many people are helping one another use each package or language. While such data is readily available, it too has its problems. Menu-driven software like SPSS or workflow-driven software such as KNIME are quite easy to use and tend to generate fewer questions. Software controlled by programming requires the memorization of many commands and requiring more support. Even within languages, some are harder to use than others, generating more questions (see Why R is Hard to Learn).

Another problem with this type of data is that there are many places to ask questions and each has its own focus. Some are interested in a classical statistics perspective while others have a broad view of software as general-purpose programming languages. In recent years, companies have set up support sites within their main corporate web site, further splintering the places you can go to get help. Usage data for such sites is not readily available.

Another problem is that it’s not as easy to use logic to focus in on specific types of questions as it was with the data from job advertisements and scholarly articles discussed earlier. It’s also not easy to get the data across time to allow us to study trends. Finally, the things such sites measure include: software group members (a.k.a. followers), individual topics (a.k.a. questions or threads), and total comments across all topics (a.k.a. total posts). This makes combining counts across sites problematic.

Two of the biggest sites used to discuss software are LinkedIn and Quora. They both display the number of people who follow each software topic, so combining their figures makes sense. However, since the sites lack any focus on analytics, I have not collected their data on general purpose languages like Java, MATLAB, Python or variants of C. The results of data collected on 10/17/2015 are shown here:

We see that R is the dominant software and that moving down through SAS, SPSS, and Stata results in a loss of roughly half the number of people in each step. Lavastorm follows Stata, but I find it odd that there was absolutely zero discussion of Lavastorm on Quora. The last bar that you can even see on this plot is the 62 people who follow Minitab. All the ones below that have tiny audiences of fewer than 10.

Next let’s examine two sites that focus only on statistical questions: Talk Stats and Cross Validated. They both report the number of questions (a.k.a. threads) for a given piece of software, allowing me to total their counts:

We see that R has a 4-to-1 lead over the next most popular package, SPSS. Stata comes in at 3rd place, followed by SAS. The fact that SAS is in fourth place here may be due to the fact that it is strong in data management and report writing, which are not the types of questions that these two sites focus on. Although MATLAB and Python are general purpose languages, I include them here because the questions on this site are within the realm of analytics. Note that I collected data on as many packages as were shown in the previous graph, but those not shown have a count of zero. Julia appears to have a count of zero due to the scale of the graph, but it actually had 5 questions on Cross Validated.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R? I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Rexer Analytics Survey Results

Rexer Analytics has released preliminary results showing the usage of various data science tools. I’ve added the results to my continuously-updated article, The Popularity of Data Analysis Software. For your convenience, the new section is repeated below.

Surveys of Use

One way to estimate the relative popularity of data analysis software is though a survey. Rexer Analytics conducts such a survey every other year, asking a wide range of questions regarding data science (previously referred to as data mining by the survey itself.) Figure 6a shows the tools that the 1,220 respondents reported using in 2015.

Figure 6a. Analytics tools used. — Figure 6a. Analytics tools used by respondents to the Rexer Analytics Survey. In this view, each respondent was free to check multiple tools.

We see that R has a more than 2-to-1 lead over the next most popular packages, SPSS Statistics and SAS. Microsoft’s Excel Data Mining software is slightly less popular, but note that it is rarely used as the primary tool. Tableau comes next, also rarely used as the primary tool. That’s to be expected as Tableau is principally a visualization tool with minimal capabilities for advanced analytics.

The next batch of software appears at first to be all in the 15% to 20% range, but KNIME and RapidMiner are listed both in their free versions and, much further down, in their commercial versions. These data come from a “check all that apply” type of question, so if we add the two amounts, we may be over counting. However, the survey also asked, “What one (my emphasis) data mining / analytic software package did you use most frequently in the past year?” Using these data, I combined the free and commercial versions and plotted the top 10 packages again in figure 6b. Since other software combinations are likely, e.g. SAS and Enterprise Miner; SPSS Statistics and SPSS Modeler; etc. I combined a few others as well.

Figure 6b. The percent of survey respondents who checked each package as their primary tool. — Figure 6b. The percent of survey respondents who checked each package as their *primary* tool. Note that free and commercial versions of KNIME and RapidMiner are combined. Multiple tools from the same company are also combined. Only the top 10 are shown.

In this view we see R even more dominant, with over a 3-to-1 advantage compared to the software from IBM SPSS and SAS Institute. However, the overall ranking of the top three didn’t change. KNIME however rises from 9th place to 4th. RapidMiner rises as well, from 10th place to 6th. KNIME has roughly a 2-to-1 lead over RapidMiner, even though these two packages have similar capabilities and both use a workflow user interface. This may be due to RapidMiner’s move to a more commercially oriented licensing approach. For free, you can still get an older version of RapidMiner or a version of the latest release that is quite limited in the types of data files it can read. Even the academic license for RapidMiner is constrained by the fact that the company views “funded activity” (e.g. research done on government grants) the same as commercial work. The KNIME license is much more generous as the company makes its money from add-ons that increase productivity, collaboration and performance, rather than limiting analytic features or access to popular data formats.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Free Webinar: Intro to SparkR

Are you interested in combining the power of R and Spark? An “Intro to SparkR”
webinar will take place on July 15, 2015 at 10 am California time. Everyone is welcome
to attend.

Agenda:
– What is SparkR?
– Recent improvements to SparkR
– SparkR Roadmap
– Live Demo
– Q & A

Speaker:
Shivaram Venkataraman, Co-author of SparkR

Duration: 45-60 minutes

Cost: $0

Location: Internet

Registration:
https://attendee.gotowebinar.com/register/4761879673365920770

Estimating Analytics Software Market Share by Counting Books

Below is the latest update to The Popularity of Data Analysis Software.

Books

The number of books published on each software package or language reflects its relative popularity. Amazon.com offers an advanced search method which works well for all the software except R and the general-purpose languages such as Java, C, and MATLAB. I did not find a way to easily search for books on analytics that used such general purpose languages, so I’ve excluded them in this section.

The Amazon.com advanced search configuration that I used was (using SAS as an example):

Title: SAS -excerpt -chapter -changes -articles 
Subject: Computers & Technology
Condition: New
Format: All formats
Publication Date: After January, 2000

The “title” parameter allowed me to focus the search on books that included the software names in their titles. Other books may use a particular software in their examples, but they’re impossible to search for easily. SAS has many manuals for sale as individual chapters or excerpts. They contain “chapter” or “excerpt” in their title so I excluded them using the minus sign, e.g. “-excerpt”. SAS also has short “changes and enhancements” booklets that the developers of other packages release only in the form of flyers and/or web pages, so I excluded “changes” as well. Some software listed brief “articles” which I also excluded. I did the search on June 1, 2015, and I excluded excerpts, chapters, changes, and articles from all searches.

“R” is a difficult term to search for since it’s used in book titles to indicate Registered Trademark as in “SAS(R)”. Therefore I verified all the R books manually.

The results are shown in Table 1, where it’s clear that a very small number of analytics software packages dominate the world of book publishing. SAS has a huge lead with 576 titles, followed by SPSS with 339 and R with 240. SAS and SPSS both have many versions of the same book or manual still for sale, so their numbers are both inflated as a result. JMP and Hadoop both had fewer than half of R’s count and then Minitab and Enterprise Miner had fewer then half again as many. Although I obtained counts on all 27 of the domain-specific (i.e. not general-purpose) analytics software packages or languages shown in Figure 2a, I cut the table off at software that had 8 or fewer books to save space.

Software        Number of Books 
SAS                  576
SPSS Statistics      339
R                    240    [Corrected from: 172]
JMP                   97
Hadoop                89
Stata                 62
Minitab               33
Enterprise Miner      32

Table 1. The number of books whose titles contain the name of each software package.

[Correction: Thanks to encouragement from Bernhard Lehnert (see comments below) the count for R has been corrected from 172 to the more accurate 240.]

R Now Contains 150 Times as Many Commands as SAS

by Bob Muenchen

In my ongoing quest to analyze the world of analytics, I’ve updated the Growth in Capability section of The Popularity of Data Analysis Software. To save you the trouble of foraging through that tome, I’ve pasted it below.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/, and I collected the data for later versions following his method.

Figure 9 shows the number of R packages on CRAN for the last version released in each year. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995). The right-most point is for version 3.1.2, the last version released in late 2014.

Fig_9_CRAN — Figure 9. Number of R packages available on its main distribution site for the last version released in each year.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version, 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). In 2014, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2014 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in
Figure 9. A program run on 5/22/2015 counted 8,954 R packages at all major repositories, 6,663 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates to CRAN that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 34.4% higher on the y-axis than the one shown in Figure 9. Therefore, the estimated total growth in R functions for 2014 was 28,260 * 1.344 or 37981.

As with any analysis software, individuals also maintain their own separate collections typically available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 20.37 functions per package. Since a program run on 5/22/2015 counted 8,954 R packages at all major repositories except GitHub, on that date there were approximately 182,393 total functions in R. In total, R has over 150 times as many commands as SAS.

I invite you to follow me here or at http://twitter.com/BobMuenchen. If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop, R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do R training on-site.

I’ve Been Replaced by an Analytics Robot

It was only a few years ago when the N.Y. Times declared my job “sexy”. My old job title of statistician had sounded dull and stodgy, but then it became filled with exciting jargon: I’m a data scientist doing predictive analytics with (occasionally) big data. Three hot buzzwords in a single job description! However, in recent years, the powerful technology that has made my job so buzzworthy has me contemplating the future of the field. Computer programs that automatically generate complex models are becoming commonplace. Rob Hyndman’s forecast package for R, SAS Institite’s Forecast Studio, and IBM’s SPSS Forecasting offer the ability to generate forecasts that used to require years of training to develop. Similar tools are now available for other types of models as well.

Countless other careers have been eliminated due to new technology. The United States previously had over 70% of the population employed in farming and fewer than 2% are farmers today. Things change, people move on to other careers. The KDnuggests web site recently asked its readers, “When will most expert-level Predictive Analytics/Data Science tasks – currently done by human Data Scientists – be automated?” Fifty-one percent of the respondents – most of them data scientists themselves – estimated that this would happen within 10 years. Not all the respondents had such a dismal view though; 19% said that this would never happen.

My brain being analyzed by the machine that replaced my brain! (Photograpy by Mike O’Neil)

If you had asked me in 1980 what would be the very last part of my job to be eliminated through automation, I probably would have said: brain wave analysis. It had far more steps involved than any other type of work I did. We were measuring the electrical activity of many parts of the brain, at many frequencies, thousands of times per second. An analysis that simply compared two groups would take many weeks of full-time work. Surprisingly, this was the first part of my job to be eliminated. However, our statistical consulting team supports many different departments, so I didn’t really notice when work stopped arriving from the EEG Lab. Years later I got a call from the new lab director offering to introduce me to my replacement: a “robot” named LORETA.

When I visited the lab, I was outfitted with the usual “bathing cap” full of electrodes. EEG paste (essentially K-Y jelly) was squirted into a hole in each electrode to ensure a good contact and the machine began recording my brain waves. I used bio-feedback to generate alpha waves which made a car go around a track in a simple video game. Your brain creates alpha waves when you get into a very relaxed, meditative state. Moments after I finished, LORETA had already analyzed my brain waves. “She” had done several weeks of analysis in just a few moments.

So that part of my career ended years ago, but I didn’t really notice it at the time. I was too busy using the time LORETA freed up to learn image analysis using ImageJ, text mining using WordStat and SAS Text Miner, and an endless variety of tasks using the amazing
R language. I’ve never had a moment when there wasn’t plenty of interesting new work to do.

There’s another aspect to my field that’s easy to overlook. When I began my career, 90% of the time was spent “battling” computers. They were incredibly difficult to operate. Today someone may send you a data file and you’ll be able to see the data moments after receiving it. In 1980 data arrived on tapes, and every computer manufacturer used a different tape format, each in numerous incompatible variations. Unless you had a copy of the program that created a tape, it might take days of tedious programming just to get the data off of it. Even asking the computer to run a program required error-prone Job Control Language. So from that perspective, easier-to-use computing technology has already eliminated 90% of what my job used to be. It wasn’t the interesting part of the job, so it was a change for the better.

Will the burgeoning field of data science eventually put itself out of business by developing a LORETA for every problem that needs to be solved? Will we just be letting our Star-Trek-class computers and robots do our work for us while we lounge around self-actualizing? Perhaps some day, but I doubt it will happen any time soon!

Stata’s Academic Growth Nearly as Fast as R’s

by Bob Muenchen

Analytics tools take significant effort to master, so once learned people tend to stick with them for much of their careers. This makes the tools used in academia of particular interest in the study of future trends of market share. I’ve been tracking The Popularity of Data Analysis Software regularly since 2010, and thanks to an astute reader, I now have a greatly improved estimate of Stata’s academic growth. Peter Hedström, Director of the Institute for Analytical Sociology at Linköping University, wrote to me convinced that I was underestimating Stata’s role by a wide margin, and he was right.

Fig_2e_ScholarlyImpactBig6

Two things make Stata’s popularity difficult to guage: 1) Stata means “been” in Italian, and 2) it’s a common name for the authors of scholarly papers and those they cite. Peter came up with the simple, but very effective, idea of adding Statacorp’s headquarter, College Station, Texas, to the search. That helped us find far more Stata articles while blocking the irrelevant ones. Here’s the search string we came up with:

("Stata" "College Station") OR "StataCorp" OR "Stata Corp" OR 
"Stata Journal" OR "Stata Press" OR "Stata command" OR 
"Stata module"

The blank between Stata and College Station is an implied logical “and”. This string found 20% more articles than my previous one. This success motivated me to try and improve some of my other search strings. R and SAS are both difficult to search for due to how often those letters stand for other things. I was able to improve my R search string by 15% using this:

"r-project.org" OR "R development core team" OR "lme4" OR 
"bioconductor" OR "RColorBrewer" OR "the R software" OR 
"the R project" OR "ggplot2" OR "Hmisc" OR "rcpp" OR "plyr" OR 
"knitr" OR "RODBC" OR "stringr" OR "mass package"

Despite hours of effort, I was unable to improve on the simple SAS search string of “SAS Institute.” Google Scholar’s logic seems to fall apart since “SAS Institute” OR “SAS procedure” finds fewer articles! If anyone can figure that out, please let me know in the comments section below. As usual, the steps I use to document all searches are detailed here.

The improved search strings have affected all the graphs in the Scholarly Articles section of The Popularity of Data Analysis Software. At the request of numerous readers, I’ve also added a log-scale plot there which shows the six most popular classic statistics packages:

If you’re interested in learning R, DataCamp.com offers my 16-hour interactive workshop,
R for SAS, SPSS and Stata Users for $25. That’s a monthly fee, but it definitely won’t take you a month to take it! For students & academics, it’s $9. I also do training on-site but I’m often booked about 8 weeks out.

I invite you to follow me on this blog and on Twitter.

Fastest Growing Software for Scholarly Analytics: Python, R, KNIME…

In my ongoing quest to “analyze the world of analytics”, I’ve added the following section below to The Popularity of Data Analysis Software:

It would be useful to have growth trend graphs for each of the analytics packages I track, but collecting such data is too time consuming since it must be re-collected every year (since search algorithms change). What I’ve done instead is collect data only for the past two complete years, 2013 and 2014. Figure 2e shows the percent change from 2013 to 2014, with the “hot” packages whose use is growing shown in red. Those whose use is declining or “cooling” are shown in blue. Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 100 articles in 2013. Going from one to five articles may represent 500% growth, but it’s not of much interest.

Figure 2e. Change in the number of scholarly articles using each software in the most recent two complete years (2013 to 2014). Packages shown in red are "hot" and growing, while those shown in blue are "cooling down" or declining. — Figure 2e. Change in the number of scholarly articles using each software in the most recent two complete years (2013 to 2014). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

The three fastest growing packages are all free and open source: Python, R and KNIME. All three saw more than 25% growth. Note that the Python figures are strictly for analytics use as defined here. At the other end of the scale are SPSS and SAS, both of which declined in use by around 25%. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

Three of the packages whose use is growing implement the powerful and easy-to-use workflow or flowchart user interface: KNIME, RapidMiner and SPSS Modeler. As useful as that approach is, it’s not sufficient for success as we see with SAS Enterprise Miner, whose use declined nearly 15%.

It will be particularly interesting to see what the future holds for KNIME and RapidMiner. The companies were two of only four chosen by the Gartner Group as having both a complete vision of the future and the ability to execute that vision (Fig. 7a). Until recently, both were free and open source. RapidMiner then started charging for its current version, leaving its older version as the only free one. Recent offers to make it free for academic use don’t include use on projects with grant funding, so I expect KNIME’s higher rate of growth to remain faster than RapidMiner’s. However, in absolute terms, scholarly use of RapidMiner is currently almost twice that of KNIME, as shown in Fig. 2b.

Google Scholar Finds Far More SPSS Articles; Analytics Forecast Updated

Only last August I wrote that among scholars, the use of R had probably exceeded that of SPSS to become their most widely used software for analytics. That forecast was based on Google Scholar searches focused on one year at a time, from 1995 through 2014. Each year from 2010 through 2014, I re-collected that entire data set just in case Google changed the search algorithm enough to affect the overall pattern. The data stayed roughly the same for those four years, but Google Scholar now finds almost twice as many articles for SPSS (at its peak year of 2008) than it found last year and 12% more articles for SAS. Changes in search results for articles that used R varied slightly with fewer in the early years and more in the latter ones. So R did not become the most widely used analytics software among academics in 2014. It’s unlikely to become so for another two years, unless present trends change.

So what happened? We’re looking back across many years, so while it’s possible that SPSS suddenly became much more popular in 2014, that could not account for lifting the whole trend line. It’s possible Google Scholar improved its algorithm to find articles that existed previously. It’s also possible that new journal archives have opened themselves up to being indexed by Google. Why would it affect SPSS more than SAS or R? SPSS is menu-driven so it’s easy to install with its menus and dialog boxes translated into many languages. Since SAS and R are much more frequently used via their English-based languages, they may not be as popular in non-English speaking countries. Therefore, one might see a disproportionate impact on SPSS by new non-English archives becoming available. If you have an alternate hypothesis, please leave it in the comments below.

The remainder of this post is the complete updated section on this topic from The Popularity of Data Analysis Software:

Scholarly Articles

The more popular a software package is, the more likely it will appear in scholarly publications as a topic or as a tool of analysis. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect and will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.

Figure 2a shows the number of articles found for each software package for the most recent complete year, 2014. SPSS is by far the most dominant package, likely due to its balance between power and ease-of-use. SAS has around half as many, followed by MATLAB and R. The software from Java through Statgraphics show a slow decline in usage from highest to lowest. Note that the general purpose software C, C++, C#, MATLAB, Java and Python are included only when found in combination with analytics terms, so view those as much rougher counts than the rest.

Fig_2a_ScholarlyImpact2014 — Figure 2a. The number of scholarly articles that use each software package during the most recent complete year, 2014.

From RapidMiner on down, the counts appear to be zero. That’s not the case, the counts are just very low compared to the more popular packages, used in tens of thousands articles. Figure 2b shows the software only for those packages that have fewer than 825 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. RapidMiner, KNIME, SPSS Modeler and SAS Enterprise Miner are packages that all use the powerful and easy-to-use workflow interface, but their use has not yet caught on among scholars. BMDP is one of the oldest packages in existence. Its use has been declining for many years, but it’s still hanging in there. The software in the bottom half of this figure contain the newcomers, with the notable exception of Megaputer, whose Polyanalyst software has been around for many years now.

Fig_2b_ScholarlyImpact2014 — Figure 2b. The number of scholarly articles for software that was used by fewer than 825 scholarly articles (i.e. the bottom part of Fig. 2a, rescaled.)

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2c I’ve plotted the same scholarly-use data for 1995 through 2014, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown. This is likely due to the fact that those two leaders faced increasing competition from many more software packages than can be shown in this type of graph (such as those shown in Figure 2a).

Fig_2a_ScholarlyImpactBig6 — Figure 2c. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

Since SAS and SPSS dominate the vertical space in Figure 2c by such a wide margin, I removed those two packages and added the next two most popular statistics packages, Systat and JMP, with the result shown in Figure 2d. Freeing up so much space in the plot now allows us to see that the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place. If the current trends continue, the use of R may pass that of SPSS and SAS by the end of 2016. Note that current trends have shifted before as discussed here.

Stata has moved into fourth place, crossing above Statistica in 2014. The growth in the use of Stata is more rapid than all the classic statistics packages except for R. The use of Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another. [Note that in the plots from previous years, Statistica was displayed as a flat line at the very bottom of the graph. That turned out to be a search-related artifact. Many academics who use Statistica don’t mention the package by software name but rather say something like, “we used the statistics package by Statsoft.”]

Fig_2a_ScholarlyImpactLittle6 — Figure 2d. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after market leaders SPSS and SAS have been removed.

I’ll announce future update on Twitter, where you can follow me as @BobMuenchen.