Blog

Poll Shows Open Source Almost Even with Commercial Analytics Software

The 2012 results of the annual KDnuggets poll are in. It shows R in first place with 30.7% of users reporting having used it for a real project. Excel is almost as popular. It seems out of place among so many more capable packages, but Excel is a tool that almost everyone has and knows how to use.

It’s interesting to note that four of the top five packages used were open source. While open source packages are clearly playing a major role in analytics, people still reported using more commercial software (1086) than open source (927).
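
To put "almost even" into numbers, here is a minimal sketch that turns the two counts quoted above into percentage shares. It assumes the counts are simple totals of reported uses that can be added together, which the poll summary does not spell out; the variable names are made up for illustration.

```r
# Counts quoted in the post (assumed to be totals of reported uses)
commercial  <- 1086
open_source <- 927

total  <- commercial + open_source
# Percentage share of each category, rounded to one decimal place
shares <- round(100 * c(commercial = commercial, open_source = open_source) / total, 1)
shares
```

Under that assumption the split is roughly 54% commercial to 46% open source.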

For many other ways to measure analytic software popularity, see The Popularity of Data Analysis Software. I’ve just added this graph to that article.

Will 2015 be the Beginning of the End for SAS and SPSS?

[Since this was originally published in 2012, I’ve collected new data that renders this article obsolete. You can always see the most recent data here. -Bob Muenchen]

Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. I track this trend, and many others, in my article The Popularity of Data Analysis Software. In the latest update (4/13/2012) I forecast that, if current trends continued, the use of the R software would exceed that of SAS for scholarly applications in 2015. That was based on the data shown in Figure 7a, which I repeat here:

Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.

Here is the data from Google Scholar:

         R   SAS   SPSS
1995     8  8620   6450
1996     2  8670   7600
1997     6 10100   9930
1998    13 10900  14300
1999    26 12500  24300
2000    51 16800  42300
2001   133 22700  68400
2002   286 28100  88400
2003   627 40300  78600
2004  1180 51400 137000
2005  2180 58500 147000
2006  3430 64400 142000
2007  5060 62700 131000
2008  6960 59800 116000
2009  9220 52800  61400
2010 11300 43000  44500
2011 14600 32100  32000
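
The forecasting commands in the next section refer to objects named R, SAS and SPSS without showing how they were created. One plausible way to build them from the table above is sketched here. Storing them as plain numeric vectors is an assumption; it would be consistent with forecast rows labelled 18 to 22 rather than by calendar year. The names year and scholar are hypothetical.

```r
# Yearly Google Scholar hit counts, 1995-2011, copied from the table above
year <- 1995:2011
R    <- c(8, 2, 6, 13, 26, 51, 133, 286, 627, 1180, 2180, 3430,
          5060, 6960, 9220, 11300, 14600)
SAS  <- c(8620, 8670, 10100, 10900, 12500, 16800, 22700, 28100, 40300,
          51400, 58500, 64400, 62700, 59800, 52800, 43000, 32100)
SPSS <- c(6450, 7600, 9930, 14300, 24300, 42300, 68400, 88400, 78600,
          137000, 147000, 142000, 131000, 116000, 61400, 44500, 32000)

# Optional: keep everything together in one data frame for inspection
scholar <- data.frame(year, R, SAS, SPSS)
```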

ARIMA Forecasting

We can forecast the use of R using Rob Hyndman’s handy auto.arima function to forecast five years into the future:

> library("forecast")

> R_fit <- auto.arima(R)

> R_forecast <- forecast(R_fit, h=5)

> R_forecast

   Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
18          18258 17840 18676 17618 18898
19          22259 21245 23273 20709 23809
20          26589 24768 28409 23805 29373
21          31233 28393 34074 26889 35578
22          36180 32102 40258 29943 42417

We see that even if the use of SAS and SPSS were to remain at their current levels, R use would surpass theirs in 2016 (see the Point Forecast column, where rows 18-22 represent the years 2012-2016).
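
The crossover year can be read off the table by eye, but a short check makes the comparison explicit. This sketch copies the R point forecasts printed above and the 2011 counts from the Google Scholar table. The variable names are hypothetical, and holding SAS and SPSS flat at their 2011 levels is the same simplifying assumption made in the text.

```r
# R point forecasts for 2012-2016, copied from the output above
years   <- 2012:2016
r_point <- c(18258, 22259, 26589, 31233, 36180)

# 2011 counts for the two commercial packages, held constant
sas_2011  <- 32100
spss_2011 <- 32000

# First forecast year in which R exceeds both
crossover <- years[which(r_point > max(sas_2011, spss_2011))[1]]
crossover
```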

If we follow the same steps for SAS we get:

> SAS_fit <- auto.arima(SAS)

> SAS_forecast <- forecast(SAS_fit, h=5)

> SAS_forecast

   Point Forecast     Lo 80   Hi 80    Lo 95 Hi 95
18          21200  16975.53 25424.5  14739.2 27661
19          10300    853.79 19746.2  -4146.7 24747
20           -600 -16406.54 15206.5 -24774.0 23574
21         -11500 -34638.40 11638.4 -46887.1 23887
22         -22400 -53729.54  8929.5 -70314.4 25514

It appears that if the use of SAS continues to decline at its precipitous rate, all scholarly use of it will stop in 2014 (the number of articles published can’t be less than zero, so view the negatives as zero). I would bet Mitt Romney $10,000 that that is not going to happen!
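
Two things are worth checking in the SAS point forecasts printed above: what they look like once negatives are floored at zero, and how they were extrapolated. In this sketch (variable names are hypothetical) the numbers are copied from the output rather than recomputed, so it runs without the forecast package.

```r
# SAS point forecasts for 2012-2016, copied from the output above
years     <- 2012:2016
sas_point <- c(21200, 10300, -600, -11500, -22400)

# Article counts cannot be negative, so floor the forecasts at zero
sas_floor  <- pmax(sas_point, 0)
first_zero <- years[which(sas_floor == 0)[1]]

# Year-over-year change in the point forecasts
step <- unique(diff(sas_point))

# Observed change from 2010 (43000) to 2011 (32100)
last_drop <- 32100 - 43000
```

The point forecasts fall by the same amount every year, and that amount equals the observed drop from 2010 to 2011. In other words, the forecast simply extends the most recent decline in a straight line, which is why it reaches zero so quickly.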

I find the SPSS prediction the most interesting:

> SPSS_fit <- auto.arima(SPSS)

> SPSS_forecast <- forecast(SPSS_fit, h=5)

> SPSS_forecast

   Point Forecast   Lo 80 Hi 80   Lo 95  Hi 95
18        13653.2  -16301 43607  -32157  59463
19        -4693.6  -57399 48011  -85299  75912
20       -23040.4 -100510 54429 -141520  95439
21       -41387.2 -145925 63151 -201264 118490
22       -59734.0 -193590 74122 -264449 144981

The forecast has taken a logical approach of focusing on the steeper decline from 2005 through 2010 and predicting that this year (2012) is the last time SPSS will see use in scholarly publications. However, the part of the graph that I find most interesting is the shift from 2010 to 2011, which shows SPSS use still declining but at a much slower rate.

Any forecasting book will warn you of the dangers of looking too far beyond the data and I think these forecasts do just that. The 2015 figure in the Popularity paper and in the title of this blog post came from an exponential smoothing approach that did not match the rate of acceleration as well as the ARIMA approach does.

Colbert Forecasting

While ARIMA forecasting has an impressive mathematical foundation, it’s always fun to follow Stephen Colbert’s approach: go from the gut. So now I’ll present the future of analytics software that must be true, because it feels so right to me personally. This analysis has Colbert’s most important attribute: truthiness.

The growth in R’s use in scholarly work will continue for two more years, at which point it will level off at around 25,000 articles in 2014. This growth will be driven by:

  • The continued rapid growth in add-on packages (Figure 10)
  • The attraction of R’s powerful language
  • The near monopoly R has on the latest analytic methods
  • Its free price
  • The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).

What will slow R’s growth is its lack of a graphical user interface that:

  • Is powerful
  • Is easy to use
  • Provides journal style output in word processor format
  • Is standard, i.e. widely accepted as The One to Use
  • Is open source

While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its capabilities. Regardless of which is best, GUI users far outnumber programmers and, until this is resolved, it will limit R’s long-term growth. There are GUIs for R, but so many to choose from that none has become the clear leader (Deducer, R Commander, Rattle, Red-R, at least two from commercial companies and still more here). If a clear leader were to emerge from this “GUI chaos,” then R could continue its rapid growth and end up as the most used package.

The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata) but also by SAS Institute’s self-inflicted GUI chaos. For years they have offered too many GUIs, such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use could not decide what to teach, so they continued teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a good GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.

The use of SPSS for scholarly work will decline only slightly this year and will level off in 2013 because:

  • The people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
  • The people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
  • The people who needed a more advanced GUI have already switched to JMP

The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.

Stata’s growth will level off in 2013 at a level that will leave it in fourth place. The other packages shown in Figure 7b will also level off around the same time, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata in fourth place.

The futures of Enterprise Miner and SPSS Modeler are tied to the success of each company’s more mainstream product, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes.

So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to follow the detailed blog at Librestats to collect your own data from Google Scholar and do your own set of forecasts. Or simply go from the gut!

Things I’ve Learned About WordPress.com

I’m almost done moving this site from Google Sites to WordPress. This post describes some of the things I’ve learned about WordPress.com.

By default, WordPress.com makes your site look like a blog. I prefer that it look like a website that contains a blog. You can change that in Site Admin under Settings > Reading. The Front Page Displays box determines what people will see when they arrive. By default, that’s your latest blog entry. You can change that to any page you like.

WordPress allows a very limited set of file types to be downloaded. My book support files are in R, SAS, SPSS, Stata, sas7bdat, etc., so I zip them up into a single file. Since WordPress.com does not allow you to distribute zip files, I had to put them in my Dropbox public folder and link to them from WordPress.com.

You can organize your menus by either parent/child relationships among pages or by using custom menus. Custom menus have the advantage of allowing all pages to be at the root of your site, keeping nice short URLs like “https://r4stats.com/popularity”. However, many site templates do not support custom menus, so you are very constrained in your choice of templates. Using the “popularity” article as an example, I created a page called “Articles” and let it be the parent. So now the URL is “https://r4stats.com/articles/popularity”. That’s too bad since there are old links out there that use the short version. I’ll put notes on them to redirect people. I could use “redirects,” but I would prefer people to see the links and note the changes. That will address the many links that used “https://sites.google.com/site/r4statistics/” rather than the shorter equivalent, “http://r4stats.com”.

One of the most frustrating problems I saw resulted from WordPress.com not letting you change the URL to the one you wanted. For example, I wanted my Miscellaneous page to have the URL http://r4stats.wordpress.com/misc, but it insisted on adding a “-2” to the end of it, as in http://r4stats.wordpress.com/misc-2. I looked all over for a page that already used the “misc” link but found none. Then it occurred to me that it might be in the trash. It was. Once I deleted it from the trash, WordPress.com allowed me to reuse the simpler link.

I spent a crazy amount of time figuring out how to get my program code examples to display in Courier or any similar monospaced font. I paid the $30 fee to edit my cascading style sheet (CSS), only to find that allowed me to change the whole page to Courier. I finally found that you click the upper-right-most button on the toolbar labeled Show/Hide Kitchen Sink. That made a style menu appear. It contains a style named “Preformatted,” which is monospaced. Tal Galili helpfully pointed me to a wonderful article he had written on how to make R code look nice on WordPress.

So that’s the sum of my lessons so far. On the whole, I preferred the website tool at Google Sites, but I do like having WordPress blogs built right into the site. That makes it easy to hook the blogs into R-Bloggers.com and PROC-X.com.

Trying Out WordPress

I’ve had my site http://r4stats.com on Google Sites for a few years now and it’s time to try something new. Most of the articles there are not very blog-like. For example, The Popularity of Data Analysis Software is an article that I update many times a year. If you’ve never read it before, it would probably interest you. But after one thorough reading, only the major changes will be of interest. So I’m contemplating writing short blog articles of the major changes while maintaining the whole article for its completeness.

So I’ll be fiddling around here until I decide if I can really blend these two ideas in a blog format.