Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. I track this trend, and many others, in my article The Popularity of Data Analysis Software. In the latest update (4/13/2012) I forecast that, if current trends continued, the use of the R software would exceed that of SAS for scholarly applications in 2015. That was based on the data shown in Figure 7a, which I repeat here:
Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.
Here is the data from Google Scholar:
Year      R     SAS    SPSS
1995      8    8620    6450
1996      2    8670    7600
1997      6   10100    9930
1998     13   10900   14300
1999     26   12500   24300
2000     51   16800   42300
2001    133   22700   68400
2002    286   28100   88400
2003    627   40300   78600
2004   1180   51400  137000
2005   2180   58500  147000
2006   3430   64400  142000
2007   5060   62700  131000
2008   6960   59800  116000
2009   9220   52800   61400
2010  11300   43000   44500
2011  14600   32100   32000
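For readers who want to reproduce the forecasts below, here is a minimal sketch of one way to enter these counts as time series objects in R (the variable names match the code that follows):

# Google Scholar hits per year, 1995-2011, from the table above
R    <- ts(c(8, 2, 6, 13, 26, 51, 133, 286, 627, 1180, 2180,
             3430, 5060, 6960, 9220, 11300, 14600), start = 1995)
SAS  <- ts(c(8620, 8670, 10100, 10900, 12500, 16800, 22700, 28100, 40300,
             51400, 58500, 64400, 62700, 59800, 52800, 43000, 32100), start = 1995)
SPSS <- ts(c(6450, 7600, 9930, 14300, 24300, 42300, 68400, 88400, 78600,
             137000, 147000, 142000, 131000, 116000, 61400, 44500, 32000), start = 1995)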
ARIMA Forecasting
We can project the use of R five years into the future using Rob Hyndman’s handy auto.arima function:
> library("forecast")
> R_fit <- auto.arima(R)
> R_forecast <- forecast(R_fit, h=5)
> R_forecast
   Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
18          18258 17840 18676 17618 18898
19          22259 21245 23273 20709 23809
20          26589 24768 28409 23805 29373
21          31233 28393 34074 26889 35578
22          36180 32102 40258 29943 42417
We see that even if the use of SAS and SPSS were to remain at their current levels, R would surpass them in 2016 (see the Point Forecast column, where rows 18-22 represent the years 2012-2016).
If we follow the same steps for SAS we get:
> SAS_fit <- auto.arima(SAS)
> SAS_forecast <- forecast(SAS_fit, h=5)
> SAS_forecast
   Point Forecast     Lo 80   Hi 80    Lo 95 Hi 95
18          21200  16975.53 25424.5  14739.2 27661
19          10300    853.79 19746.2  -4146.7 24747
20           -600 -16406.54 15206.5 -24774.0 23574
21         -11500 -34638.40 11638.4 -46887.1 23887
22         -22400 -53729.54  8929.5 -70314.4 25514
It appears that if the use of SAS continues to decline at its precipitous rate, all scholarly use of it will stop in 2014 (the number of articles published can’t be less than zero, so view the negatives as zero). I would bet Mitt Romney $10,000 that that is not going to happen!
I find the SPSS prediction the most interesting:
> SPSS_fit <- auto.arima(SPSS)
> SPSS_forecast <- forecast(SPSS_fit, h=5)
> SPSS_forecast
   Point Forecast    Lo 80 Hi 80   Lo 95  Hi 95
18        13653.2   -16301 43607  -32157  59463
19        -4693.6   -57399 48011  -85299  75912
20       -23040.4  -100510 54429 -141520  95439
21       -41387.2  -145925 63151 -201264 118490
22       -59734.0  -193590 74122 -264449 144981
The forecast has taken a logical approach of focusing on the steeper decline from 2005 through 2010 and predicting that this year (2012) is the last time SPSS will see use in scholarly publications. However the part of the graph that I find most interesting is the shift from 2010 to 2011, which shows SPSS use still declining but at a much slower rate.
Any forecasting book will warn you of the dangers of looking too far beyond the data and I think these forecasts do just that. The 2015 figure in the Popularity paper and in the title of this blog post came from an exponential smoothing approach that did not match the rate of acceleration as well as the ARIMA approach does.
Colbert Forecasting
While ARIMA forecasting has an impressive mathematical foundation it’s always fun to follow Stephen Colbert’s approach: go from the gut. So now I’ll present the future of analytics software that must be true, because it feels so right to me personally. This analysis has Colbert’s most important attribute: truthiness.
The growth in R’s use in scholarly work will continue for two more years, leveling off at around 25,000 articles in 2014. This growth will be driven by:
- The continued rapid growth in add-on packages (Figure 10)
- The attraction of R’s powerful language
- The near monopoly R has on the latest analytic methods
- Its free price
- The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).
What will slow R’s growth is its lack of a graphical user interface that:
- Is powerful
- Is easy to use
- Provides journal style output in word processor format
- Is standard, i.e. widely accepted as The One to Use
- Is open source
While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its capabilities. Regardless of which is best, GUI users far outnumber programmers and, until resolved, this will limit R’s long term growth. There are GUIs for R, but so many to choose from that none becomes the clear leader (Deducer, R Commander, Rattle, Red-R, at least two from commercial companies and still more here.) If from this “GUI chaos” a clear leader were to emerge, then R could continue its rapid growth and end up as the most used package.
The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata) but also by SAS Institute’s self-inflicted GUI chaos. For years they have offered too many GUIs such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use could not decide what to teach so they continued teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a good GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.
The use of SPSS for scholarly work will decline only slightly this year and will level off in 2013 because:
- The people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
- The people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
- The people who needed a more advanced GUI have already switched to JMP
The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.
Stata’s growth will level off in 2013 at a level that will leave it in fourth place. The other packages shown in Figure 7b will also level off around the same time, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata for fourth place.
The futures of Enterprise Miner and SPSS Modeler are tied to the success of each company’s more mainstream products, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes.
So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to follow the detailed blog at Librestats to collect your own data from Google Scholar and do your own set of forecasts. Or simply go from the gut!
I think the analysis is both simplistic and leaves out a very important competitor: MathWorks’ MATLAB. What SAS and SPSS, as well as MATLAB, bring is powerful computational capability in a standard, consistently configured form. This is important in regulated industries. Also, despite the great work Revolution and others are doing to endow R with capabilities for concurrent processing, the open source world, IMO, operates on fetishes. Like many fetishes, these are not necessarily well reasoned. A solution which pushes lots of data around to feed Python or Perl may not support the expression-level concurrency needed for fast and vast numerical vector operations. They also ignore other kinds of increasingly available concurrency, such as GPUs. (See the current issue of COMPUTING IN SCIENCE AND ENGINEERING, IEEE Computer Society, 14(3).) Finally, SAS is entrenched in much medical work. MATLAB is hugely respected by many engineers. R is very good, but there are many reasons why a single provider of a computational engine can do better than an open source bazaar. Think Macintosh:PC::MATLAB:R.
I would love to be able to include MATLAB in the Popularity article. However that paper focuses on software used for statistical analysis, predictive analytics, etc. and I believe that is a very small proportion of MATLAB’s market share. You can see that from their list of toolboxes. Here at The University of Tennessee our group supports areas including statistics, mathematics, engineering and supercomputing. I’ve worked with thousands of people using every statistics package on the market but perhaps only two or three using MATLAB to do that type of analysis. UT is certainly not a random sample, but if we could get one, I’d be surprised if a major portion of MATLAB use went to statistical analysis.
Regarding SAS’ entrenchment, I agree it is quite entrenched in many fields and will take years for R to make progress against even if these academic trends continue. People love the abstract concept of progress but resist actually changing their way of life!
That is interesting. I have just started to learn R and use RStudio, and I am a fan. It is much more sophisticated than what I expected.
It would be great to see the counts of purchases of each software package for each year. This might be possible from the companies’ financial reports. I think banks and pharmaceutical firms use SAS as it comes with a warranty and they are so entrenched with the systems. Kind of like the stranglehold Microsoft has. I think SAS makes sure their clients remain happy by providing training and support to justify the bills.
If I could get actual sales figures for SAS/STAT, SPSS Statistics, etc., I would certainly include them in my yearly updates to the Popularity article. However, companies won’t provide them. Annual reports show total sales, but especially now that SPSS is just a tiny piece of IBM revenue, that’s not very helpful. SAS and StataCorp are privately held and don’t provide such information. Counts of R downloads are not readily available either. Even if they were, one download might end up on 1,000 lab machines (as happens here at UT), and no doubt thousands of people have downloaded R but never got around to using it much. It would be great if R had code in it to optionally report IP addresses at startup so we could get a better estimate of use. So far though, nothing like that exists. So I just try to estimate market share from as many angles as I can in that article.
I agree that the companies do all they can to keep their customers happy. SAS remains very customer driven via the yearly SASware Ballot. I still use SAS and SPSS regularly when my clients prefer it. They’re top-quality packages that get better with every release.
Yes, impossible to measure really. Thanks for the blog, I am new to R and getting a lot out of the community on the net.
I also wonder if you considered uncertainties in your forecasting.
You can see both the 80% and 95% prediction intervals in the forecast output shown above (the Lo 80/Hi 80 and Lo 95/Hi 95 columns).
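If you prefer a picture, the forecast package will also plot those intervals directly; assuming the R_forecast object from the post:

> plot(R_forecast)  # shaded bands show the 80% and 95% prediction intervals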
With data which are counts, I am always tempted to take the log before doing an analysis. It makes the errors better behaved and ensures the forecasts stay positive. Six to seven thousand publications for SAS and SPSS in five years does not sound unreasonable. 150 thousand for R seems a lot, but given the data for SPSS in 2005, well, maybe.
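For anyone who wants to try that with the series above, here is a rough sketch of the log-scale approach (an illustration, not part of the original analysis):

R_fit_log <- auto.arima(log(R))         # model the log of the counts
R_fc_log  <- forecast(R_fit_log, h = 5)
exp(R_fc_log$mean)                      # back-transform the point forecasts to counts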
I would also like to see the same statistic trended for MATLAB …
On the one hand I agree that the analysis is simplistic, while on the other I think that considering Matlab for statistics is a bit of a joke. Matlab clearly has a strong following in engineering, but its statistics toolboxes are a toy compared with the coverage and depth that can be accessed from R. In addition, Matlab also has a series of numerical and design issues (see here for example http://abandonmatlab.wordpress.com/) and has nowhere near the support for the diversity of computing environments that you can get with R (I’d be happy to have a decent Mac version, for example).
Now, if you wanted an open-source alternative among scripting languages, you could get good numerical performance for stats in Python (using NumPy + Pandas, probably faster than R, but with a much smaller coverage of methodologies, slowly increasing) and even get access to GPUs using Theano.
SAS’s entrenchment in medical statistics probably comes from (i) historical reasons (it has been ‘out there’ for a very long time) and (ii) the silly belief that the FDA *requires* SAS, which is patently false.
I only know of one statistician that regularly uses Matlab, while I know many people using R, SAS, Stata, SPSS, Genstat, etc. The one using Matlab has started switching to R too.
Tool usage seems to be terribly field-specific; unfortunately for me, MATLAB actually is the standard for statistical analysis in neurophysiology, psychophysics, fMRI, etc.
Links without further comment:
http://www.agr.unideb.hu/~huzsvai/R/Marques.pdf
http://www.mathtools.net/MATLAB/Statistics/index.html
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2917119/
Did you mean to compare that list of 47 MATLAB tools for statistics with the 5,300 ones for R?
I wasn’t counting noses. I was simply suggesting that MATLAB is not the statistics wasteland that was implied in a post. I also believe that while R has important packages and contributions for both spatial statistics and for time series, MATLAB has more comprehensive ones. (Consider MATLAB’s http://www.mathworks.com/help/toolbox/map/f7-12036.html.) Their documentation is better.
I write all this as primarily an R user, and I heartily acknowledge R is an exciting ecosystem.
I also think it is important to separate out the undertow that weighs on these kinds of discussions. There will be people who will oppose MATLAB, SAS, SPSS, and even S simply because these are not open source. In the numerical and quantitative worlds, it is not a given that open source tools are inherently superior. (Consider FORTRAN compilers, for example.) I think that is a completely different discussion from whether or not R is more popular or useful than, say, SAS. It is also unfair, as there is also an open source ecosystem for MATLAB. I don’t know if there are comparable ones for SAS or SPSS.
Also, I feel that using a measure of a language for teaching to predict adoption in the commercial marketplace is a huge leap. I know getting numbers on the latter is hard, but that doesn’t make the teaching measure any better.
Finally, there are several R packages, such as fda, which have implementations both in R and in MATLAB, and their MATLAB versions are better.
I saw the earlier article, and meant to comment that I think the whole premise is faulty. Why should Google Scholar hits be presumed to be a marker for the popularity of a given software package? In general, the world is a lot wider than academia. Even within academia, there are many people who don’t publish their methods, or don’t even mention the software they used.
Leaving that aside, I would say that, like Apple aficionados, R users tend to be more exclusive of other products and more interested in promoting the advantages they perceive. Whether or not these exist to the degree they see them is another question. But it does mean that R devotees are more likely to mention the software (and indeed to publish about it) than, e.g., SAS users.
Each measure that is publicly available has its problems. That is why I report on every one I can find in the original article. When I look at all those measures, I can’t help but think the use of R is growing at high speed. While the world is far wider than academia, academia is where people learn statistics. I would love to have data on the percent of people still using the first stat package they learned in college. I’ll bet it’s well over half. Still, as I mentioned, I don’t expect the use of R to catch that of SPSS even in academia unless R gets a GUI that’s as comprehensive as the one SPSS has. Your suggestion that R users may report their use more than users of other software is possible, but why would SAS and SPSS users become less likely to report their use over time?
This might actually be driven by a trend where publications are less likely to mention their data analysis program.
What the author needs is one more line expressing either the total number of papers reporting statistical analyses or the subset of those papers that do not report any statistical program.
The increase in reported use of R (or Systat, etc.) does not account for the drop in SPSS and SAS. So, are there fewer papers with statistical analyses? Does Google Scholar have a 5-year lag in its search results? Or are more papers just saying ‘we ran a 2×2 ANOVA’ without crediting SPSS?
TL;DR – sometimes statisticians need to think more like scientists
(taken from the top comment on reddit: http://www.reddit.com/r/statistics/comments/torf8/will_2015_be_the_beginning_of_the_end_for_sas_and/)
Good questions.
Q: The increase in reported use of R (or Systat, etc.) does not account for the drop in SPSS and SAS. So, are there fewer papers with statistical analyses?
A: Yes, there are fewer papers. Since Google Scholar focuses on academic papers which are, in turn, driven by government grants, there are fewer papers that cite any of these packages. I assume fewer papers of all kinds, but I don’t have data on that. 2006 was the peak year in this data. If this had not happened, the decline in SAS and SPSS would not have been as fast, but the rise of their competitors would have been even faster. The relative position of each package within a year would probably have been the same, though it’s easy to come up with hypotheses that would suggest otherwise.
Q: Does Google Scholar have a 5-year lag in its search results?
A: As soon as a journal is available electronically, Google Scholar seems to have it. I usually update the main article, http://r4stats.com/articles/popularity, in late March or early April when the previous year’s data is complete.
Q: Or are more papers just saying, ‘we ran a 2×2 ANOVA’ without crediting SPSS?
A: Possible, but why would SAS and SPSS users start doing this while R, Stata, JMP, etc. users did not?
I’ve got a very long list of more analyses I would like to do and much more data to collect. My limitation, I hope, is not an inability to “think like a scientist” as you suggest, but rather the time it takes to do all the work!
To the degree students of quantitative subjects follow the new ethic of publishing both data and computer codes used to derive results as well as their papers, it should be possible to monitor progress and development of relative adoption rates. In the case of commercial software, the same authors would need to carefully specify the version of the software as well as the host.
Another dimension to all this is that to the degree some of the decentralized processing systems like Hadoop or Cassandra are used to process data, I wonder if anyone has addressed the issue that there will be some variation in the actual results due to the non-commutativity of calculations with finite-precision arithmetic when done on several processors. For example, when Revolution’s multithreaded, small-scale concurrency runs R, do they report some kind of “tolerance fuzz” which bounds these effects?
A lot of good articles on the SAS language and platform go into SUGI or SAS annual conference papers. Would those be counted as academic work by Google Scholar?
That’s a very interesting question Ajay. I held my breath as I did my first search. Yes, sas.com is indexed by Google Scholar and it found lots of SUGI and Global Forum papers.
I’d say this is more wishful thinking than analysis, on the part of R users.
More R users would be great, but I don’t agree that SAS and SPSS will become less popular.
The analysis is distorted because of the different norms of methods writing across disciplines. Economics papers are unlikely to cite statistical software unless it’s about developing a new estimator, and even then it might just say that they have R or Stata code available. The slew of SPSS citations likely comes from medicine and psychology, where they have to cite whatever software they used.
I agree that could bias the results in a given year. But how would that influence the trends across time?
One important point that your forecast doesn’t take into account is the influence of fanatic R bloggers. I have been using SAS for quite some time and I have rarely come across blogs which passionately talk about SAS.
However, the case with R is completely the opposite, and the blog count just keeps going up. If we could add this factor into the forecast, then I am sure the decline would start even sooner.
Best
Rebecca
Very thought provoking. It should be pointed out that JMP does not “run SAS.” It is a wholly distinct product, as is clear from its history. SAS co-founder John Sall was fascinated by the new Macintosh interface and decided to develop software to exploit it. The software became known around the SAS water coolers as “John’s Macintosh project.” That was quite serendipitous for SAS because when they first released JMP in 1989, the marketing almost wrote itself.
JMP has the ability to run completely stand-alone as you point out. However, it also integrates tightly with SAS. Here are a couple of links to SAS documents that show a number of ways that you can now use JMP to “run SAS”:
http://www.jmp.com/software/pdf/103789_sas_programming.pdf
http://www.jmp.com/academic/pdf/learning/12using_jmp_to_generate_sas_programs.pdf
http://www.jmp.com/academic/pdf/learning/12entering_and_running_sas_programs.pdf
Why do you think Stata’s growth will level off?
To clarify, I think they’ll all level off. I think Stata will level off at a lower point than SAS, SPSS and R for a number of reasons. SAS and SPSS have a dominant position now, and they advertise heavily compared to Stata. Plus they don’t have the data-in-memory requirement that Stata does. Stata and R share a number of important attributes, such as being easily extensible. As a result, R may be more of a threat to Stata than to the other two.
SAS’s great strength for those data mining practitioners with many years of experience is its ability to “pound” all the data together into a meaningful analytical file. This is the reason why so many of us are biased towards SAS. In a way we feel limitless in terms of what we can do as long as we have data. I am still trying to discover this capability in R, but with great difficulty.
Books do seem to be rather sparse on the complex data management tasks that are such important prerequisites to analysis. I cover those steps extensively in R for SAS and SPSS Users and R for Stata Users. However, SAS does have the ability to manage massive amounts of data. At the moment, Revolution Analytics’ version of R is the only one that can handle terabytes of data.
Richard, we so appreciated this comment and would love your feedback on our PR approach for SAS and open source R. Customers like you are quite influential with R bloggers and tweeters and those reading the posts. We’ve chosen not to enter the R fray publicly since we see SAS and R as complementary. Is there more you’d like to say on SAS’s behalf the next time we announce SAS news or post blogs on the topic? I ask because of your experience and clear communication on the topic.
In your research you never mentioned PSPP, the free SPSS clone. Statistics on its use could add to SPSS’s popularity. The latest release looks much more stable than before.
I have high hopes for PSPP. I looked at it a few months ago and its ANOVA routines were limited to oneway. That’s a deal breaker for me. I wish the project well though.
Interesting article, R has definitely grown in popularity.
Something else to consider would be IBM acquiring SPSS Inc.; SPSS therefore no longer refers to a single product but rather a suite of products, including ‘SPSS Statistics’, which was formerly known as ‘SPSS’.
Yes, you could hypothesize that the leveling out of SPSS in 2011 was due to more products. The list of all 23 SPSS products is here:
http://www-01.ibm.com/software/analytics/spss/products/statistics/
and I think that only Bootstrapping, Data Prep and Neural Networks were added since IBM bought them. But I doubt that just three new products account for the shift. My best guess is that competition accounted for the main drop since 2005, but that SPSS’ excellent GUI and devoted user following were bound to get that to level off eventually. It’s R’s lack of such a good GUI that I think will get its growth to level off eventually. There are just way more non-programmers who use statistics than there are programmers.
It will be interesting to see if IBM can get more out of all their recent analytics purchases by getting them all to work well together. If they do, that could also get the curve to level off or even start to climb again.
I think you misunderstood my point.
SPSS (the stats package) is no longer called just SPSS; it was renamed to PASW Statistics and then to SPSS Statistics, in an attempt to help differentiate between SPSS the company and SPSS the product. IBM retained SPSS for the brand, however they did so for all SPSS products, e.g. SPSS Data Collection and SPSS Modeler (which are separate applications altogether).
With any name change, it is difficult to get everyone to change how they refer to the new name, but I thought it worth making you aware of this if you redo the analysis in the future.
I’m well aware of the names of SPSS products and how they’ve changed over the years (http://r4stats.com/misc/spss-catalog/). If you try out Google Scholar yourself, I think you’ll find that those products that include “SPSS” in their names but which are not SPSS Base or its add-on modules account for a tiny fraction of the hits.
Details of how we did this search are at:
http://librestats.com/2012/04/12/statistical-software-popularity-on-google-scholar/
There you can replicate our steps exactly while trying any combinations you like.
Running the programs is easy but dredging through the resulting queries takes a very long time to see what all comes up for strings like “PASW”. To quote that other blog post, “We looked at PASW excluding SPSS and got a small number of messy hits. They included Plant Available Soil Water and Pluent Abdominal Segment Width.”
One thing which would be interesting to enter into this are some of the other powerhouses that Rexer has been pointing out: Statistica, RapidMiner, KNIME, Weka, Salford Systems. These all, according to the Rexer survey (despite its admitted quirks) appear to demonstrate a more level playing field between them and the frighteningly large elephant in the room.
My experiential bias would lead me to posit a number of elements which would affect the results of your question: 1) Ease of use, 2) Total Cost of Ownership, 3) Extensibility, 4) Scalability, and 5) Rapid turnover of results.
WRT #1 – I think there has been an interesting emergence of self-titled ‘data-ists’ (scientists, analysts, you name it), who could be following leads to get a foot in the door, as there is not really a shortage of these jobs, just a shortage of folks with the skill sets needed to impact a business. I’ve noticed an increasing number of these ‘ists’ in the workplace. The hiring managers have no idea what they really need, and candidates who grossly inflate their expertise in the field often produce little more than fluffy reports. This is not to say a portion of this cohort cannot or is not willing to learn advanced analyses, however. But in order to build a program internally where more direct access to methods is available (assuming they understand conceptually *why* they’re doing what they’re doing), GUI-flavored instruments seem to have been given preference. Whether this kind of shift in the analytics space of some institutions is simply dangerous or an opportunity depends on having leaders within the institution who are expert in the field.
WRT #2: well, cost is more and more a factor. In some cases, I’ve seen a ‘comfort’ in choosing based on ‘brand’. I make this somewhat equivalent to a choice between buying two shoes, exactly alike in form, function, and service, but differing by a 3x price. One may be more ‘comfortable’ purchasing the more costly shoe, not realizing the potential stream of effects down the road. For those who have become more savvy in this arena, the ‘bling’ and the ‘novelty’ have begun to wear off, and the actual costs of producing meaningful information are now being monitored more closely (i.e. analytic governance teams, etc.). When the ‘value added’ of a platform becomes clear, folks begin to think about alternatives. Which BTW makes me wonder why you didn’t consider Statistica in your analysis, given Rexer’s results.
WRT #3: Well, for each platform there are features we may all like, and ones we may wish were there. Where we are looking for something extra, or perhaps novel, we supplement where an analytics platform alone is a constraint. But what about the analytic lifecycle and workflow? Does platform ‘j’ accept or output well to your extension? The more proprietary the analytics engine and the narrower its field of view toward other software packages, the less natively extensible the platform is. That can force someone to plunk down a lot of extra cash for yet another proprietary package, and the less willing forward-looking analysts, scientists, and savvy managers will be to lock themselves into what may constitute a horrendous capital expenditure with minimal, if not negative, ROI.
WRT #4: There are different grades of maturity with analytics groups across institutions of any sort. One indicator is the breadth and scope of data an institution captures. For those that are more mature, data capture and analyses often bring up more questions, which kickstarts a cycle of more robust data collection. As the data get bigger, traditional approaches to analysis fail to work. In order to handle this, big companies like SAS begin shoving customers to proprietary SAS warehouses. Certainly a great idea, but not great for everyone. With the advent of distributed storage and computing available in the ‘smaller’ guys as well as the giants – the ROI for big data work becomes even bigger for SAS/SPSS competitors.
Finally, WRT #5 – This relates to everything noted above, but for me is a bit of a confound in how institutions are choosing analytic platforms. Immature analytics environments rely almost entirely on reporting, reporting, and more reporting (I’m thinking of all the papers flying around the Wall Street of yesteryear). As they begin to grow some advanced analytics capabilities (if they do), the executive expectation is that research will produce results as quickly as a simple report. Simply put, the shift to advanced analytic methods can initially cause dissonance between the analyst and the business because of the generally temporary but protracted timelines for meaningful output. Rapid turnover for these budding groups can be sought a number of ways, and GUI-based or otherwise easy-to-use interfaces help guide this. Although code-based systems aid in understanding why someone is doing what they do (and I am thus an advocate of things like open-source R, Python, SAS, and others), adding the ramp-up time of learning a coding environment for analysts who most need to understand the conceptual what, how, and why of the analysis itself has, in my experience, been an effort in futility. Those that bought the Manolo Blahnik analytics platform may indeed find that their system has been reduced to a critically pricey reporting tool, given the ease-of-use and rapid-turnover expectations mentioned above.
This is where the veteran data-ists ( the real McCoys – statisticians, analysts, scientists, etc) really have to continue to maintain industry savvy and objectivity with impacting the business, which may often entail putting on the proper sunshades to filter some of the ‘bling’ of the million dollar shoes.
Hi,
I am from industry. I work for a Fortune 500 company; we use SAS for CRM, warranty, and other business analytics solutions, and I have been working on such projects at both the support and maintenance level for the past 10 years. My assessment is as follows:
1.) Companies (the bigger ones) don’t choose software packages for technical reasons, but mainly for post-production support, upgrades, etc., which is a big headache for IT top brass; going open source might make them feel very insecure.
2.) This trend has changed recently: the recession, huge IT budget cuts, and layoffs have pushed younger managers into top positions, and they have raised new CRs and new projects to be implemented in open source, i.e. R.
3.) Open source is not an evil that corporations are scared of, but they need good enterprise backing to adopt it, the way Java’s backing by IBM and Oracle has made it big in enterprise business solutions.
4.) If R can co-exist with Java and existing SAS implementations, such that newer CRs and other functionality can be built in R, then R will slowly seep into the enterprise data centre.
5.) Big enterprise support is needed for R; with solid backing for patches and support, and good marketing, R will replace SAS and other commercial packages (like IBM pushing Java from 2000 to 2010).
6.) R’s memory model isn’t as big a problem as others have stated, since enterprise solutions generally use a database such as SQL Server, Teradata or Oracle; if R can improve its plug-ins for managing data in such big databases, then seamless integration would establish R as a leader in analytics solutions.
Hi Amumathii,
Thanks for your thought provoking comments. My responses are below.
Cheers,
Bob
1.) Companies (the bigger ones) don’t choose software packages for technical reasons, but mainly for post-production support, upgrades, etc., which is a big headache for IT top brass; going open source might make them feel very insecure.
Bob: I agree that in a production environment, where downtime can quickly become expensive, support is important and worth paying for. Companies such as Revolution Analytics and XL-Solutions provide that type of support (Disclosure: I work with Revolution Analytics teaching workshops). Other companies such as Oracle, SAP and Teradata are integrating R into their tools and I assume they provide support as well.
I’ve spoken with companies that were more interested in getting rid of the yearly license code renewal headache than they were in saving the cost of license fees. That came as quite a surprise.
2.) This trend has changed recently: the recession, huge IT budget cuts, and layoffs have pushed younger managers into top positions, and they have raised new CRs and new projects to be implemented in open source, i.e. R.
Bob: Our IT budget has been cut repeatedly in the last few years and it has definitely led to such discussions. While it’s hard to completely eliminate older proprietary solutions, it’s much easier to use R to replace add-on modules.
3.) Open source is not an evil that corporations are scared of, but they need good enterprise backing to adopt it, the way Java’s backing by IBM and Oracle has made it big in enterprise business solutions.
Bob: I agree. See my comments on 1) above.
4.) If R can co-exist with Java and existing SAS implementations, such that newer CRs and other functionality can be built in R, then R will slowly seep into the enterprise data centre.
Bob: R is definitely seeping into lots of companies. Here’s a list of companies whose employees have attended my workshops, which focus on using R in combination with or to replace SAS, SPSS and Stata:
http://r4stats.com/workshops/workshop-participants/
It’s fairly easy to call R from SAS as I describe here:
http://r4stats.com/articles/calling-r/
5.) Big enterprise support is needed for R; with solid backing for patches and support, and good marketing, R will replace SAS and other commercial packages (like IBM pushing Java from 2000 to 2010).
Bob: The only comprehensive one that’s on the market now is Revolution Analytics’ version of R. A table that compares the features of their version to the standard version is here: http://www.revolutionanalytics.com/why-revolution-r/which-r-is-right-for-me.php.
6.) R’s memory model isn’t as big a problem as others have stated, since enterprise solutions generally use a database such as SQL Server, Teradata or Oracle; if R can improve its plug-ins for managing data in such big databases, then seamless integration would establish R as a leader in analytics solutions.
Bob: I agree that concerns with R’s memory limitation have been overblown. Statistics is very good at generalizing from a sample of just a few thousand to a population. Companies that have millions or even billions of records store them in a database that is quite capable of selecting an appropriate sample for use with R.
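As a rough illustration of that workflow, here is a sketch of letting the database draw the sample and pulling only that sample into R. The package calls are real, but the database, table and column names are hypothetical, and the random-sampling syntax shown is SQLite's; it varies by database:

library(DBI)
library(RSQLite)

# Connect to a (hypothetical) database file
con <- dbConnect(SQLite(), "transactions.sqlite")

# Let the database draw the sample; only 10,000 rows come back to R
sample_df <- dbGetQuery(con,
  "SELECT * FROM transactions ORDER BY RANDOM() LIMIT 10000")

summary(sample_df)
dbDisconnect(con)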
In addition, there are several efforts underway that break the memory barrier including the Big Memory Project (http://www.bigmemory.org/), Programming with Big Data (http://thirteen-01.stat.iastate.edu/snoweye/pbdr/) and Revolution R Enterprise (http://www.revolutionanalytics.com/products/enterprise-big-data.php).
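For example, a minimal sketch of the file-backed approach using the bigmemory package (the file and column names are made up):

library(bigmemory)

# The matrix is kept in a file-backed structure on disk rather than in RAM
x <- read.big.matrix("huge_file.csv", header = TRUE, type = "double",
                     backingfile = "huge_file.bin",
                     descriptorfile = "huge_file.desc")

# Individual columns can still be pulled into memory for analysis
mean(x[, "response"])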
Thanks for the very interesting article and thread. I have a small company that does charts and statistics on local real estate markets and am looking to add forecasting capability. Have used R to investigate the data and develop models.
The issue is how to integrate R into a Microsoft .NET infrastructure. There are a couple of free/inexpensive approaches (R.NET and R(D)COM) that seem iffy. Revolution could do the job (the Enterprise version supports JSON integration), but the budget does not support it. Right now, Statconn is looking like the best option.
In any case, the article and thoughtful comments and responses have been helpful.
The problem at hand is multidimensional; the world is complex: academia is only one part of it.
I use the abbrev. SAS for the system and SI for SAS Institute, J.G. for Dr. Jim Goodnight.
1. SI never acted to terminate competitors, so it left Minitab, Stata, etc. in their niches and set out on a growth path of its own (revenues, employees, etc.).
People like to work at SI. Customers of SI show high satisfaction rates.
People developing R-packages draw their compensation from other sources.
If you set your curves and the SI revenue figures side by side (see http://www.sas.com/company/about/statistics.html), you can’t discern a visible effect of R’s growth on SI revenues.
I know the history (I have done real work with BMDP, APL, SAS, SPSS, SYSTAT, etc.).
But what is SI/SAS today? Clearly not a statistics-package developer and vendor.
As such you cannot ‘feed’ 13,000 employees.
Anecdote: Last week I talked to a student of mine who works at a large bank, and he told me that his colleagues in ‘Finance’ use SAS on a daily basis for data collection and transfers but were astonished to hear that you can do multivariate stats with SAS.
In finance worldwide, some traders use MS Excel sheets for decisions and their gut feeling. There are resources on the web on statistics and Excel (they need not be repeated here).
I do not know what kind of hardware and stats software is being used in computerized millisecond trading (I would suspect it’s not SAS).
However, when either of these two banking operations has just lost a billion or two (it happens regularly), the bank or financial conglomerate will easily invest some millions in new hardware and some hundreds of thousands into SAS software such as
SAS Risk Dimensions
SAS Risk Management for Banking
SAS Risk Management for Insurance .
Those quant guys (financial statisticians) that use R to find a new sell/buy strategy will not be asked and are not involved until the new strategy’s implementation has to be back-tested.
Then the whole cycle repeats and repeats …………
I could describe similar scenarios from the pharmaceutical (life sciences) or other industries.
These are major reasons why SI has developed a large and ever-increasing number of, say, application systems; see http://support.sas.com/software/.
2. Apparently J.G. talks more frequently and more intensely to company CEOs than he does to the declining group of SAS-professors or to the strongly growing group of R-professors – the two words just being used for brevity. You can read here (http://www.theaustralian.com.au/australian-it/the-world-according-to-jim-goodnight-blade-switch-slashes-job-times/story-e6frgakx-1225888236107) how it came about that J.G. or SI invested many millions of dollars and redirected highly capable SI staff to re-invent and re-work large elements of SAS (including SAS/STAT) in order to get some type of VaR calculation down from 25 hours to maybe 1-2 hours. Who of us (professors) will ever buy or even command a blade server for $500,000? Yet companies are willing to invest $4 million in hardware, and J.G. offers them big savings while reserving only a small part for increased SAS licenses and to secure the future of SAS.
3. If it could be determined (has anyone heard rumours at a recent SAS conference?) that doing the same development for GPU-type hardware is linked to the strategic future of SAS, J.G. would repeat that procedure. However, clearly, the same competent SI staff then could not work on research projects or on new SAS/STAT procedures. http://decisionstats.com/2010/09/16/sasbladesservers-gpu-benchmarks/.
4. At our place we reap considerable benefits from the fact that we use SAS for courses and projects in statistics, operations research, simulation, data mining, etc. For more than 20 years now, without interruption, our graduates, whether with a Bachelor’s, a Master’s degree or a PhD, have found better-than-average entry into the job market. My personal impression is that the shortage of SAS-educated people has increased in recent years, just as the number of people with only surface contact with SAS seems to have increased. I doubt the latter can be substituted for the former. Many people at SI apparently have never in their lives used a SAS/STAT procedure for non-trivial but everyday statistical work.
SI seems to be very slow when reacting to the basic trends that you describe and that I can’t deny.
Presumably their own (impressive) revenue-figures generate a different reality.
5. The question is: to what extent is pre-existing SAS knowledge and experience from university really needed to secure the future (licenses to medium and big companies all around the world) of SAS and SI?
If it is important then SI and big SAS-shops should see shortages when recruiting even today.
And they should and very easily could take steps – the money being available.
One local activity ‘at home’ is described here : http://www.sas.com/success/ncsu_analytics.html.
But they should realize that it takes quite a long time to grow ‘new’ SAS-professors from the R-pool. What I see today is that out of 4 retiring SAS-professors, the replacements are at least 3 R-professors. This applies to statistics in a wide sense.
I should guess that it’s impossible that J.G. or J.P.S. never have talked with someone about these figures. So it could be that SI has ‘freed’ itself from its former roots in Statistical Analysis and confidently can trust that IT-managers and CEOs all over the world will decide even if they never had hands-on SAS stats experience.
I’d like to read your remarks about those two possibilities.
Hi Werner,
Thanks for your interesting comments. Here are my thoughts:
1. Regarding the satisfaction of SI’s customers and employees, both are indeed impressively high. I’m a big fan of the fact that SI uses research themselves in the form of the annual SASware Ballot. They’re truly customer driven. Their willingness to pump 25% of revenue into R&D is amazing. I don’t think any publicly-held companies match that. JG was wise to stay private.
I also agree that they’re more of a solutions company than a stat package company at this point. I’d love to see their revenue broken down into: SAS, Enterprise Miner, and Solutions. I suspect solutions are the majority of revenue by now.
2. I hadn’t seen that interview with JG; thanks for sending it. As impressive as that speedup is, those types of changes are happening consistently as researchers slowly convert algorithms to take advantage of cluster computing. Some of the cutting-edge work is being led here at The University of Tennessee by Jack Dongarra’s team. Their PLASMA project focuses on optimizing algorithms for multi-core processors and their MAGMA project does the same for GPUs. The results of both projects will eventually replace the algorithms in their LAPACK library. Since LAPACK is open source, those algorithms end up inside much other software, including R and MATLAB. I think it very unlikely that SAS, or any vendor, could have a speed advantage for long since they’re competing with a huge number of volunteers who typically do their work as open source projects.
3. Regarding GPUs, I’m very interested to see how they do in the future. I have a sticky note on my monitor that says you can keep a CPU up to 95% busy in routine HPC calculations, a many-integrated-core (MIC) chip (e.g. Intel’s Phi chip) 65% busy but GPUs are hard to get above 50% busy. So I wonder if MICs will beat out GPUs in the long run. Either way, we win! (Sorry, but I didn’t keep the reference for those figures.)
4. You comment about how many people at SI use SAS. A friend who works there said that he was surprised at first to see how few people at SI actually use SAS. Since they write it, they use C much more often! I’m sure their consulting arm members use SAS. Knowledge of SAS is still in very heavy demand as you can see from the jobs graph in The Popularity of Data Analysis Software.
5. I definitely see the trend you mention regarding what professors teach. The older ones use SAS and they’re retiring one by one. Those under 45 or so use R. Students would be wise to learn both to maximize their job opportunities. However, given the high level of freedom that professors have to teach whatever they want, R may well dominate what is taught in academia eventually. That can’t be good for SI.
Cheers,
Bob
I’m curious why there is no mention of S-Plus?
Hi Bill,
I used to track S-PLUS until the usage figures dropped too low to be interesting. I’m still signed up for its main discussion list, but there are perhaps a dozen posts on there in a year versus thousands for the others. I find it very interesting that Tibco has developed their own version of R, Tibco Enterprise Runtime for R. I would be surprised if they continued to develop S-PLUS for very long now that they’ve done that.
Cheers,
Bob
Matlab is heavily used in Machine Learning. Unfortunately. It’s an ugly language that might not have been bad in 1980. (Just as SAS wasn’t bad for 1970 and punched cards.) There is an open-source (GNU, actually) version of Matlab called Octave that is highly compatible with Matlab (unlike PSPP and SPSS), so if you do decide to track Matlab, you should probably also track Octave.
As for the shape of each package’s growth curve, I think it really depends on which usage cases grow more than others. A single-user professional (economist, statistician, etc.) working at their desktop or on a laptop will love Stata, and if that kind of market is the key, Stata could easily surpass SAS. A data miner, on the other hand, is going to prefer SAS, SPSS or R because of their ecosystems/infrastructures. A Data Scientist (TM) might prefer R, Python, and Java (Weka?) because they want to easily hook into a huge Hadoop back end. (Though SAS and others are smart enough to see that and are heading in that direction.)
And so on. I have used R for years, and recently got Stata. And I’m torn between the two. Stata has a better interface and more built-in options, along with very in-depth documentation (PDFs). R is way more flexible, and has packages to do about anything related to statistics, machine learning, or anything related. I’ve found a couple of places where Stata gives better results than R for me (ARIMA, for one), and if I’m doing something that doesn’t involve going off of the rails, I actually prefer Stata. But if you just want to put a bunch of things into a graph, or use a different package, or program something quickly, R is way nicer.
How About Tableau Software + R?
Hi Oliver,
As you probably know, Tableau has an interface to R, so I expect the two work very well together. In fact, there’s only one package I can think of offhand that lacks an interface to R: Stata.
Cheers,
Bob
Hi. What do you think about Doronix Math Toolbox?
http://www.doronix.com/statistics.html
I’ve found it recently. Haven’t used it much.
Hi James,
That’s the first I’ve heard of it. It looks pretty sparse.
Cheers,
Bob
SAS is going to be around for a long time. As other commenters have noted, academia is only one slice of the statistical consumer world. When you consider the contexts in which these tools are used, e.g. healthcare, you quickly realize that in this day and age of medical record security, metadata management, performance (large data), ETL of your data, data quality, report delivery methods and mechanisms, secure integration with other applications (MS Office) and a host of other very important factors, you need a robust system to manage this environment, and that is where SAS’s BI system environment becomes very important. It provides an integrated working environment. And in the business world that is just as important as the tool itself. So while all these new tools are making inroads into the corporate world, they’ll have to navigate the same hurdles that SAS has already cleared.
Hi Mark,
I agree that SAS will be around for the long term for many of the reasons you mention. Those advantages are not very important to an academic environment. In fact, they add layers of complexity that academics don’t want to bother with.
Cheers,
Bob
If a similar search was done in 1996 it would probably have shown most published papers that published a code file used Stata or Matlab. Reasons: (1) Cost. (2) Easier to package as an executable file from a library. (3) It was expected by academia. A few years later R was big.
But SAS remained strong. The fact that it could handle massive amounts of data so quickly and could do descriptive statistics, charts, and data prep so well was important. SPSS gave point-and-click ease and nicely formatted output.
If R can handle vast amounts of data fast like SAS, provide data prep and manipulation like SAS, and good output, then it could challenge SAS in the commercial space.
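As a rough illustration of the data preparation side in R, here is a sketch using the data.table package; the file and column names are made up:

library(data.table)

# fread() reads large delimited files far faster than read.csv()
claims <- fread("claims_2011.csv")

# Fast group-by aggregation, similar in spirit to SQL or PROC SUMMARY
claims[, .(n = .N, total_paid = sum(paid_amount)), by = region]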
Before I start using a new algorithm available through R, it has to be tested and validated by experts and users. This reduces the number of useful algorithms available through R.
Hi Vijay,
It is indeed important to check the source of R algorithms, just as you would do for a SAS or SPSS macro that you got from someone’s web site. New algorithms are usually developed by university professors using R. They go through the peer review process to get published. Then SAS/SPSS/Stata programmers read the journal article and implement those algorithms. So who should you trust more, the algorithm’s inventor or the commercial programmer? I think they are of similar quality.
Cheers,
Bob
Hi,
I love to read all your comments and different thoughts about changing world of Analytics Tools…
Here are my thoughts:
I have been using SAS/Base, SAS/STAT and SAS/EG for my development for four years. I have read so many things; of course this is a hot topic for data science. I agree that R has impacted the popularity of SAS across the world, but it has not completely shut down SAS’s business. It is like an ocean, and no software can close the business of the others, though yes, it has had some impact on popularity.
There are also many other software packages which you have not taken into account, like SHAZAM, EViews and so on. They also have an impact on SAS as well as R, but all together they can never shut down the business of the others.
It purely depends on companies’ choices and individual views.
You are an R developer, so naturally you will argue R is better than any other tool; I am a SAS developer, so I will say SAS is better.
I believe at least one software package is necessary to carry out our analysis, no matter whether it is R, SAS, SPSS or even Z (perhaps someone will release it). And we can learn any new technology within a short period of time. Even at SAS R&D, we hired one lady who was an R developer, and now she works on SAS/STAT.
So the moral is: R is gaining market share as well as popularity, but that does not mean SAS will shut down in 2015 or the near future because of R’s popularity and its market.
-Urvish
Hi Urvish,
I agree with you, and said so in the article.
Cheers,
Bob
Hi Urvish,
I agree with you.
All of this depends on software capability, cost, and the availability of skilled people.
Cheers,
Gnaneshwar
2015 is only a couple days away. Can we get an update on what the data looks like now? 🙂
Hi Foo,
The most interesting data has a multi-month lag time as journal articles are published on paper, then put online, then indexed by Google. I usually update the figure at the end of March, though I know some journals still aren’t online yet!
Cheers,
Bob
Bob what is your position on PALANTIR for data analysis? Is PALANTIR compatible with these other technologies and if so, what do you see as the impact to market share and usage?
Hi Tom,
Sorry, but I’m not at all familiar with Palantir.
Cheers,
Bob
It would be interesting to include the combination of JMP (and/or JMP Pro) with SAS in the analysis. JMP and SAS are complementary, and the integration of the two provides a large scope, extensibility, and an excellent GUI for running both JMP and SAS procedures along with data exploration, visualization, and interactive graphing. It seems possible that this combination could compete well with other contenders, including SPSS, Stata, and even R.
I enjoy looking through an article that can make men and women think.
Also, thanks for allowing me to comment!
Hi Marion,
I’m glad you enjoyed it.
Cheers,
Bob
When is the 2016 edition of this article coming out? I am sure that with SAS being named the #1 skill to have by Time.com, this is the year they are doomed to disappear 😉
http://time.com/money/4328180/most-valuable-career-skills/
Why choose between R, SAS & Python when you can have them all!
Hi Alberto,
I’ve been collecting the latest data & it’s still shifting around for 2015. R and SAS are looking very close. I’ll write about it as soon as the incoming data on journal articles for 2015 settle down.
Cheers,
Bob
SPSS and Stata need to make a move and be able to work with big data (much bigger than memory) easily. If not, they will be replaced by new tools mixing good GUIs such as Tableau or Qlikview with specialized underlying distributed frameworks such as Spark or SciDB.
But R is not the alternative. R is free, but it lacks a good GUI for manipulating data and plots easily, and it is also not able to work properly with very large datasets. There are some packages that do some simple tasks, but it’s impossible to do things such as fitting a mixed-effects regression or a Bayesian model on a 1 TB dataset.
Another problem with R is that it’s chaotic: too many packages performing similar tasks, each using different syntax.
Even leaving big data aside, R is going to be replaced by Julia, which is much faster and easier.
Hi Juan,
You make some excellent points. Stata and R both must usually fit their data into the computer’s main memory, making use with Big Data a problem. However, R does offer good interfaces to the types of software that excel at working with Big Data, such as Apache Spark and H2O. SPSS has always had the ability to work with data that exceeds the size of your system RAM. It may not be ideal for work on petabytes, but it can handle many gigabytes of data.
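As one illustration of that kind of interface, here is a rough sketch using the sparklyr package to push the heavy lifting to Spark; the file path and column names are placeholders, and SparkR or the h2o package offer similar workflows:

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster URL would go here in production)
sc <- spark_connect(master = "local")

# Read a file too large for R's memory directly into Spark
orders <- spark_read_csv(sc, name = "orders", path = "orders.csv")

# dplyr verbs are translated to Spark SQL; only the small summary comes back to R
orders %>%
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  collect()

spark_disconnect(sc)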
Two trends are making memory less of an issue. First, buying a computer with 1 terabyte of RAM costs only $10,000 which is far less than the cost of some proprietary software like SAS. Second, renting such a machine from Amazon or Microsoft for a few hours to solve a large problem is quite cheap.
R’s package chaos problem is certainly real, but I would prefer to have such a wealth of riches than be stuck with a system that does far less, but is easier to learn.
I too look forward to seeing how far Julia will go. It’s certainly fast!
Cheers,
Bob