I recently updated my plots of the data analysis tools used in academia in my ongoing article, The Popularity of Data Analysis Software. I repeat those here and update my previous forecast of data analysis software usage.
Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. As you can see in Fig. 1, the use of most analytic software is growing rapidly in academia. The only one growing slowly, very slowly, is Statistica.
While SAS and SPSS remain dominant, their use has been declining rapidly in recent years. Figure 2 plots the same data, adding SAS and SPSS and dropping JMP and Statistica (and changing all colors and symbols!).
Since Google changes its search algorithm, I re-collect all the data every year. Last year’s plot (below, Fig. 3) ended with the data from 2011 and contained some notable differences. For SPSS, the 2003 data value is quite a bit lower than the value collected in the current year. If the data were not collected by a computer program, I would suspect a data entry error. In addition, the old 2011 data value in Fig. 3 for SPSS showed a marked slowing in the rate of usage decline. In the 2012 plot (above, Fig. 2), not only does the decline not slow in 2011, but both the 2011 and 2012 points continue the sharp decline of the previous few years.
Let’s take a more detailed look at what the future may hold for R, SAS and SPSS Statistics.
Here is the data from Google Scholar:
Year       R     SAS     SPSS   Stata
1995       7    9120     7310      24
1996       4    9130     8560      92
1997       9   10600    11400     214
1998      16   11400    17900     333
1999      25   13100    29000     512
2000      51   17300    50500     785
2001     155   20900    78300     969
2002     286   26400    66200    1260
2003     639   36300    43500    1720
2004    1220   45700   156000    2350
2005    2210   55100   171000    2980
2006    3420   60400   169000    3940
2007    5070   61900   167000    4900
2008    7000   63100   155000    6150
2009    9320   60400   136000    7530
2010   11500   52000   109000    8890
2011   13600   44800    74900   10900
2012   17000   33500    49400   14700
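For readers who want to work with these counts directly, here is a minimal Python sketch that loads them and finds the year in which each package peaked. The article's own analysis was done in R; the names `RAW`, `load_counts` and `peak_year` are illustrative choices, not part of the original analysis.

```python
# Google Scholar hit counts copied from the data above.
# Columns: year, R, SAS, SPSS, Stata.
RAW = """
1995 7 9120 7310 24
1996 4 9130 8560 92
1997 9 10600 11400 214
1998 16 11400 17900 333
1999 25 13100 29000 512
2000 51 17300 50500 785
2001 155 20900 78300 969
2002 286 26400 66200 1260
2003 639 36300 43500 1720
2004 1220 45700 156000 2350
2005 2210 55100 171000 2980
2006 3420 60400 169000 3940
2007 5070 61900 167000 4900
2008 7000 63100 155000 6150
2009 9320 60400 136000 7530
2010 11500 52000 109000 8890
2011 13600 44800 74900 10900
2012 17000 33500 49400 14700
"""

PACKAGES = ("R", "SAS", "SPSS", "Stata")


def load_counts(raw=RAW):
    """Parse the whitespace-separated rows into {package: {year: count}}."""
    counts = {name: {} for name in PACKAGES}
    for line in raw.strip().splitlines():
        year, *values = (int(field) for field in line.split())
        for name, value in zip(PACKAGES, values):
            counts[name][year] = value
    return counts


def peak_year(series):
    """Return the year with the largest count in a {year: count} mapping."""
    return max(series, key=series.get)


counts = load_counts()
for name in PACKAGES:
    print(name, "peaked in", peak_year(counts[name]))
```

With these numbers SPSS peaks in 2005 and SAS in 2008, while R and Stata are still at their highest values in 2012, the last year collected.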
ARIMA Forecasting
I forecast the use of R, SAS, SPSS and Stata five years into the future using Rob Hyndman’s forecast package and the default settings of its auto.arima function. The dip in SPSS use in 2002-2003 drove the function a bit crazy as it tried to see a repetitive up-down cycle, so I modeled the SPSS data only from its 2005 peak onward. Figure 4 shows the resulting predictions.
The forecast shows R and Stata surpassing SPSS and SAS this year (2013), with Stata coming out on top. It also shows all scholarly use of SPSS and SAS stopping in 2014 and 2015, respectively. Any forecasting book will warn you of the dangers of looking too far beyond the data, and the above forecast does just that.
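To see how a trend forecast can run all the way down to zero, here is a rough Python sketch. It is not the auto.arima model used for Figure 4; it only extends the straight line through the last two observed points, which is a much cruder assumption. The function name and the five-year horizon are illustrative choices.

```python
# Last three Google Scholar counts for each declining package,
# copied from the data in this article.
SPSS = {2010: 109000, 2011: 74900, 2012: 49400}
SAS = {2010: 52000, 2011: 44800, 2012: 33500}


def first_nonpositive_year(series, horizon=5):
    """Extend the line through the last two points of a {year: count}
    mapping and return the first forecast year at or below zero,
    or None if that does not happen within `horizon` years."""
    years = sorted(series)
    last, previous = years[-1], years[-2]
    slope = series[last] - series[previous]  # change per year
    for step in range(1, horizon + 1):
        if series[last] + slope * step <= 0:
            return last + step
    return None


print("SPSS reaches zero in", first_nonpositive_year(SPSS))
print("SAS reaches zero in", first_nonpositive_year(SAS))
```

Even this naive extrapolation hits zero in 2014 for SPSS and 2015 for SAS, which shows how strongly the last few points drive any forecast that assumes the recent trend continues unchanged.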
Guesstimate Forecasting
So what will happen? Each reader probably has his or her own opinion; here’s mine. The growth in R’s use in scholarly work will continue for three more years at which point it will level off at around 25,000 articles in 2015. This growth will be driven by:
- The continued rapid growth in add-on packages
- The attraction of R’s powerful language
- The near monopoly R has on the latest analytic methods
- Its free price
- The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (IBM is loosening up on this a bit)
What will slow R’s growth is its lack of a graphical user interface that:
- Is powerful
- Is easy to use
- Provides direct cut/paste access to journal style output in word processor format
- Is standard, i.e. widely accepted as The One to Use
- Is open source
While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its full range of capabilities and its speed of use. Regardless of which is better, GUI users far outnumber programmers and, until this is resolved, it will limit R’s long-term growth. There are GUIs for R, but there are so many to choose from that none has become the clear leader (Deducer, R Commander, Rattle, at least two from commercial companies and still more here). If from this “GUI chaos” a clear leader were to emerge, then R could continue its rapid growth and end up as the most used software.
The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata) but also by SAS Institute’s self-inflicted GUI chaos. For years they have offered too many GUIs such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use are not sure which GUI to teach, so they continue teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a respectable GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.
The use of SPSS for scholarly work will decline less sharply in 2013 and will level off in 2015 at around 27,000 articles because:
- Many of the people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
- Many of the people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
- Many of the people who needed more interactive visualization have already switched to JMP
The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.
Although Stata is currently the fastest growing package, its growth will slow in 2013 and level off by 2015 at around 23,000 articles, leaving it in fourth place. The main cause of this will be inertia of users of the established leaders, SPSS and SAS, as well as the competition from all the other packages, most notably R. R and Stata share many strengths and with one being free, I doubt Stata will be able to beat R in the long run.
The other packages shown in Fig. 1 will also level off around 2015, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata for fourth place.
The futures of SAS Enterprise Miner and IBM SPSS Modeler are tied to the success of each company’s more mainstream products, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes. Both companies could significantly shift their future by combining their two main GUIs. Imagine a menu & dialog-box system that draws a simple flowchart as you do things. It would be easy to learn and users would quickly get the idea that you could manipulate the flowchart directly, increasing its window size to make more room. The flowchart GUI lets you see the big picture at a glance and lets you re-use the analysis without switching from GUI to programming, as all other GUI methods require. Such a merger could give SAS and SPSS a game-changing edge in this competitive marketplace.
So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to do your own forecasts and add links to them in the comment section below. You can use my data or follow the detailed blog at Librestats to collect your own. One thing is certain: the coming decade in the field of analytics will be interesting indeed!
The problem with R is that it is not validated. I cannot imagine a pharma going from SAS to R. In addition, the mixed tools in R seem to be in a continual state of development, and sometimes people note that the current versions have different results. I am a SAS user, and learned it in 1975, so I am not on the cutting edge. However, until R 1) can demonstrate that the system is dependable and 2) is accepted by the FDA for an NDA, you will see SAS used. In 20 years? Dunno about that.
Not wholly true, see the document on the R web-site “R: Regulatory Compliance and Validation Issues A Guidance Document for the Use of R in Regulated Clinical Trial Environments” (http://www.r-project.org/doc/R-FDA.pdf).
R is accepted by the FDA; in fact my understanding is that they do not explicitly state that any particular software should be used, merely that whatever is used should be validated (there have been various discussions on this matter on the R-help mailing list over the years and more recently on the MedStats mailing list; search the archives of each if you want to know more).
R is already used for submission-related work in big pharma and we have set up validated R servers for this purpose for our customers, so if there is to be a transition it has already started.
Utter FUD! R already is in use by Pharma and is acceptable to the FDA; they don’t mandate *any* software to the best of my knowledge, all have to be validated. R has a compliance statement in that regard.
It is somewhat sad that you have this backwards. How can you know that SAS is doing what it says – it is closed source. R is completely open so you or anyone can check what it is doing and validate it. Something that many people do all the time.
I would like to point out one thing that often goes unstated in these R v. SAS testing discussions … yes R is tested and I can certainly write my own tests with it to convince myself that my specific routine works correctly … BUT SAS guarantees numerical precision across different chip sets and OS’ … this is very difficult, tedious work that I’m not sure is done in CRAN R.
Has anyone ever compared a numerical routine in R, especially one requiring numerical derivatives, across say 32-bit Windows and 64-bit Linux? If you get different numbers, which numbers are right?
Hi ph,
I handle the research software contracts for The University of Tennessee and I’ve read each new contract with SAS Institute. I’ve never seen them guarantee the accuracy of their software in writing. On the other hand, I think they’re a very trustworthy company that goes to great lengths to ensure the accuracy of their products. The main R download is also tested very carefully for accuracy and it has been found to be quite accurate by independent researchers (see http://www.r-project.org/doc/R-FDA.pdf; A comparative study of the reliability of nine statistical software packages, Keeling & Pavur; The reliability of statistical functions in four software packages freely used in numerical computation, Almiron et al). R calls LAPACK, the same set of subroutines used by two commercial packages that are highly regarded for their accuracy: MATLAB and Maple.
Where accuracy becomes more of a concern is with add-on packages. As with a SAS Macro presented at the Global Forum and downloadable from the author’s web site, you have to decide if you trust the source or if you’re going to test it yourself against known solutions.
Cheers,
Bob
Might be a bit late to the game, but I’d like to also note that Stata also makes fairly extensive use of LAPACK (http://www.stata.com/manuals13/m-5lapack.pdf). Stata seems to get left out of many of the conversations. Personally, I think there is room for them all to coexist. The biggest issue I see with SAS and SPSS is that they are cost prohibitive for smaller organizations and in some cases inaccessible to individuals. Their language/style conventions also don’t seem to say much about the respective organizations’ willingness to move toward more modern semantic paradigms. There are things that I am really starting to enjoy about R and things that I still find infinitely easier in Stata. In general, it seems that discussions like this tend to just create tension between each platform’s fanboys – so to speak. Instead, maybe more fruitful conversations could come from discussing the various strengths of each platform.
Hi Billy,
Discussing the pros and cons of the various packages does tend to be about as contentious as comparing religions! That’s why for the most part I focus on measuring popularity or market share. At the moment R and Stata are the two most rapidly growing analytics packages used in scholarly articles. I think that’s due to the fact that both are extremely extensible.
Cheers,
Bob
Please tell that to Amgen, Merck, Pfizer, Novartis, the FDA (yes!) and others 🙂 R is not validated? R is perfectly validated by more than 2 million users who have the ability to look into the code. Don’t even try to spread panic. And please don’t write “dunno”; just check things before you post a comment if you want to be seen as a professional.
Did you know that every thematic section of the R repository has its own academic supervisor? Many of them come from prestigious academic sites (Princeton, Stanford). I trust them more than closed sources and claims that “they do their best”. I have been working with R for the last 8 years and have always validated results with SAS and STATA. I have not discovered any dramatic discrepancies. It means that both SAS and R are well done. The difference is that I can go to cran.r-project.org/src/contrib/Archive/, download any package I want, unpack it and look into the code (R, FORTRAN and C). Most authors of the packages I use include references to the literature, so I can compare the written and implemented formulas. I did this 3-4 times – just out of curiosity about how it is done – and found no issues. Many people working in clinical research use R in validated environments, having their results validated with other packages following SOPs. And this is excellent, indirect validation of the quality provided by R. Keep your head cool, don’t get mad: the FDA does NOT endorse or require any specific software! The software you use must be compliant with 21 CFR Part 11 guidance. And regarding R, there is a document covering this topic. Thousands of biostatisticians use R, and great, well known pharmaceutical companies use R in their submissions. SAS is not the only package you can make use of.
Hi Peter,
Thanks for reporting on your experience comparing the accuracy of R to other software. When you mention “academic supervisors” are you referring to what CRAN labels the “maintainers” of the Task Views section? I wasn’t under the impression that they did any testing on the vast array of packages in their task views, but if you know of documentation that says otherwise, please send it my way.
Cheers,
Bob
I thought we already had a winner in R GUIs called R-Studio. I took two data analysis courses in coursera and they used and recommended all students use R-Studio.
When bob refers to a gui, he’s talking about something more like SPSS than Rstudio. Rstudio has its place, but it still requires R scripting. His dream is to have an easier to install and more functional version of deducer.
Exactly. I love RStudio for what it does. I’m in it all day long. But even at the graduate student level, 80% or more of people just will not program.
While R-Studio looks like a beautiful IDE, it isn’t a GUI application for business analysts who won’t/can’t invest in learning R as a language.
By GUI, they mean something like SAS Enterprise Guide or Rattle. Most business analysts I know have never learned an object-oriented language, they might pick up bits of Visual Basic and the most popular language they learn is SQL.
R-Studio, IDE: http://www.rstudio.com/ide/screenshots/
SAS EG, modern GUI: http://www.sas.com/technologies/bi/query_reporting/guide/#section=4
Rattle, old-school GUI: http://techpad.co.uk/custom/images/large/5007d3c919774.jpg
Hi Stephen, thanks for the screenshots! Cheers, Bob
Bob,
Thanks for doing so much to help people understand the world of R!
I appreciate it very much.
Cheers,
Stephen
How can we explain that the total number of hits for all the packages shows such a peak in 2005? Do I understand correctly that the calculations are done on publications that report which software was used? Maybe this practice is changing?
If anything I would have thought that the opposite is true; that we are becoming more concerned with reproducibility. That said, I do know journals in some fields that instruct authors not to bother stating what stats software they used – for shame!
Hi ucfagls,
I agree completely. I hope the push toward more reproducible research starts getting journals to require stating the software complete with version and code used. I frequently get researchers coming in to do an analysis “just like they did in this article” but it’s impossible to tell WHAT they did. Luckily most authors are happy to explain when I write them.
Cheers,
Bob
Hi Roland,
That’s a question that has mystified me for several years. Other than SAS and SPSS, Systat was #3 during the “hump” years but it did not show that hump. I added all the packages together to see if it smoothed things out. It did not; the plot looked much like SPSS by itself, just at a higher level.
I’ve asked quite a few people about this, including the people in charge of SPSS development, and no one is quite sure why there’s such a radical shift.
One possibility for the drop off is that the recession cut government grant funding sharply. But you’d have to assume that for some reason it affected SAS and SPSS but not the other packages. That might be caused by the fact that more established researchers A) got more grants (and so more cuts) and B) being older they also used the older packages.
Another possibility is the retirement of the baby boomers. If they tended to use the packages that have been around longer, then SPSS and SAS would have been disproportionately affected.
Cheers,
Bob
I work as a researcher for a government research foundation (in Brussels). We have cut our number of SPSS licences in half. Add-on modules were all reviewed. If we were not sure that a module would be used in the coming year by a certain person, we skipped it. The days of ‘let’s get one for him/her too, to be on the safe side’ have turned into ‘let’s skip it for him/her, to be on the safe side’. Moreover, funding for data collection has also been trimmed, so there’s less data-driven research.
Hi Hendrik,
I’ll bet sales of add-on modules for SAS and SPSS are being closely examined in most organizations. It’s often easy to do 90% of your work in the main commercial product and then occasionally call an R package to do something specialized. That approach can save a lot of money without having to retrain the whole staff in R all at once. However, having less data-driven research in general runs counter to overall trends. I never thought I’d see the day when the state of statistical analysis (excuse me, that’s “analytics” now) was routinely discussed in the general press.
Cheers,
Bob
I am not arguing with your analysis, but I think we’ll have to wait a very long time before SAS loses its strength in the analytics market. The reason behind this is simple: legacy systems and legacy code. SAS has been here for more than 30 years, and it is cheaper (not necessarily in money but in time) to upgrade current systems with new SAS products than to replace them with systems based on R, for example. Did you see SAS’s profit last year?
Academia may not be the best feature for predicting the end of SAS (perhaps SPSS 😉)
Hi Alberto,
I agree, the momentum of SAS and SPSS will be slow to turn. In academia old code is helpful to have around for new projects, but it’s nowhere near as important as hundreds or thousands of SAS reports in industry. Even after scholarly work reaches some sort of long-term equilibrium, industry will have another 10-20 years before that pattern takes hold.
Cheers,
Bob
Dear Bob
I was surprised by the continuous fast growth of STATA. I thought that its peak moment was behind it. Even in economics (where, for a number of reasons, basically its powerful panel data capabilities, it is quite popular), more and more people are moving to R (in my own department, first year Ph.D. classes are now 100% R).
Any thoughts?
Dear Jesus,
I, too, was surprised by the continued strength of Stata. Below is a list of similarities from my book with Joe Hilbe, R for Stata Users. I thought these similarities would mean that Stata would be hit the hardest by competition from R. However, lately in my workshops when I ask the Stata users if they program or use the GUI, around 90% say the GUI. That’s quite a switch from the early Stata users that I knew who were there for its excellent programming language. So I suspect that R’s lack of a point-and-click style GUI is keeping many Stata users from migrating. I also think that even for programmers Stata’s language is easier to learn, though it may be slightly less flexible (e.g. it’s easy in R to create entirely new data structures).
Cheers,
Bob
From “R for Stata Users” (http://r4stats.com/books/r4stata/):
“Perhaps more than any other two research computing environments, R
and Stata share many of the features that make them outstanding:
• Both include rich programming languages designed for writing new analytic
methods, not just a set of prewritten commands.
• Both contain extensive sets of analytic commands written in their own
languages. [for a clarification, see http://librestats.com/2011/08/29/how-much-of-r-is-written-in-r-part-2-contributed-packages/]
• The pre-written commands in R, and most in Stata, are visible and open
for you to change as you please.
• Both save command or function output in a form you can easily use as
input to further analysis.
• Both do modeling in a way that allows you to readily apply your models
for tasks such as making predictions on new data sets. Stata calls these
postestimation commands and R calls them extractor functions.
• In both, when you write a new command, it is on an equal footing with
commands written by the developers. There are no additional “Developer’s
Kits” to purchase.
• Both have legions of devoted users who have written numerous extensions
and who continue to add the latest methods many years before their competitors.
• Both can search the Internet for user-written commands and download
them automatically to extend their capabilities quickly and easily.
• Both hold their data in the computer’s main memory, offering speed but
limiting the amount of data they can handle.”
Bob: Your GUI — programming distinction doesn’t make much sense for Stata. It has menus (in practice mostly only for official commands), it has a command line interface, and it has a do-file editor for developing scripts. So, what are you calling the GUI? Many users move back and forth between these. In practice, according to many people I’ve discussed this with, only novice or occasional users use the menus. Also, this balance hasn’t shifted much over the years. (I’ve been using Stata since 1991.) In short, Stata is highly command-oriented.
Hi Nick,
I love Stata’s command language; it’s clear and concise. What I’ve noticed lately though is that the younger Stata users taking my workshops depend far more on the menus and dialog boxes (what I called the GUI) rather than commands. All questions I used to get were about the command language, but it has shifted over time. It could just be that more novice users are taking my workshops.
Cheers,
Bob
Dear Bob
Thanks for your thoughtful response. Your analysis moved my posterior quite a bit from my prior 🙂
Jesus, Haha! Well said, well said! -Bob
My $0.02 worth, as a STATA and R user:
Stata is quite cheap, and you can actually understand the user guides (and most of the error messages).
STATA and R both have massive ranges of (free) add-on packages.
Just installing SAS is like going back 20 years (repeatedly swapping between 6 CDs etc). This company cannot rely on inertia for ever.
Hi Blair,
Stata is superb software and for a single user system it’s not too expensive. However, our server version cost $14,000 for one of our small servers. There’s no way we could afford it on a big cluster.
The SAS installation is indeed in a class by itself. Not a good class, either. It has been years since I counted all the pages of instructions, but it was over 500!
Cheers,
Bob
— … so people tend to continue using the tool they learned in college for much of their careers.
Would that it were so. If it were, COBOL and VSAM and RPG would have died around 1980. For those who work independently, then any tool which fits the hand will do. Academics come to mind. Anywhere else, not so much. Just as COBOL has 50 years of code hanging out (and being lipsticked with java/javascript to a faretheewell), SAS/SPSS has about 40. And mindshare.
Hi Robert,
I had to laugh because at first I thought you were saying that people didn’t stick with what they learned in college because NEWER tools came along! I agree that there will be great pressure for new graduates to drop R and switch to SAS if that’s what their new employers use. Even if a company were trying to migrate from SAS to R, it is likely to take a decade or more. We just retired our mainframe, and it was a top priority to do so for almost 20 years!
Cheers,
Bob
A very interesting article, but I seriously question the validity. I am software agnostic; the tool fits the problem, not vice versa. I work at a site that has R, SAS, SPSS, Stata, Matlab and several one-offs. This is for Health Care research and the majority of younger Researchers do enter knowing primarily R. However, they almost always gravitate to a more robust, enterprise environment and it is usually the SAS/Grid. Stata is used for Health Economics (good tool) and many do use SPSS. We have found that 70% of our Research use SAS. As you know, SAS can incorporate R, but in the era of Monstrous Data (especially Health Care), I do not see R supplanting any of the mainstream products. It is my understanding that market-share for SAS is increasing. Obviously, in such a lucrative market, there will be competition and entry into the market will curtail massive growth, but I do not see a declination. I have consulted in the Financial, Health Care, Pharma, etc. industries and several Government venues. I do not see this analysis as a real-market, applicable model, but more of being a proponent for R (which is also a good tool).
Hi Mark,
I agree that each of the methods I use to estimate popularity or market share is flawed in one way or another. It’s the combination of them all that I find most compelling (see http://bit.ly/statpop). I also don’t mean to imply that R is better than the alternatives. I use SAS and SPSS a lot and like them both. While I needed co-author Joe Hilbe to go in-depth on Stata, I hold it in very high regard.
I think that SAS sales are going up because they continue to introduce useful vertically-integrated solutions (e.g. SAS Fraud Network Analysis). However I suspect that the market share of SAS/Stat is decreasing simply due to competition from all sides. There are some fairly major competitors that I’m not even covering, such as Tableau and Spotfire.
Even if these trends were to continue in academia, I suspect that it would be a decade or two before they would make their way through industry.
Cheers,
Bob
Hi Bob, I would agree with your comments. I enjoy using all of them. Essentially, we are seeing the results of opportunity coupled with market saturation. Onward and upward!
I agree with a lot of that, because this view is problem-driven and takes staff turnover into account. My own experience of teaching R and Stata also converges with Bob’s “free puppy” remark below, and the comments that describe RStudio as the clear winner among R interfaces are also correct in my view, although there is indeed a difference between pushing buttons in a GUI and using an IDE.
As far as I can tell, the current market can be summarized in three trends: fast ubiquitous growth through cutting-edge innovation (R), slower sectoral growth driven by path dependency (SAS) and specificity (Stata), and decline or stagnation (everything else, including SPSS).
I agree with your point that the choice of software relates to the kind of work. For my research, data manipulation and producing publication-ready tables are the core of my work. Lots of variables, lots of crosstabs; that’s not the target environment of R, I think.
Seems to me that the headline on this article ought to be “Total citations of stat software in Google Scholar drop 50% over 4 years.” Since I very much doubt total usage of stat software has fallen, that suggests trends in total citations do not reflect trends in total usage. That raises the question of whether trends in the proportion of citations for each stat package actually reflect trends in the proportion of usage. Perhaps SPSS users are disproportionately more likely to have stopped citing the stat package they used to obtain their results? (I can think of reasons why that might be so, but they wouldn’t explain the same thing happening to SAS.)
But putting that aside for the moment, Stata’s strength relative to R does not surprise me at all. Stata’s simply much easier to learn–even if you insist on writing programs rather than using the menus (which you should). Also keep in mind that many academic users don’t pay for their own Stata licenses, so R being free does not affect their decision-making.
(To expand on Nick Cox’s point: you have to distinguish between people that use Stata’s GUI as an IDE for writing programs and people who use Stata’s GUI to avoid writing programs. I always teach people to do the former, and if you’re seeing more of the latter that’s disappointing but not terribly surprising.)
Hi Russell,
How about the headline, “Number of Publications that use Stat Software has Increased 635% Since 1996, with a Weird Hump in the Middle.” The hump in the graph is quite bizarre! Competition from the packages shown in this set of graphs definitely cannot balance out that hump (I’ve plotted it to make sure) but it’s possible that competition from other packages that do statistics might. MATLAB, Mathematica, RapidMiner, Weka, SPSS Modeler, SAS Enterprise Miner, Spotfire, Tableau, KXEN, and Salford’s CART, TREENET, MARS, etc. must have been used in a lot of scholarly publications and many were not popular before 2005. However, in academia I don’t see the classic SPSS user using any of them. Stata is the only thing I see chomping away at the SPSS marketshare in academia.
I agree that Stata is easier to learn and use than R. In fact, I suspect that if Stata were to become an open source project, it would become the most widely used software in academia in short order. Now excuse me while I put on my Kevlar vest!
Cheers,
Bob
I don’t see Stata going open source as far as proprietary code is concerned. But a more immediate point is simple, but often missed. Stata and R are converging in real price. The price of Stata — although more than most individuals prefer to pay — is modest compared with the commercial opposition, and once you have a current Stata licence free technical support is included and lots of free user-written software is available to you. The real price of R has to include whatever training and books and support from specialist companies that users pay for. In that sense the statement “R is free” is completely accurate but nevertheless incomplete. Naturally I am not denying that Stata is commercial and R is not: just saying “measure how much you pay”.
Hi Nick,
Good point. Open source fanatics love the two “frees” — free beer i.e. free to use and freedom to change — but tend to downplay the “free puppy” one. You may totally love it, but it’s gonna cost you! I agree that Stata pricing is quite a deal for a single-user SE license for business ($845/yr) and the Small Stata license for students at $49/yr is decent. Where Stata pricing gets crazy is on servers. A 64-core server with 25 users is $75K/$40K for business/academia. The smallest cluster our group has is 5,000 cores, and the largest has 100,000. Such needs are not common, but we do use R at high scale (e.g. http://www.r-bloggers.com/r-at-12000-cores/). SAS Institute recently woke up and made their licensing for unlimited copies on unlimited servers at all our campuses for not much more than Stata charges for one small server. I’m sure they realized that too much Big Data work in academia was using R and they needed to address that.
I don’t mean to imply that any of these packages are not worth their asking prices. As long as they are able to sell the software, it’s worth the price. But I’m glad open source projects such as R and RapidMiner are there to help drive prices down.
Cheers,
Bob
I recently had to return to SPSS for one project after a long period of using R and MS Access. At first I was using the GUI but then found I couldn’t stand the lack of repeatability as the tasks had to be repeated over numerous datasets. Then I moved to syntax and it wasn’t so bad but I now strongly prefer R.
I also noticed how limited the joining of datasets in SPSS is. I would have thought this feature would be more versatile by now but it looks like the fields to join on still have to be the same name, etc.
Hi Justin,
As you point out, the repeatability factor is very important. I think that’s why so many of the newer packages like RapidMiner, Orange and Knime have adopted the flowchart GUI used by SPSS Modeler and SAS Enterprise Miner. It’s the only non-programming GUI that allows you to use it repeatedly without having to switch to programming. The Red-R GUI for R is like this, but unfortunately its progress seems to have stalled.
Cheers,
Bob
Very interesting, thanks for the analysis!
I would be delighted to see matlab included the next time…
Guten Tag Berry,
While MATLAB and R have much in common, MATLAB use is dominated by solving engineering problems rather than statistical analysis or data mining. If I could think of a way to split that use out, I would love to do it.
Tschüss,
Bob
Thank you very much for your interesting article.
I translated your article into Japanese. Please let me know if you are not comfortable with my translation.
https://www.facebook.com/masanori.yoshida.3517/posts/470142369728338
Hi Masanori,
Thanks for the translation!
Cheers,
Bob
Hey Bob,
Tim Daciuk here; I think that we did a couple of presentations and/or were on a panel together, back “in the day” (when I was part of SPSS Inc). Interesting article and interesting use of forecasting. Certainly the use of R is expanding, mostly due to the cost of R. I think, however, that measuring trends from a primarily academic/scholastic bent may be problematic. If you take a look at ‘industry’, I think that SPSS and SAS are still the big gorillas in the market and will be for the foreseeable future. I think that this is due to a number of factors: 1) the ‘one throat to choke’ ability of having a company stand behind the product; 2) the end-to-end solutioning that SAS and SPSS offer (as predictive analytics becomes a business integrated function), which is not the case with R; 3) the development of vertical applications (mentioned earlier); and 4) the existing ancillary development and integration network around SPSS and SAS (though this is changing).
P.S. I tend to rely on the Colbert statistic for a lot of my work!
Hi Tim,
It’s nice to hear from you! I miss those SPSS Directions meetings. IBM priced them out of the academic market. I agree with all your points. I’ll write a new post soon based on job advertisements (mostly corporate) that reinforces your point.
Cheers,
Bob
I really enjoyed reading your article and the comments. There is, however, another bit that belongs in the discussion and has not yet been mentioned (or I overlooked it). Let’s start with a provocative statement: in my book, teaching SAS/SPSS at universities almost amounts to misappropriation of funds. Let me explain. When vast amounts of cash are spent on software deals, that money is then lacking in other places, like lab seats or smaller classes. In return, unis get programs that de facto lock their students (and faculty) in to a vendor. A common counterargument is that unis need to teach what industry requires. However, I believe that a uni’s job is to teach knowledge, and not to lock their students in to specific software. If someone understands statistics, learning SAS or SPSS is not that much of a hassle anymore. However, teaching statistics with R, being able to demonstrate how calculations are done and results come about, gives any educator a definite advantage. So I hope that R (or any other free successor, for that matter) will eventually dominate.
Forgive my tone, but at the moment I’m a little bit disgruntled because I just spent half a year trying to convince some Unis in Serbia to use exclusively R in their newly established statistics program, and failed.
Hi Christoph,
At The University of Tennessee I’m in charge of software licensing for research tools and you’re right, we spend a LOT of money on them. It’s around $350K for research only and well over $1M if we include productivity tools, ERP software and databases. When it comes to teaching, professors face tough choices:
Use what’s free or cheap to save the university money?
Use what’s easy so students can focus on analytic concepts instead of programming and debugging?
Use the tool that’s powerful so students will learn maximum flexibility?
Use the tool that’s most likely to get the students a job?
Depending on your perspective, each answer may lead to a different product! I hope that as the menus & dialog box GUIs for R improve and employers use R more, that all these could be fulfilled by one package. However, I suspect that SAS and SPSS will be #1 and #2 with employers for many years to come. I’ll have a blog post on that soon with the latest data.
Cheers,
Bob
Hey Bob,
thank you for sharing the figures of UT. I had not realized that you work there, I had the privilege of being an exchange student to the Bartlett area around Memphis a long time ago. Since then, Rocky Top never fails to increase my heart rate. 😉
Back to topic: I totally agree with those hard choices and certainly share your hope of R interfaces improving and thus gaining a bigger market share. They also entirely depend on the audience. For Statistics majors I would go 100% R from day one, complementing it with a scripting language as data retrieval and manipulation becomes an issue.
When teaching other subjects the choice is less obvious for me, unfortunately. I got social science majors started on R using R Commander with quite some success. However, a key issue there is applying survey weights to data, and here R GUIs don’t help. To my surprise, most of them took quite easily to the command line. And once you get to the point where you tell them that using survey weights correctly in SPSS involves much more than issuing a WEIGHT BY command, they accept R’s solution willingly. Anyway, most social science students end up producing SPSS tables and guessing their meaning. So while SPSS aids them in producing results more quickly, it does not help them to produce correct results, let alone understand them.
I’ll be looking forward to your post about employer preferences.
Best,
Christoph
Hi Christoph,
We had a large class of non-stat majors switch from programming in SAS to clicking in SPSS. There was a real concern that the SPSS approach would let the students be lazy and not learn as much about what the output meant. That happened over in the Statistics Department. Our research support group sees the students years later when it’s thesis or dissertation time. It was clear to us that the SPSS approach allowed the students to learn much MORE about what the statistics meant. With SAS programming, they spent far too much time debugging their programs. I’m sure this had nothing to do with SAS per se, but just the debugging time you’d have with any language.
However, with stat majors, I agree that they must dive in and learn to program or they’ll never do well on the job.
Cheers,
Bob
Bob, why not Stata? It’s the right middle point between programming and point and click for non stats students.
Hi Fr.,
I’m sure Stata would have done as well. SPSS was chosen by the departments that were requiring the students to take the class. My point was that when non-stat majors have only two stat classes in their entire PhD program, they’ll learn more about statistics if they don’t have to spend that time learning both programming and statistics. I think that would apply when comparing any two decent stat packages, one using programming and the other using a point-and-click GUI. Of course this was not a carefully controlled study, just an observational one. Decent research may well exist that would do a better job of settling that question!
Cheers,
Bob
The problem with open source software is that no one is responsible if there is a crash, a bug in a calculation or a flaw in a crucial data analysis. With paid software there is a company that can be held responsible (well, partly at least) and can be asked to fix the problem. With open source, you can’t call a specific person. Businesses need service people to support whatever runs the company, school or research, so SPSS and SAS will stay on as long as they are paid software. Piracy has made paid software equivalent to open source! So the number of adopters and learners of SPSS and SAS is so high that even in future there will hardly be any letup in the usage of those two packages. I can’t see R surging ahead in future, though I will continue to work against the blank looks I get when I say, “Why don’t you do the Fisher test on the 3×3 matrix data in R? SPSS can handle only 2×2.” The question I get is “What is R?” and later “Why would you download all those different packages and write a program for it?”
Hi Selva,
Revolution Analytics is betting that most people will agree with you and pay them for Revolution R Enterprise. Then they can pick up the phone and get immediate support.
Cheers,
Bob
I was wondering if you took into account the “renaming” of some SPSS products to PASW in and around 2009-2010. It would explain your rapid decrease of SPSS hits in that time period.
Hi Researcher,
Thanks for asking this question. I thought my query included PASW, but I just checked and it did not. I’ll fix that for next year but the impact of it will be small. From its peak at 155K hits, SPSS fell to 49.4K. Only 7.7% of the decline since 2009 was due to the exclusion of PASW as a search term.
Cheers,
Bob
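The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SPSS counts come from the Google Scholar table in the article, the "decline since 2009" is assumed to mean the 2009 count minus the 2012 count, and the variable names are made up for the example.

```python
# SPSS hit counts copied from the article's Google Scholar table.
SPSS_HITS = {2009: 136_000, 2012: 49_400}

# Share of the post-2009 decline attributed to omitting PASW (figure from the reply).
PASW_SHARE = 0.077

# Assumption: "decline since 2009" means the 2009 count minus the 2012 count.
decline = SPSS_HITS[2009] - SPSS_HITS[2012]

# Hits implied to be missing because PASW was not a search term.
pasw_hits = PASW_SHARE * decline

# Decline that remains even after crediting those hits back to SPSS.
remaining_decline = decline - pasw_hits

print(f"Decline 2009-2012: {decline:,}")
print(f"Implied PASW hits: {pasw_hits:,.0f}")
print(f"Decline not explained by PASW: {remaining_decline:,.0f}")
```

Under these assumptions the omitted search term accounts for roughly 6,700 of the 86,600 lost hits, which is why adding PASW to the query would barely change the shape of the SPSS curve.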
Thanks very much for the translation into Portuguese!
Cheers,
Bob
Reblogged this on psicometricae.
Greetings from the University of Tennessee at Chattanooga. Thanks for the article. I completely agree with your prediction that R will level off without a GUI and that 80% of people will not use a code language. That was the revolution behind the development of Windows. Today’s generation is even more anti-code as everything is graphic based.
Hi Isaac,
It’s nice to hear from UTC! I really like the combination of a GUI that writes a program that I can then customize. That way I get an error-free start and as much flexibility as I need. Most just want to point and click though.
Cheers,
Bob
To me this report is quite one sided if we were to look from this perspective:
1) How much are universities and schools willing to pay for an environment that encourages learning technologies that are widely used in demanding reality?
Nowadays universities and schools are looking at profitability and/or cost saving rather than quality. If delivery of a course is on analytics, one may even consider analytics software or even business intelligence software, whichever does the job.
Hi Shrio,
Academia is under pressure to control costs, but I’m not yet aware of any major universities that have stopped licensing SAS, SPSS, or Stata. So far only S-PLUS has been eliminated through the use of R. I suspect that it will be at least 10 years before R eliminates any others.
Cheers,
Bob
In the full enterprise environment, SAS has tools for SAS marketing automation and marketing optimisation. I doubt R etc. can provide anything that comes close to handling these tasks required in a busy marketing or CRM environment. SAS is not just about analysing data, but taking things a step further and handling customer lifecycles.
Hi Radjaye,
Yes, SAS Institute is now offering complete solutions to problems so you don’t need to learn to solve them yourself. That’s a valuable service. It will be interesting to see how long it takes for a company to start offering similar solutions built using R.
Cheers,
Bob
I just bought your book. I look forward to reading it.
Hi Mark,
I hope you find it useful! If you think of things that could be improved in it, please drop me a line at muenchen.bob@gmail.com.
Cheers,
Bob
Currently I am an undergraduate student of statistics. When I want to do some analysis, I personally like Minitab due to the GUI (I have not used SPSS much). On the other hand, I do love R, but as discussed here, the lack of a proper GUI stops many of us from using R whenever we want. For simple analyses, Minitab, Excel and perhaps SPSS are also good, while R takes a lot of labour for a simple analysis, which is not always good.
Radjaye is right on with his comment. I liken SAS to public transportation and R to a scooter. It can go some places more quickly than SAS, but cannot carry the load needed to make it truly effective for now. What I have been doing lately is combining the SAS and R worlds. Allowing folks to use the power of SAS, divert to a R module and then port the results back to SAS to continue process. Very interesting and very well-accepted. What I see in the future, is SAS assimilating the entire R concept into their suite. I also think, this will eventually mitigate the costs of SAS downward. All good outcomes.
My mission will be optimizing R within SAS to maximum benefit. If you would like to collaborate, that would be fine.
I think “R” should release its software for tablets as well. It would be a revolutionary step and go beyond all competitors in one go. Maybe an R Android app. I love “R”.
Hi Ravi,
One of the odd things about tablets is that they gain simplification by hiding their file system. When I do a single analysis project, I may have files in many formats: Excel, R, SAS, SPSS, Word and LaTeX. I don’t want to go to each app to find the files for that project. I want them to be all in one folder as they would be on a PC, Mac or Linux computer. I don’t know if the iPad/Android tablets will go that direction. Windows tablets have the file manager still, but they get grief from reviewers as not being as easy to use. It will be very interesting to watch what happens in the tablet space!
Cheers,
Bob
Why didn’t you include PSPP in the analysis? For a lot of users it will do all they did with SPSS.
I also wonder if PSPP would lead people to use SPSS afterwards.
Hi Hendrik,
That’s an excellent question. Anyone can download the free SPSS clone amusingly named PSPP from http://www.gnu.org/software/pspp/. I’ve been following the software off and on for many years. Although it offers quite a lot of what SPSS does, its support for analysis of variance (ANOVA) is very weak. It does only oneway ANOVA and it lacks mixed linear models that are widely used here at The University of Tennessee. It also has no multivariate methods outside of factor analysis. If you don’t need those methods, it might be good for you. The price is certainly right!
Cheers,
Bob
For Enterprise people they can use Revolution R, which is the business version of R. The cost is much less than SAS. Also Revolution R can handle very big data, can work from Hadoop cluster and much more. Revolution R gives full support for business. Also if any company wants to switch to R from SAS they convert the code for free. Revolution R is such a software which tweaks R and does computation much faster than the regular R version. Also it can handle really big amounts of data.
Revolution R is free of cost for academics and researchers.
Just have a look at it :- http://www.revolutionanalytics.com/
Revolution R is really a good option for business people who want to use R. It costs much less than SAS and it can handle very large amounts of data. Not only that, Revolution R can work on a Hadoop cluster and it has a lot more tweaks than the free R version. The support is extraordinarily good and it is free for academics and researchers. If any company wants to switch from SAS to R, then Revolution R converts the code automatically, free of cost.
Have a look at it here :- http://revolutionanalytics.com/
I use both SAS and R. In fact, I recently worked R into our SAS environment. I agree that SAS is expensive, but if you compare SAS and any form of R together, I don’t see any area where R is superior. In point of fact, members of our research community report that much of the R code they find does not validate correctly. SAS is justified at many sites as the software of choice for its superior capabilities. As R competes more, it will become more expensive. SAS market share is increasing, not decreasing. I am completely software agnostic; this is how I and others perceive it.
Yes SAS is the software of choice in many places and I think this is because of the fact that SAS is older than R. Many people are used to SAS and so they don’t want to trust a newcomer in the analytics field. As for superiority of R over SAS, take these as examples :-
The graphics produced by R (with the ggplot2 package) are far superior to those produced by SAS. Also, integration of R (actually Revolution R) with Hadoop makes it much easier to handle big data sets conveniently at lower cost. There is no doubt that SAS still dominates the job market, but the growth of R is much greater than the growth of SAS. And when big companies like Google use R for their analysis, there is something in R – what do you think?
I find your reasoning very flawed in many respects. If you had seen the recent release of SAS 9.4, you could never have made those comments. I have seen the R and SAS graphics and I cannot agree with your comment of R superiority. Please view the SAS/Graph package, ODS delivery and the graphics inherent in their statistical procedures. SAS has released a new language that allows for much better data handling in any relational DB (FEDSQL), and an interface with Hadoop that allows for seamless interaction. They have also added a new language in the Data Step to augment their Hash program that will process data several times faster than R can even imagine. Their tools are reliable, leading-edge, proven, innovative and continuously enhanced. The growth of R is slowing as it is aging in the market. Don’t you think that if R could replace SAS, companies would flock to it? I see the opposite happening. SAS has released a very inexpensive version to academia and I have seen a vast increase in SAS individuals graduating. Finally, for every Google I hear of in regards to R usage, there are 50 times as many firms using SAS (Financials, Pharmaceuticals, Health Care Research and Providers, Governments, etc.).
Bob, wonderful article and forum to discuss this topic. You mentioned RapidMiner in one of your earlier posts. I am curious as to what your thoughts are on KDNuggets’ yearly poll on what analytics software is currently being used (http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html). RapidMiner seems to be rising in popularity and is becoming more of a commercial solution, has some good open source backing, and seems to be providing a good GUI interface layer to a fairly robust backend. It supports many of the newer datamining techniques and seems to be becoming more polished. I currently work for a U.S. telecom and the predictive analytics department I run currently uses SPSS Modeler, but I have been keeping my eye on RapidMiner and R (although I have used neither) as they seem to be rising in popularity in the business community. Thoughts?
Hi Cris,
I think RapidMiner is the most interesting open source analytics software next to R. I like their AGPL approach of making the older versions free while charging for the most recent “commercial” one. I’ve only gone through a couple of tutorials on it, but the interface looks well designed. It’s certainly easier to get started in RapidMiner than R.
Cheers,
Bob
My 2 cents: I teach quant methods in sociology and do research using super complicated register data, meaning that easily 95% of the time invested in a paper goes to data manipulation. Stata is currently way superior for data work to any competition I know. So even if we want to use a method that is only available in R (which is the case quite a few times in fact), the data work is almost always done in Stata. This also applies to quite a few people who would describe themselves as primary R users. This may explain why R and Stata growth go hand in hand, at least for now. This is also the reason why R can easily be used for teaching statistics relying more on the methods themselves but is not that well suited for teaching how methods are applied in research in practice (I have used both).
A second advantage of Stata over R is that using PCs Stata’s multicore application does not require any additional steps from a user — a convenience that is not available in R as far as I know.
I really think the GUI advantage of SPSS is overrated.
Hi Jani,
When I started using R in 2005, I often saw people on Internet forums saying they needed SAS for data management and R for analysis and graphics. But some R gurus said that R could do all those tasks, so long as the data fit into RAM. To see who was right, data management was the first area I studied in R. The result was my workshop Managing Data with R, which summarizes the 160 pages in R for SAS and SPSS Users, 2nd edition. My coverage of the subject in R for Stata Users was somewhat less, at 108 pages. R is not only capable of handling all the data management situations I’ve seen, but it does so with great elegance thanks to Hadley Wickham’s packages plyr and reshape2.
R has had multi-core support since version 2.14.
I think Stata is a wonderful package, with a more consistent and extensible language than most other packages. It’s also much easier for a beginner to do a lot with a small amount of Stata know-how. A small amount of R knowledge just leads to frustration.
Regarding GUIs, I think Stata’s is about as good as SPSS’.
Cheers,
Bob
Check out the free version of SAS Enterprise Guide called SAS® OnDemand for Academics http://www.sas.com/govedu/edu/programs/od_academics.html. SAS provides one-year license at no cost for instructors and students who wish to use SAS. It only works for Windows operating system. There is also free SAS Web Editor – a service that allows you to access SAS over the internet. No need to download. The only drawback I saw so far with the free SAS® OnDemand for Academics is that it is pretty slow due to the server issue.
Hi Trang,
I have no doubt that those free versions are the direct result of competition from R and RapidMiner. I’ve heard they’ve sped it up with bigger servers, but students can only analyze data that the professor has put online. That makes experimenting with your own data frustrating.
Cheers,
Bob
We use SAS, Stata, SPSS and R. The new version, SAS 9.4, with additional data languages (DS2, FedSQL, etc.) and upgrades to the interfaces and statistical procedures, puts both Open-Source and Commercial R back some years. We also have a SAS Grid, which increases our Statistical ROI minimally 3X. 75% of the non-SAS entrants migrate to SAS as their primary language. The other gentleman is correct; since SAS has given a very low-cost option for Academia, I am seeing a monstrous increase in students with a SAS background. Remember it is a moving target and I think SAS has just raised the pot! If you are comparing contemporary R with SAS of even a few years ago, you will be astounded as to the improvements.
Hi Mark,
When I started to learn statistics, I heard about SAS first and R was very distant from me. Then I did all my work with Excel and Minitab, because they were enough at the time. But I personally had a strong urge to learn SAS, as my professors told me that it is really a good language. I spent almost half a year trying to learn SAS in a convenient way, but I didn’t find one. The main obstacle was that SAS is not free even for students. The low-cost option that you mention is not low enough for a student who just wants to learn SAS out of interest. Also, SAS OnDemand is not always helpful. Being really frustrated, I shifted to R and am now learning it. R is always available at my fingertips while SAS is not. Also, Revolution R, as mentioned by boral1 and anilde above, is a very good piece of software and it really has enhanced R a lot. Many things which one cannot do in R can be done in Revolution R. Revolution R is free for academics, and I personally will support R (and also Revolution R) for this openness to students, which SAS doesn’t have.
Interesting comment and I understand your position completely. However, I believe that my point was that R and Revolution R are fine products for academia or very small projects. However, in the corporate environment, government applications or any “Real-World” application, it really competes in a very small niche. I use both and others as they apply to my projects. As in most shops that have SAS, R, Stata and SPSS users, the vast majority of the work (75%) is done in SAS, with the remainder split among the rest. SAS gave many academic institutions a very low-cost cloud option and the number of grads with SAS experience has increased exponentially in the last few years. I also support R, but as your own professors said, SAS is a very good language to learn given its dominance. Good luck in your studies!
I use SAS to teach and I ONLY use real-world examples – the California Health Interview Survey, Trends in International Math & Science Study and more. It is true that I have to upload the data to the SAS Web Editor, as the professor, but all of our data analysis is done with real data and there is nothing forbidden by SAS. When students did dissertations, they could use SAS as well
Hi Drannmaria,
I like real-world examples, especially when they allow students to solve new problems. That’s usually where SAS Institute draws the line. You’re welcome to use public third-party data as you mentioned, but if you want your students to solve real-world problems for companies, SAS won’t allow it (unless they’ve changed their licensing very recently). I handle the academic contracts for about 30 software vendors and they almost all have that clause. I believe Revolution Analytics is an exception, but I haven’t read their contracts in around five years.
Cheers,
Bob
Interesting. I learned SAS in college (in 1979) and with the exception of a brief period where my office was using SPSS have always used SAS for work. We presently support both SAS and R.
I thought SAS had a program in place that provides SAS free to educational institutions.
Hi Robin,
Yes, SAS Institute does offer the use of SAS on their cloud systems for free. The professor submits data sets and students can analyze those. However, to install the software on their own PC, the school will have to pay for the license. It’s inexpensive per copy, but still runs into tens of thousands of dollars.
Cheers,
Bob
Hi, a better GUI for R is Tableau Software. You can trial it for free on our website. It needs no programming.
Hi David,
I don’t quite understand your comment. In your web site’s document titled, “Using R and Tableau” it says:
“Who is this feature intended for?
This feature is primarily targeted for users who are already proficient at R. It is NOT meant for beginners with R. Anyone who wishes to use the new functions must first learn how to use R in order to leverage its capabilities in Tableau.”
Which is correct, your comment or your company’s documentation?
Cheers,
Bob
Hello all. First and foremost, what a delightful analysis Bob! I enjoyed the read.
Now, on to business. Currently, I am a PhD student in Biomedical Informatics and I use both R and SAS. I have seen the work with Bioconductor, which I really enjoy and look forward to the further developments found in the Bioconductor project. It is my understanding that some hospital systems, most notably the Mayo Clinic, have started to use R in their data analysis and management. My question is do you believe the medical sector will ever trend towards using R more frequently than SAS and if so, do you have an estimated time this might occur?
Thanks!
Hi William,
In the area of bioinformatics, I would guess that R is already more widely used than SAS due to the wide range of functions that have been added to the Bioconductor project. However, I would expect that in the larger biomedical market that SAS still dominates. Those are both just guesses though. It’s hard enough to get solid data on research as a whole let alone breaking it down by application segment.
Cheers,
Bob
Bob
I have used Stata regularly for many years and have also introduced some of my PhD students to this program. I am now attending a course in statistical learning that uses R. I have done my first introductory lesson and am almost lost. There are so many assumptions. Just an example – to define the working directory you cannot use the backslash (which is the norm for Windows) but must use “/” instead. It took me 10 minutes to find out by trial and error. And the help files use such difficult neologisms or technical terminology that they are uninterpretable for the uninitiated. Compare ?summary in R and help summarize in Stata, for instance.
I am a surgeon and use Stata maybe 3-5 hours a week. I very often need to check out the help file to get things right. Maybe R can work for someone that stays at the keypad 20-30 h/week and is also a professional statistician.
I have heard so many enthusiastic comments about R that I decided to try it out, but so far I am doubtful. I can probably do all I want in Stata.
Roland E Andersson
Hi Roland,
I heartily agree with your comments. In fact, in my book, “R for Stata Users”, I use the help files for displaying your data as an example. For Stata it’s crystal clear: “LIST displays case values for variables in the active data set.” But the R help file for the similar print function has this cryptic description: “print prints its argument and returns it invisibly (via invisible(X)). It is a generic function, which means new printing methods can be easily added for new classes.” So it prints “invisibly”?? Well no, but you get that impression. You also need to understand classes and methods to know what it’s talking about.
I provide several other reasons why R is not for occasional users in my post, “Why R is Hard to Learn”.
Cheers,
Bob
I NEED A INFORMATION ON SAS/SPSS WHICH ONE POPULAR & WIDE ANALYSIS BENEFIT IN MARKET RESEARCH (WIDE TOOL,EASY ERROR RECTIFICATION, EASY TO ANALYSE DATA)
Hi Romio,
If you’re looking for software for market research that’s easy to use, I recommend SPSS. That package is very dominant in the field of market research so knowing it is a good thing to have on your resume. It’s also quite easy to learn and use.
Cheers,
Bob
I am interested in your perceptions on why JMP attracts so little attention. I have been using it for years and find it far easier to teach students to use, easier to use myself, and quite powerful, extending to a number of machine learning methods. My own personal belief is that JMP has been handicapped by being owned by SAS – fear of product cannibalization. But academics have not embraced JMP to the extent I think it deserves. An additional factor in its favor is its extremely attractive academic licensing compared to all of the competition (except R of course).
Hi Dale,
JMP is really nice software. It’s so enjoyable to have all the graphs linked & interactive. I too have been surprised that it has not become one of the top stat packages. All I can guess is there’s so much competition.
Cheers,
Bob
This is a highly contrived and restricted study with dubious conclusions at best. The usage of SAS has risen quite dramatically in the academic world since SAS offered a very inexpensive, if not free, version to schools. I am seeing a huge influx of applicants out of school with SAS and/or other analytical package skills (including the ones you mention). I work with > 4000 researchers in the health care field and we offer R, SAS, SPSS, Stata, Matlab, and many one-offs. SAS is the OVERWHELMING choice, as 78% of the population (average age would be ~38) use SAS, followed by Stata, SPSS and R. Many of the individuals who used R in school switch to SAS within a 6-month period. My company is constantly receiving inquiries for SAS talent and we cannot keep up with the demand. I respectfully refute these findings as a biased study to push the usage of R for commercial remuneration.
Hi Mark,
I get the impression you did not actually read my “dubious” conclusions. Take a look at Figure 1a here, and you’ll feel much better.
Cheers,
Bob
What do you think of considering a combination instead of contrasting the different packages? JMP seems to be very user-friendly and could be used for everyday statistical work. In addition it has an interface to R for more complex and/or innovative statistical methods.
Hi Winfried,
I think that’s a great way to use R. JMP is a wonderful package and it has a good interface. Many packages – SAS, SPSS, Statistica – let you do all your work in the main package and then just call R for the thing they don’t yet do. I show the basics of R for such use, plus ways to call R from several other packages here.
Cheers,
Bob
Dear Bob,
Thanks for your positive feedback and the hint to your valuable document. I intend to further explore the way forward with JMP and R. I have 12 years of experience with JMP but I am just starting with R.
Best regards
Winfried
Interesting blog post. Thanks for sharing!
Here is one more way of looking at these data (total hits, and proportion of total hits):
https://docs.google.com/spreadsheets/d/1hJ2qg8F9G_1xLBiM6BdQnYwlYq2tGOa7PY65Dtx_uTU/pubhtml
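For readers who want to reproduce the idea of total hits and proportion of total hits without the spreadsheet, here is a minimal sketch in Python. It is not necessarily how the spreadsheet above was built. The counts are copied from the Google Scholar table in the post for three sample years; the names `HITS` and `shares` are just illustrative choices.

```python
# Google Scholar hit counts, copied from the table in the post.
# Only three sample years are included to keep the sketch short.
HITS = {
    2005: {"R": 2210, "SAS": 55100, "SPSS": 171000, "Stata": 2980},
    2010: {"R": 11500, "SAS": 52000, "SPSS": 109000, "Stata": 8890},
    2012: {"R": 17000, "SAS": 33500, "SPSS": 49400, "Stata": 14700},
}


def shares(counts):
    """Return each package's proportion of the year's total hits."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Print the yearly total and each package's share of that total.
for year, counts in sorted(HITS.items()):
    total = sum(counts.values())
    row = ", ".join(
        f"{name} {share:.1%}" for name, share in shares(counts).items()
    )
    print(f"{year}: total {total:,} | {row}")
```

Looking at shares rather than raw counts separates a package's change in popularity from changes in the overall number of hits these four packages receive in a given year.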
Hi David,
Those are nice, thanks!
Bob
Very unscientific with anecdotal results. Conclusions are spurious at best. Any Researchers disagree with that?
Mark Ezzo,
Since I didn’t draw any conclusions, I will assume your comment is in response to the blog post rather than my comment even though you quoted me. I would not disagree that there are many significant limitations to his analysis and conclusions, not the least of which is the use of counts over rates, a point my graphs subtly make for those astute enough to pick up on it. Maybe you could improve your feedback by offering your take on how his analysis could be improved. Better yet, offer actual analysis. If nothing else, his work is a conversation starter.
Hi Bob, I don’t know a single thing about all the analysis and packages being discussed above (except maybe the names, such as R or SAS), but I still enjoyed reading the entire blog post as well as the comments. What impressed me most was the sane tone of all the comments and the commentators (no frothing at the mouth!), your willingness to look from the other person’s POV, and your easy acceptance and sometimes humorous responses. I wish both the internet world and the real world had loads more sanity, rationality, and willingness to agree or to agree to disagree… Too much idealism? I just wanted to share my appreciation of your and the other commentators’ easy discussion.
Hi Raj,
Thanks for your comments. Having seen some flaming comments in other Internet discussions, I too am surprised that the comments here have remained so civil. While the first post a person makes requires my approval before it appears, I don’t recall turning anyone down for comments that were overly rude or lacking in useful content. Here’s hoping that it continues that way!
Cheers,
Bob
Adding a few important points: comparing SAS vs. R is not comparing apples to apples. If we compare R with “Complete SAS”, then R is nowhere in the league. If we compare R with Base SAS, then R is better, as it gives for free all the facilities that Base SAS gives.
R is hard-core programming, but open source (free), and it can be used to develop enterprise-wide analytics applications. It has thousands of built-in functions to solve complex analytics problems. For those from a non-programming background, it is a little difficult to master. R is widely used in academia and SMEs.
SAS is a complete end-to-end solution, from data management to data visualization to ETL to BI reporting to advanced analytics. Even with SAS EIS and SAS EF we can develop enterprise-wide, large-scale applications. SAS SCL can also be used as a general-purpose OO programming language like Java or C++. More importantly, SAS EG or E-Miner has a user-friendly GUI, which can be used comfortably by non-programmers. In industry, SAS is widely used for the following reasons.
1. Data security is very high, for this reason in BFSI domain SAS is no. 1 choice.
2. SAS has over 250 industry-specific “point and click” solutions, such as credit and market risk, asset/performance management, and fraud/pricing/marketing analytics.
3. SAS DI is a very powerful ETL tool which augments the power of SAS in Data management.
4. SAS BI provides the facility to produce world class dashboard in real time.
5. In corporate settings, time has more value than product cost. 90% of Fortune 100 companies use SAS.
6. SAS Visual Analytics analyzes big data in real time with great visualization, like Tableau.
In the future, when competition grows, Base SAS could be made free, like SAS University Edition. SAS is much more than just an analytic tool and it will always remain dominant in the future.
Hi Wasifur,
You make many good points. I agree SAS is excellent software and not directly equivalent to R. On the other hand, R does many, many, things that SAS does not do. Flip through the categories here http://cran.rstudio.com/web/views/ and you’ll see what I mean. The cluster analysis section there is a good example of the staggering array of methods R offers. Whether or not you need them is another question, of course!
Cheers,
Bob
Can you please elaborate on the teaching of SAS with real-world examples being “… forbidden to academics by SAS and SPSS licenses”? Thanks
Hi SnoreHorse,
If you read your SAS or SPSS license carefully you’ll see that it forbids the use for the benefit of third parties. Solving real problems that companies face is a great way to learn, but even if the work is done for free, it’s forbidden by the license. This makes sense because the companies don’t want to sell one copy to a university who can then turn around and provide service to dozens of other companies, robbing them of revenue.
IBM offers greater flexibility on this front. Through their SPSS Academic Initiative, they allow problem solving for third parties, but only if the solution is publicly available. That’s a nice step in the right direction, but most companies don’t want their solutions made public. With open source software, the licenses allow you to do anything you like with the software.
Cheers,
Bob
1. Google Scholar reports on academic uses of software. Reports in market research, banking, pharma, etc., usually are not made public, so the graphs mostly represent academic research.
2. Social sciences are among the heavy users/consumers of statistical packages.
This means that, in order to have a substantial interpretation of the graphs, one should also consider the practices in the academic communities in social sciences. What we see there, at least after the mid-2000s, is the proliferation of papers reporting multilevel (MLM) and structural equations models (SEM). They add to increasing availability of panel data (and, consequently, of reporting fixed effects models), and to a smaller trend to report latent-class-analysis models. This is the current standard in sociology, for instance.
Let me see what SPSS can properly perform of all these… Well, I would say that it is not able to do any of them as it should be done (in the case of MLM, for instance, it lacks proper handling of weights).
Stata is much better, it has good procedures for MLM and panel data, and they have implemented SEM. SAS is the same, R also does panel data analysis and MLM.
With MLM, things are more complicated: I would expect HLM, MLwiN and MPlus to have a larger share of the pie. The first two packages are dedicated to MLM, they have basically promoted the technique, and MLwiN is free for users based in UK universities.
With SEM, the best options remain Amos (from SPSS), LISREL, MPlus and EQS.
For LCA, there is Latent Gold, and MPlus.
MPlus is the only one (to my knowledge) to be able to combine SEM and MLM, which should give the software an important advantage in the near future, despite its higher costs.
(eViews may also be considered among the competitors in the graphs, due to its relative popularity among economists)
This means that the decreasing share of SPSS and SAS might be due to the increasing trends of the narrowly specialised packages. Forecasting should therefore consider at least the dynamics of these packages.
In my view, SPSS will continue to dominate the European market, particularly for its data-handling capabilities, which I find easier to understand than the ones in Stata (SAS is quite rare in European social sciences). However, it will be complemented (not replaced) by Amos, LISREL, MPlus, and R. (Remember, AMOS is SPSS). I would expect that IBM buys one of the MLM providers and integrates it with SPSS (as SPSS did in the 1990s with LISREL, to replace it later with Amos).
R still lacks constancy. For instance, with respect to the use of cross-classified models (a species of MLM): the lmer function (from the lme4 package) in R changed several times in the past two years. Almost every update produced trouble: the previously written script did not run with the new version, a bug prevented the summary() function from displaying the results, etc. At the same time, Stata improved the speed of xtmixed, which makes it superior. Mplus also improved…
Hi Bogdan,
You make an excellent point about specialty software like Amos, EQS, Mplus, etc. being used in the types of articles that Google Scholar searches. I don’t track software that is so specialized simply due to lack of time.
Cheers,
Bob
So it’s been two years since this was posted, and while R has continued its impressive growth, it hasn’t taken over, nor have SPSS/SAS died out. The same debates over IDEs, GUIs, inconsistency/unreliability across versions, and various definitions of “free” are still playing out throughout academia. But in all of this I think something very important, showstoppingly important in fact, has been completely and totally overlooked: the quality of education new students are receiving.
I’ve been building computers for decades, I use Linux full time on most of my work machines (even giving Arch a whirl for fun), I have some background with coding, and even I found myself left behind compared to my peers who had been taught SPSS or Stata. Other students without the benefit of that kind of background were simply lost. In two short tutoring sessions with my thesis chair I learned more about quantitative analysis and more practical skills to carry it out in SPSS than I did in an entire semester studying R under a professor who literally wrote a textbook on the subject. It’s a testament to how unfit for general consumption R is that even our comprehensive exams, produced by that same professor, were made using SPSS and not R.
I’m sure R has many wonderful advanced applications, but unless universities plan on both investing in a dedicated staff to maintain their own functions and teaching students true proficiency in actual coding, they will accomplish nothing other than handicapping entire generations of students.
Hi Whipsmemory,
You can read the latest version of this topic here. You make an excellent point about the difficulty of using R. It took me far longer to get good at R than any other statistics package. You may enjoy reading my post, Why R is Hard to Learn.
Cheers,
Bob
In 1975, I became aware of SPSS, SAS, and BMDP? (from UCLA). I was asked by a professor if I was aware of SAS and SPSS. The answer was no. The development of SAS (or SPSS) was from a contract from the Navy to modernize the BMDP software. Gradually, I used both SAS and SPSS. I was unaware of the other statistical packages but became proficient with SAS and SPSS. I have not used those packages since 1991. Nevertheless, I was very impressed with them.
Hi Freddy,
BMDP was a great package! I guess you can still buy it, but I haven’t run it in years. I first used it and SPSS in 1978. UT helped SAS get started when it was just three people. We used to have the source code, but that was long ago.
Cheers,
Bob
I first heard of R about 10 years ago, when we were using SPSS to teach basic statistics to students studying Ecology. The assumed mathematics background of our students is best described as very basic primary level. When I first heard about R, just verbally, I heard the term R GUI. Naively I thought, “Oh, it is a graphical user interface package like SPSS, for example”. Of course, when I downloaded R I was most disappointed: the so-called GUI part was just menu selections to download the massive numbers of all sorts of packages, which for a first-time user is just too overwhelming. On top of that, there were hundreds of all sorts of obscure, cryptic commands, functions and god knows what. It sort of had a C feel to it, but of course was more of an interpreter.
It did not take me long to just give up on whatever R was. But about 4 years ago I had to teach a basic statistics course across three university campuses: one in a remote location, one close to a major city, and one in a city itself.
I then heard about the package R Commander. Now, I know I will get people looking down their noses at this package, but it truly can be thought of as R with training wheels. The problem with SPSS at the time was that students could not access SPSS from their homes. What we managed to do on all university computers was to set things up so that when a student clicked on the R icon, the R Commander package would automatically come up. I have found that using R Commander is not much different from using SPSS; if anything, it’s easier if it’s taught properly.
Also, as you use R Commander you see the scripting language in the input window and can thus see ways of modifying it and playing with it. We are seeing real positive results now with our students using an advanced freeware package. Sure, it’s not everything, but I have found that now that I am moving away from R Commander and starting to use RStudio, the more I use R the more I love it.
The great thing is that we have thousands of students now using a GUI package that gets them started. It does require good teaching methods, just like using any GUI statistics package.
I too am really quite agnostic as to what mathematical tools I use, but I do love the idea of open source software and being in control as much as is possible in this day and age.
I really hope the future of R is very positive, but it does require a huge effort in early teaching to get hooked on it.
Hi Alan,
Thanks for that interesting story. I’m currently working with a combination of the free and open source KNIME package, which includes a nice interface to R. It uses a workflow approach, which allows you to re-use the “program” (diagram, really) on new data while still working in point-and-click mode. R Commander is nice, but to reuse the steps, you have to understand R. If you understood R, you wouldn’t need R Commander in the first place. I do think that R Commander shares many of the strengths and weaknesses of SPSS’ GUI interface.
Cheers,
Bob
As a usability researcher, there are maybe one or 2 instances per year where I need non-linear or logistic regression. Almost everything else can be done in Excel. I cannot cost justify a yearly license of $2,600. Is there just not the volume there for a lower cost? Development can’t cost that much on that product.