R Passes SAS in Scholarly Use (finally)

Way back in 2012 I published a forecast that showed that the use of R for scholarly publications would likely pass the use of SAS in 2015. But I didn’t believe the forecast since I expected the sharp decline in SAS and SPSS use to level off. In 2013, the trend accelerated and I expected R to pass SAS in the middle of 2014. As luck would have it, Google changed their algorithm, somehow finding vast additional quantities of SAS and SPSS articles. I just collected data on the most recent complete year of scholarly publications, and it turns out that 2015 was indeed the year that R passed SAS to garner the #2 position. Once again, models do better than “expert” opinion! I’ve updated The Popularity of Data Analysis Software to reflect this new data and include it here to save you the trouble of reading the whole 45 pages of it.

If you’re interested in learning R, you might consider reading my books R for SAS and SPSS Users, or R for Stata Users. I also teach workshops on R, but I’m currently booked through mid October, so please plan ahead.

Figure 2a. Number of scholarly articles found in the most recent complete year (2015) for each software package.

Scholarly Articles

Scholarly articles are also rich in information and backed by significant amounts of effort. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool or even an object of study. The software that is used in scholarly articles is what the next generation of analysts will graduate knowing, so it’s a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I re-collect the data for all years.

Figure 2a shows the number of articles found for each software package for the most recent complete year, 2015. SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. For the first time ever, R is in second place with around half as many articles. Although now in third place, SAS is nearly tied with R. Stata and MATLAB are essentially tied for fourth and fifth place. Starting with Java, usage slowly tapers off. Note that the general-purpose software C, C++, C#, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those as much rougher counts than the rest. Since Scala and Julia have a heavy data science angle to them, I cut them some slack by not adding any data science terms to the search, not that it helped them much!

From Spark on down, the counts appear to be zero. That’s not the case; the counts are just very low compared to the more popular packages, which are used in tens of thousands of articles. Figure 2b shows only those packages that have fewer than 1,200 articles (i.e. the bottom part of Fig. 2a), so we can see how they compare. Spark and RapidMiner top the list of these packages, followed by KNIME and BMDP. There’s a slow decline in the group that goes from Enterprise Miner to Salford Systems. Then comes a group of mostly relatively new arrivals beginning with Microsoft’s Azure Machine Learning. A package that’s not a new arrival is from Megaputer, whose Polyanalyst software has been around for many years now, with little progress to show for it. Dead last is Lavastorm, which to my knowledge is the only commercial package that includes Tibco’s internally written version of R, TERR.

Figure 2b. The number of scholarly articles for software that was used by fewer than 1,200 scholarly articles (i.e. the bottom part of Fig. 2a, rescaled.)

Figures 2a and 2b are useful for studying market share as it is now, but they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting such data is too time consuming since it must be re-collected every year (since Google’s search algorithms change). What I’ve done instead is collect data only for the past two complete years, 2014 and 2015. Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red. Those whose use is declining or “cooling” are shown in blue. Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2014.
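For readers who want to try this kind of comparison themselves, here is a minimal Python sketch of the percent-change calculation. The package names and article counts are invented for illustration and are not the data behind Figure 2c; only the 2014-to-2015 comparison and the 500-article cutoff come from the text above.

```python
# Hypothetical article counts per package for two consecutive years.
# These numbers are invented for illustration; they are not the real data.
counts = {
    "PackageA": {2014: 12000, 2015: 15000},
    "PackageB": {2014: 40000, 2015: 30000},
    "PackageC": {2014: 300, 2015: 900},  # below the 500-article cutoff
}

MIN_ARTICLES = 500  # cutoff applied to the earlier year, as described above


def percent_change(old, new):
    """Return the percent change from old to new."""
    return (new - old) / old * 100.0


def growth_table(counts, first_year, second_year, min_articles=MIN_ARTICLES):
    """Percent change per package, skipping packages under the cutoff."""
    table = {}
    for name, by_year in counts.items():
        if by_year[first_year] < min_articles:
            continue  # tiny bases produce misleading percentages
        table[name] = percent_change(by_year[first_year], by_year[second_year])
    return table


changes = growth_table(counts, 2014, 2015)
for name, pct in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    label = "hot" if pct > 0 else "cooling"
    print(f"{name}: {pct:+.1f}% ({label})")
```

The cutoff matters because a package that goes from 300 to 900 articles would show a 200% gain while still being a very small player, which would distort a plot like this one.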

Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2014 to 2015). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Python is the fastest growing. Note that the Python figures are strictly for data science use as defined here. The open-source KNIME and RapidMiner are the second and third fastest growing, respectively. Both use the easy yet powerful workflow approach to data science. Figure 2b showed that RapidMiner has almost twice the marketshare of KNIME, but here we see use of KNIME is growing faster. That may be due to KNIME’s greater customer satisfaction, as shown in the Rexer Analytics Data Science Survey. The companies are two of only four chosen by IT advisory firm Gartner, Inc. as having both a complete vision of the future and the ability to execute that vision (Fig. 3a).

R is in fourth place in growth, and given its second place in overall marketshare, it is in an enviable position.

At the other end of the scale are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use. Hadoop use declined slightly, perhaps as people turned to alternatives Spark and H2O.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I’ve plotted the same scholarly-use data for 1995 through 2015, the last complete year of data when this graph was made. As in Figure 2a, SPSS has a clear lead, but now you can see that its dominance peaked in 2008 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and it also peaked around 2008. Note that the decline in the number of articles that used SPSS or SAS is not balanced by the increase in the other software shown in this particular graph. However, if you add up all the other software shown in Figure 2a, you come close. There still seems to be a slight decline in people reporting the particular software tool they used.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only a single point of SAS usage in 2015. The result is shown in Figure 2e. Freeing up so much space in the plot now allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack (recall that the curve for SAS has a steep downward slope). If the current trends continue, R will cross SPSS to become the #1 software for scholarly data science use by the end of 2017. Stata use is also growing more quickly than the rest. Note that trends have shifted before as discussed here. Statistica, Minitab, Systat and JMP are next in popularity, respectively, with their growth roughly parallel to one another.

Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Using a logarithmic y-axis scales down the more popular packages, allowing us to see the full picture in a single image (Figure 2f). This view makes it clearer that R use has passed that of SAS, and that Stata use is closing in on it. However, even when one studies the y-axis values carefully, it can be hard to grasp how much the logarithmic transformation has changed the values. For example, the 2015 value for SPSS is well over twice the value for R. The original scale shown in Figure 2d makes that quite clear.
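To get a feel for how strongly a base 10 logarithm compresses such differences, here is a small Python sketch. The two article counts are hypothetical, chosen only so that one is well over twice the other; they are not the values plotted in the figures.

```python
import math

# Hypothetical article counts, chosen only so that one is 2.5 times the other.
spss_articles = 50000
r_articles = 20000

# Gap on the original scale versus gap on a base 10 logarithmic axis.
ratio = spss_articles / r_articles
log_gap = math.log10(spss_articles) - math.log10(r_articles)

print(f"ratio on the original scale: {ratio:.1f}x")
print(f"gap on the log10 axis: {log_gap:.2f} units")
```

On a log10 axis a full tenfold difference spans only one unit, so a package with more than double the articles sits less than half a unit above the other.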

Figure 2f. A logarithmic view of the number of scholarly articles found in each year by Google Scholar. This combines the previous two figures into one by compressing the y-axis with a base 10 logarithm.

Posted in Analytics, Data Science, R, SAS, SPSS, Stata, Statistics, Uncategorized | 18 Comments

Rexer Data Science Survey: Satisfaction Results

by Bob Muenchen

I previously reported on the initial results of Rexer Analytics’ 2015 survey of data science tools here. More results are now available, and the comprehensive report should be released soon. One of the more interesting questions on the survey was, “Please rate your overall satisfaction with [your previously chosen software].” Most of the measures I report in my regularly-updated article, The Popularity of Data Analysis Software, are raw measures of usage, so it’s nice to have data that goes beyond usage and into satisfaction. The results are shown in the figure below for the more popular software (other software had very small sample sizes and so is not shown).

Results from the question, “Please rate your overall satisfaction with [your previously chosen software].” Only software with a substantial number of responses is shown.

People reported being somewhat satisfied with their chosen tool, which doesn’t come as much of a surprise. If they weren’t at least somewhat satisfied, they would be likely to move on to another tool. What really differentiated the tools was the percent of people who reported being extremely satisfied. The free and open source KNIME program came out #1 with 69% of its users being extremely satisfied. (KNIME is also the 2nd fastest growing data science package among scholarly researchers.) IBM SPSS Modeler came in second with 60%, followed closely by R with 57%.

Both of the top two packages use the workflow user interface which has many advantages that I’ve written about here and here. However, RapidMiner and SAS Enterprise Miner also use the workflow interface, and their percent of extremely satisfied customers were less than half at 32% and 29%, respectively. We might wonder if people are more satisfied with KNIME because they’re using the free desktop version, but RapidMiner also has a free version, so cost isn’t a factor on that comparison.

Although both R and SAS have menu-based interfaces, they are predominantly programming languages. R has almost triple the number of extremely satisfied users, which may be the result of its being generally viewed as the more powerful language, albeit somewhat harder to learn. The fact that R is free while SAS is not may also be a factor in that difference.

I’ve been learning KNIME and its interface to R, which looks like a stripped down version of RStudio. You can see a video demonstration by Heiko Hofer here. If you’re planning on attending the UseR! 2016 conference at Stanford University this year, stop by my poster session Helping Non-programmers Use R through the use of KNIME.

Posted in Analytics, Data Science, R, SAS, SPSS, Statistics, Uncategorized | 3 Comments

R Training at Nicholls State University

I’ll be presenting two workshops back-to-back at Nicholls State University in Thibodaux, Louisiana, June 14-16. The first workshop will cover a broad range of R topics. Each topic will include a brief comparison of how R differs from the popular commercial data science packages SAS, SPSS, and Stata. If you have a background in any of those packages, you’ll know the things that are likely to trip you up as you learn R. If not, you’ll just come away with a solid intro to R along with how it compares to other software.

The second workshop will focus on the broad range of data management tasks that are usually needed to prepare your data for analysis. Both workshops will be done using the very latest R packages to make things as easy as possible. These packages include tibble, dplyr, magrittr, stringr, lubridate, broom, and more.

Seats are still available at no charge, with the Nicholls State community and other academics getting priority seating. To register for the workshop, please contact Professor Allyse Ferrara, 985/448-4736, or allyse.ferrara@nicholls.edu.

Posted in Analytics, Data Management, Data Science, R, Uncategorized | Leave a comment

R’s Growth Continues to Accelerate

Each year I update the growth in R’s capability on The Popularity of Data Analysis Software. And each year, I think R’s incredible rate of growth will finally slow down. Below is a graph of the latest data, and as you can see, R’s growth continues to accelerate.

Since I’ve added coverage for many more software packages, I have restructured the main article to reflect the value of each type of data. They now appear in this order:

  • Job Advertisements
  • Scholarly Articles
  • IT Research Firm Reports
  • Surveys of Use
  • Books
  • Blogs
  • Discussion Forum Activity
  • Programming Popularity Measures
  • Sales & Downloads
  • Competition Use
  • Growth in Capability

Growth in Capability remains last because I only have complete data for R. To save you from having to dig through all 40+ pages of the article, the updated section is below. I’ll be updating several other sections in the coming weeks. If you’re interested, you can follow this blog, or follow me on Twitter as @BobMuenchen.

If you haven’t yet learned R, I recommend my books R for SAS and SPSS Users and R for Stata Users. I do R training as well, but that’s booked up through the end of August, so please plan ahead.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/ for each version of R. To simplify ongoing data collection, I kept only the values for the last version of R released each year (usually in November or December), and collected data through the most recent complete year.

These data are displayed in Figure 10. The right-most point is for version 3.2.3, released 12/10/2015. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995).
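As a rough illustration of what a quadratic fit and its R-squared involve, here is a self-contained Python sketch using ordinary least squares. The yearly counts are hypothetical and are not the CRAN data shown in the figure, and the helper names `solve_3x3` and `quadratic_fit` are my own; this is one way to do such a fit, not necessarily how the curve above was produced.

```python
def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def quadratic_fit(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2; returns (coefficients, R-squared)."""
    s = [sum(x ** k for x in xs) for k in range(5)]  # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    normal = [[s[i + j] for j in range(3)] for i in range(3)]  # normal equations
    coeffs = solve_3x3(normal, t)
    fitted = [coeffs[0] + coeffs[1] * x + coeffs[2] * x * x for x in xs]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return coeffs, 1.0 - ss_res / ss_tot


# Hypothetical package counts by years since a starting year (not the CRAN data).
years = [0, 1, 2, 3, 4, 5, 6, 7]
packages = [110, 190, 330, 480, 720, 950, 1260, 1580]

coeffs, r_squared = quadratic_fit(years, packages)
print("coefficients (c0, c1, c2):", [round(c, 2) for c in coeffs])
print("R-squared:", round(r_squared, 4))
```

Measuring years from a starting year rather than using calendar years such as 2015 keeps the powers of x small, which helps the normal equations stay numerically well behaved.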

Figure 10. Number of R packages available on its main distribution site for the last version released in each year.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, and QC). In 2015, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2015 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Fig. 10. A program run on 4/19/2016 counted 11,531 R packages at all major repositories, 8,239 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates to CRAN that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 40% higher on the y-axis than the one shown in Figure 10.

As with any analysis software, individuals also maintain their own separate collections available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 19.78 functions per package. Given the package count of 11,531, as of 4/19/2016 there were approximately 228,103 total functions in R. In total, R has approximately 190 times as many commands as its main commercial competitor, SAS.
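The arithmetic behind these comparisons is easy to check. The short Python sketch below uses only figures quoted in this section; the variable names are mine. Note that multiplying by the rounded average of 19.78 gives a total slightly below the 228,103 quoted above, presumably because the published average was rounded.

```python
# Figures quoted in the text above.
all_repo_packages = 11531      # packages at all major repositories, 4/19/2016
cran_packages = 8239           # of which on CRAN
functions_per_package = 19.78  # average reported by Rdocumentation (rounded)
r_functions_quoted = 228103    # total R functions as stated in the text
sas_commands = 1200            # approximate SAS 9.3 command count

# How much higher the all-repository curve would sit than the CRAN-only curve.
extra_share = all_repo_packages / cran_packages - 1
print(f"all repositories vs. CRAN: {extra_share:.0%} higher")

# Estimated total functions using the rounded average.
r_functions_estimate = all_repo_packages * functions_per_package
print(f"estimated R functions: {r_functions_estimate:,.0f}")

# Ratio of R functions to SAS commands.
print(f"R-to-SAS ratio: {r_functions_quoted / sas_commands:.0f}x")
```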

Posted in Analytics, Data Science, R, SAS, Uncategorized | 2 Comments

Advanced Analytics Software’s Most Important Feature? Gartner Says it’s VCF

The IT research firm, Gartner, Inc. has released its February 2016 report, Magic Quadrant for Advanced Analytics Platforms. The report’s main graph shows the completeness of each company’s vision plotted against its ability to achieve that vision (Figure 1.) I include this plot each year in my continuously updated article, The Popularity of Data Analysis Software, along with a brief summary of its major points. The full report is always interesting reading and, if you act fast, you can download it free from RapidMiner’s web site.

Figure 1. Gartner Magic Quadrant for 2016. What’s missing?

If you compare Figure 1 to last year’s plot (Figure 2), you’ll see a few noteworthy changes, but you’re unlikely to catch the radical shift that has occurred between the two. Both KNIME and RapidMiner have increased their scores slightly in both dimensions. KNIME is now rated as having the greatest vision within the Leaders quadrant. Given how much smaller KNIME Inc. is than IBM and SAS Institute, that’s quite an accomplishment. Dell has joined them in the Leaders quadrant through its acquisition of Statistica. Microsoft increased its completeness of vision, in part by buying Revolution Analytics. Accenture joined the category through its acquisition of i4C Analytics. LavaStorm and Megaputer entered the plot in 2016, though Gartner doesn’t specify why. These are all interesting changes, but they don’t represent the biggest change of all.

The watershed change between these two plots is hinted at by two companies that are missing in the more recent one: Salford Systems and Tibco. The important thing is why they’re missing. Gartner excluded them this year, “…due to not satisfying the [new] visual composition framework [VCF] inclusion criteria.” VCF is the term they’re using to describe the workflow (also called streams or flowcharts) style of Graphical User Interface (GUI). To be included in the 2016 plot, companies must have offered software that uses the workflow GUI. What Gartner is saying is, in essence, that advanced analytics software that does not use the workflow interface is not worth following!

Figure 2. Gartner Magic Quadrant for 2015.

Though the VCF terminology is new, I’ve long advocated its advantages (see What’s Missing From R). As I described there:

“While menu-driven interfaces such as R Commander, Deducer or SPSS are somewhat easier to learn, the flowchart interface has two important advantages. First, you can often get a grasp of the big picture as you see steps such as separate files merging into one, or several analyses coming out of a particular data set. Second, and more important, you have a precise record of every step in your analysis. This allows you to repeat an analysis simply by changing the data inputs. Instead, menu-driven interfaces require that you switch to the programs that they create in the background if you need to automatically re-run many previous steps. That’s fine if you’re a programmer, but if you were a good programmer, you probably would not have been using that type of interface in the first place!"

As a programming-oriented consultant who works with many GUI-oriented clients, I also appreciate the blend of capabilities that workflow GUIs provide. My clients can set up the level of analysis they’re comfortable with, and if I need to add some custom programming, I can do so in R or Python, blending my code right into their workflow. We can collaborate, each using his or her preferred approach. If my code is widely applicable, I can put it into distribution as a node icon that anyone can drag into their workflow diagram.

The Gartner report offers a more detailed list of workflow features. They state that such interfaces should support:

  • Interactive design of workflows from data sources to visualization, modeling and deployment using dragging and dropping of building blocks on a visual palette
  • Ability to parameterize the building blocks
  • Ability to save workflows into files and libraries for later reuse
  • Creation of new building blocks by composing sets of building blocks
  • Creation of new building blocks by allowing a scripting language (R, JavaScript, Python and others) to describe the functionality of the input/output behavior

I would add the ability to color-code and label sections of the workflow diagram. That, combined with the creation of metanodes or supernodes (creating one new building block from a set of others), helps keep a complex workflow readable.

Implications

If Gartner’s shift in perspective resulted in them dropping only two companies from their reports, does this shift really amount to much of a change? Hasn’t it already been well noted and dealt with? No, the plot is done at the company level. If it were done at the product level, many popular packages such as SAS (with its default Display Manager System interface) and SPSS Statistics would be excluded.

The fields of statistics, machine learning, and artificial intelligence have been combined psychologically by their inclusion into broader concepts such as advanced analytics or data science. But the separation of those fields is still quite apparent in the software tools themselves. Tools that have their historical roots in machine learning and artificial intelligence are far more likely to have implemented workflow GUIs. However, while they have a more useful GUI, they tend to still lack a full array of common statistical methods. For example, KNIME and RapidMiner can only handle very simple analysis of variance problems. When such companies turn their attention to this deficit, the more statistically oriented companies will face much stiffer competition. Recent versions of KNIME have already made progress on this front.

SPSS Modeler can access the full array of SPSS Statistics routines through its dialog boxes, but the two products lack full integration. Most users of SPSS Statistics are unaware that IBM offers control of their software through a better interface. IBM could integrate the Modeler interface into SPSS Statistics so that all its users would see that interface when they start the software. Making their standard menu choices could begin building a workflow diagram. SPSS Modeler could still be sold as a separate package, one that added features to SPSS Statistics’ workflow interface.

A company that is on the cutting edge of GUI design is SAS Institute. Their SAS Studio is, to the best of my knowledge, unique in its ability to offer four major ways of working. Its program editor lets you type code from memory using features far advanced from their aging Display Manager System. It also offers a “snippets" feature that lets you call up code templates for common tasks and edit them before execution. That still requires some programming knowledge, but users can depend less on their memory. The software also has a menu & dialog approach like SPSS Statistics, and it even has a workflow interface. Kudos to SAS Institute for providing so much flexibility! When students download the SAS University Edition directly from SAS Institute, this is the only interface they see.

SAS Studio currently supports a small, but very useful, percent of SAS’ overall capability. That needs to be expanded to provide as close to 100% coverage as possible. If the company can eventually phase out their many other GUIs (Enterprise Guide, Enterprise Miner, SAS/Assist, Display Manager System, SAS/IML Studio, etc.), merging that capability into SAS Studio, they might finally earn a reputation for ease of use that they have lacked.

In conclusion, the workflow GUI has already become a major type of interface for advanced analytics. My hat is off to the Gartner Group for taking a stand on encouraging its use. In the coming years, we can expect to see the machine learning/AI software adding statistical features, and the statistically oriented companies continuing to add more to their workflow capabilities until the two groups meet in the middle. The companies that get there first will have a significant strategic advantage.

Acknowledgements

Thanks to Jon Peck for suggestions that improved this post.

Posted in Analytics, R, SAS, SPSS, Statistics, Uncategorized | 2 Comments

Business Intelligence and Data Science Groups in East Tennessee

The Knoxville area has four groups that help people learn about business intelligence and data science.

The Knoxville R Users Group (KRUG) focuses on the free and open source R language. Each meeting begins with a bit of socializing followed by a series of talks given by its members or guests. The talks range from brief five-minute demos of an R function to 45-minute in-depth coverage of some method of analysis. Beginning tutorials on R are occasionally offered as well. Membership is free of charge, but donations are accepted to defray the cost of snacks and web site maintenance. You can join at the KRUG web site.

Data Science KNX is a group of people interested in the broad field of data science. Members range from beginners to experts. As their web site states, their “…aim is to maintain a forum for connecting people around data science specific topics such as tutorials and their applications, local success stories, discussions of new technologies, and best practices. All are welcome to attend, network, and present!” You can join at the Data Science KNX web site. Membership is free, though the group gladly accepts donations to help defray the costs of the pizza and beer provided at their meetings.

The East Tennessee Business Intelligence Users Group is “committed to learning, sharing, and advancing the field of Business Intelligence in the East Tennessee region." They meet several times each year featuring speakers who demonstrate business intelligence software such as IBM’s Watson and Microsoft’s PowerBI. Meetings are at lunch and a meal is provided by sponsoring companies. Membership is free, and so is the lunch! You can join the group at their web site.

Each spring and fall, The University of Tennessee’s Department of Business Analytics and Statistics offers a Business Analytics Forum that features speakers from both industry and academia. The group consists of non-competing companies for whom business analytics is an important part of their operation. Forum members work together to share best practices and to develop more effective strategies. The forum is open to paid members only and you can join on their registration page.

Posted in Analytics, R, Statistics, Uncategorized | Leave a comment

Using Discussion Forum Activity to Estimate Analytics Software Market Share

I’m finally getting around to overhauling the Discussion Forum Activity section of The Popularity of Data Analysis Software. To save you the trouble of reading all 43 pages, I’m posting just this section below.

Discussion Forum Activity

Another way to measure software popularity is to see how many people are helping one another use each package or language. While such data is readily available, it too has its problems. Menu-driven software like SPSS or workflow-driven software such as KNIME is quite easy to use and tends to generate fewer questions. Software controlled by programming requires the memorization of many commands and so requires more support. Even among languages, some are harder to use than others, generating more questions (see Why R is Hard to Learn).

Another problem with this type of data is that there are many places to ask questions and each has its own focus. Some are interested in a classical statistics perspective while others have a broad view of software as general-purpose programming languages. In recent years, companies have set up support sites within their main corporate web site, further splintering the places you can go to get help. Usage data for such sites is not readily available.

Another problem is that it’s not as easy to use logic to focus in on specific types of questions as it was with the data from job advertisements and scholarly articles discussed earlier. It’s also not easy to get the data across time to allow us to study trends. Finally, the things such sites measure include: software group members (a.k.a. followers), individual topics (a.k.a. questions or threads), and total comments across all topics (a.k.a. total posts). This makes combining counts across sites problematic.

Two of the biggest sites used to discuss software are LinkedIn and Quora. They both display the number of people who follow each software topic, so combining their figures makes sense. However, since the sites lack any focus on analytics, I have not collected their data on general purpose languages like Java, MATLAB, Python or variants of C. The results of data collected on 10/17/2015 are shown here:

LinkedIn_Quora_2015

We see that R is the dominant software and that moving down through SAS, SPSS, and Stata results in a loss of roughly half the number of people in each step. Lavastorm follows Stata, but I find it odd that there was absolutely zero discussion of Lavastorm on Quora. The last bar that you can even see on this plot is the 62 people who follow Minitab. All the ones below that have tiny audiences of fewer than 10.

Next let’s examine two sites that focus only on statistical questions: Talk Stats and Cross Validated. They both report the number of questions (a.k.a. threads) for a given piece of software, allowing me to total their counts:

CrossValidated_TalkStats_2015

We see that R has a 4-to-1 lead over the next most popular package, SPSS. Stata comes in at 3rd place, followed by SAS. The fact that SAS is in fourth place here may be due to the fact that it is strong in data management and report writing, which are not the types of questions that these two sites focus on. Although MATLAB and Python are general purpose languages, I include them here because the questions on this site are within the realm of analytics. Note that I collected data on as many packages as were shown in the previous graph, but those not shown have a count of zero. Julia appears to have a count of zero due to the scale of the graph, but it actually had 5 questions on Cross Validated.

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R? I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Rexer Analytics Survey Results

Rexer Analytics has released preliminary results showing the usage of various data science tools. I’ve added the results to my continuously-updated article, The Popularity of Data Analysis Software. For your convenience, the new section is repeated below.

Surveys of Use

One way to estimate the relative popularity of data analysis software is through a survey. Rexer Analytics conducts such a survey every other year, asking a wide range of questions regarding data science (previously referred to as data mining by the survey itself). Figure 6a shows the tools that the 1,220 respondents reported using in 2015.

Figure 6a. Analytics tools used by respondents to the Rexer Analytics Survey. In this view, each respondent was free to check multiple tools.

We see that R has a more than 2-to-1 lead over the next most popular packages, SPSS Statistics and SAS. Microsoft’s Excel Data Mining software is slightly less popular, but note that it is rarely used as the primary tool. Tableau comes next, also rarely used as the primary tool. That’s to be expected as Tableau is principally a visualization tool with minimal capabilities for advanced analytics.

The next batch of software appears at first to be all in the 15% to 20% range, but KNIME and RapidMiner are listed both in their free versions and, much further down, in their commercial versions. These data come from a “check all that apply” type of question, so if we add the two amounts, we may be overcounting anyone who checked both versions. However, the survey also asked, “What one (my emphasis) data mining / analytic software package did you use most frequently in the past year?” Using these data, I combined the free and commercial versions and plotted the top 10 packages again in Figure 6b. Since other software combinations are likely (e.g. SAS and Enterprise Miner, or SPSS Statistics and SPSS Modeler), I combined a few others as well.

Figure 6b. The percent of survey respondents who checked each package as their primary tool. Note that free and commercial versions of KNIME and RapidMiner are combined. Multiple tools from the same company are also combined. Only the top 10 are shown.

In this view we see R even more dominant, with over a 3-to-1 advantage compared to the software from IBM SPSS and SAS Institute. The overall ranking of the top three didn’t change, but KNIME rises from 9th place to 4th. RapidMiner rises as well, from 10th place to 6th. KNIME has roughly a 2-to-1 lead over RapidMiner, even though these two packages have similar capabilities and both use a workflow user interface. This may be due to RapidMiner’s move to a more commercially oriented licensing approach. For free, you can still get an older version of RapidMiner or a version of the latest release that is quite limited in the types of data files it can read. Even the academic license for RapidMiner is constrained by the fact that the company views “funded activity” (e.g. research done on government grants) the same as commercial work. The KNIME license is much more generous as the company makes its money from add-ons that increase productivity, collaboration and performance, rather than limiting analytic features or access to popular data formats.

Goals for the New R Consortium

by Bob Muenchen

The recently created R Consortium consists of companies that are deeply involved in R such as RStudio, Microsoft/Revolution Analytics, Tibco, and others. The Consortium’s goals include advancing R’s worldwide promotion and support, encouraging user adoption, and improving documentation and tools. Those are admirable goals, and below I suggest a few specific examples that the Consortium might consider tackling.

R Consortium

As I work with various organizations to help them consider migrating to R, common concerns are often raised. With thousands of packages to choose from, where do I start? Do packages go through any reliability testing? What if I start using a package and its developer abandons it? These, and others, are valid concerns that the R Consortium could address.

Choosing Packages

New R users face a daunting selection of thousands of packages. Some guidance is provided by CRAN’s Task Views. In R’s early years, this area was quite helpful in narrowing down a package search. However, R’s success has decreased the usefulness of Task Views. For example, say a professor asks a grad student to look into doing a cluster analysis. In SAS, she’ll have to choose among seven procedures. When considering the Task View on the subject, she’ll be presented with 105 choices in six categories! The greater selection is one of R’s strengths, but to encourage the adoption of R by a wider community it would be helpful to list the popularity of each package. The more popular packages are likely to be the most useful.

R functions are integrated into other software such as Alteryx, IBM SPSS Statistics, KNIME, and RapidMiner. Some are also called from R user interfaces such as Deducer, R Commander, and RATTLE. Within R, some packages depend on others, adding another vote of confidence. The R Consortium could help R users by documenting these various measures of popularity, perhaps creating an overall composite score.
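
One way such a composite score might be computed is sketched below in R. The package names are real clustering packages, but every count is invented for illustration, and the choice of three measures (downloads, reverse dependencies, and integrations into other tools) and their equal weighting are assumptions, not anything the Consortium has proposed.

```r
# Hypothetical popularity measures for four clustering packages.
# All numbers are invented for illustration only.
pkgs <- data.frame(
  package      = c("cluster", "mclust", "fpc", "dbscan"),
  downloads    = c(90000, 40000, 25000, 15000),  # made-up monthly downloads
  reverse_deps = c(300, 120, 60, 40),            # made-up reverse dependency counts
  integrations = c(4, 2, 1, 1)                   # made-up count of GUIs/tools exposing it
)

# Rescale each measure to the 0-1 range so no single measure dominates
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

measures <- c("downloads", "reverse_deps", "integrations")
scaled <- as.data.frame(lapply(pkgs[measures], rescale01))

# Equal-weight composite: the mean of the rescaled measures
pkgs$composite <- rowMeans(scaled)

# Most popular first
ranked <- pkgs[order(-pkgs$composite), c("package", "composite")]
print(ranked)
```

A real version would have to decide how to weight the measures and how to handle packages that are missing one of them. Equal weights are simply the easiest starting point.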

Accuracy

People often ask how they can trust the accuracy (or reliability) of software that is written by a loosely knit group of volunteers, when there have even been notable lapses in the accuracy of commercial software developed by corporate teams [1]. Base R and its “recommended packages” are very well tested, and the details of the procedures are documented in The R Software Development Life Cycle. That set of software is substantial, the equivalent of Base SAS + GRAPH + STAT + ETS + IML + Enterprise Miner (excluding GUIs, Structural Equation Modeling, and Multiple Imputation, which are in add-on packages). Compared to SPSS, it’s the rough equivalent to IBM SPSS Base + Statistics + Advanced Stat. + Regression + Forecasting + Decision Trees + Neural Networks + Bootstrapping.

While that set is very capable, it still leaves one wondering about all the add-on packages. Performing accuracy tests is very time-consuming work [2-5], and even changing the options on the same routine can affect accuracy [6]. Increasing the confidence that potential users have in R’s accuracy would help to increase the use of the software, one of the Consortium’s goals. So I suggest that they consider ways to increase the reliability testing of functions that are outside the main R packages.

Given the vast number of R packages available, it would be impossible for the Consortium to perform such testing on all packages. However, for widely used packages, it might behoove the Consortium to use its resources to develop such tests itself. A web page that referenced Consortium testing, as well as testing from any source, would be helpful.

Package Longevity

If enough of a package’s developers got bored and moved on or, more dramatically, were hit by the proverbial bus, development would halt. Base R and the recommended packages have the whole R Development Core Team backing them up. Other packages are written by employees of companies. In such cases, it is unclear whether the packages are supported by the company or by the individual developer(s).

Using the citation function will list a package’s developers. The more there are, the better chance there is of someone taking over if the lead developer moves on. The Consortium could develop a rating system that would provide guidance along these lines. Nothing lasts forever, but knowing the support level a package has would be of great help when choosing which to use.
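
As a minimal sketch of that lookup, the example below uses the stats package only because it ships with every R installation. For an add-on package you would substitute its name, and the author list would usually be more informative than the single “R Core Team” entry returned here.

```r
# citation() with no argument describes base R itself;
# pass a package name to describe an installed package instead.
cit <- citation("stats")

# The citation entry carries an author field holding "person" objects
authors <- cit$author
print(authors)

# Number of listed authors: a rough proxy for how many people
# could take over if the lead developer moves on
n_authors <- length(authors)
print(n_authors)

# maintainer() (also in utils) returns the current point of contact
print(maintainer("stats"))
```

The author count is only a rough guide, since a listed author may no longer be active on the package.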

Encourage Support and Use of Key Generic Functions

Some fairly new generic functions play a key role in making R easier to use. For example, David Robinson’s broom package contains functions that translate the output of modeling functions from list form into data frames, making output management much easier. Other packages, including David Dahl’s xtable and Philip Leifeld’s texreg, do a similar translation to present the output in nicely formatted forms for publishing. Those developers have made major contributions to R by writing all the methods themselves. The R Consortium could develop a list of such functions and encourage other developers to add methods to them, when appropriate. Such widely applicable functions could also benefit from having the R Consortium support their development, assuring longer package longevity and wider use.
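
To make the list-to-data-frame idea concrete, here is a small sketch that does by hand, in base R, roughly what broom’s tidy() automates for a linear model. The model and the built-in mtcars data are chosen only for illustration. The column names mirror those that tidy() returns, and the broom call at the end runs only if that package is installed.

```r
# Fit a simple model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The fitted model is a list, and its coefficient table is a matrix
print(is.list(fit))
coef_matrix <- coef(summary(fit))

# By hand: one row per term, with the statistics as ordinary columns
tidy_by_hand <- data.frame(
  term      = rownames(coef_matrix),
  estimate  = coef_matrix[, "Estimate"],
  std.error = coef_matrix[, "Std. Error"],
  statistic = coef_matrix[, "t value"],
  p.value   = coef_matrix[, "Pr(>|t|)"],
  row.names = NULL
)
print(tidy_by_hand)

# With broom installed, the equivalent call would be:
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::tidy(fit))
}
```

The hand-written version works only for lm objects. The value of a generic such as tidy() is that one call works across many model types, which is why each additional method contributed by a package developer matters.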

Output to Microsoft Word

R has the ability to create beautiful output in almost any format you would like, but it takes additional work. Its competition, notably SAS and SPSS, let you choose the font and full formatting of your output tables at installation. From then on, any time you want to save output to a word processor, it’s a simple cut & paste operation. SPSS even converts R output into fully formatted tables, which no current R IDE does. Perhaps the R Consortium could pool the resources needed to develop this kind of output. If so, it would be a key aspect of their goal of speeding R’s adoption. (I do appreciate the greater power of LaTeX and the ease of use of knitr and R Markdown, but they’ll never match the widespread use of Word.)

Graphical User Interface

Programming offers the greatest control over an analysis, but many researchers don’t analyze data often enough to become good programmers; many simply don’t like programming. Graphical User Interfaces (GUIs) help such people get their work done more easily. The traditional menu-based systems, such as R Commander or Deducer, make one-time work easy, but they don’t offer a way to do repetitive projects without relying on the code that non-programmers wish to avoid.

Workflow-based GUIs are also easy to use and, more importantly, they save all the steps as a flowchart. This allows you to check your work and repeat it on another data set simply by updating the data import node(s) and clicking “execute.” To take advantage of this approach, Microsoft’s Revolution R Enterprise integrates into Alteryx and KNIME, and Tibco’s Enterprise Runtime for R integrates into KNIME as well. Alteryx is a commercial package, and KNIME is free and open source on the desktop. While both have commercial partners, each can work with the standard community version of R as well.

Both packages contain many R functions that you can control with a dialog box. Both also allow R programmers to add a programming node in the middle of the workflow. Those nodes can be shared, enabling an organization to get the most out of both their programming and non-programming analysts. Both systems need to add more R nodes to be considered general-purpose R GUIs, but they’re making fairly rapid progress on that front. In each system, it takes less than an hour to add a node to control a typical R function.

The R Consortium could develop a list of recommended steps for developers to consider. One of these steps could be adding nodes to such GUIs. Given the open source nature of R, encouraging the use of the open source version of KNIME would make the most sense. That would not just speed the adoption of R, it would enable its adoption by the large proportion of analysts who prefer not to program. For the more popular packages, the Consortium could consider using their own resources to write such nodes.

Conclusion

The creation of the R Consortium offers an intriguing opportunity to expand the use of R around the world. I’ve suggested several potential goals for the Consortium: helping people choose packages, increasing reliability testing, rating package support levels, increasing the visibility of key generic functions, adding support for Word, and making R more accessible through stronger GUI support. What else should the R Consortium consider? Let’s hear your ideas in the comments section below.

Acknowledgements

Thanks to Drew Schmidt and Michael Berthold for their suggestions that improved this post.

References

  1. Micah Altman (2002), A Review of JMP 4.03 With Special Attention to its Numerical Accuracy, The American Statistician, 56:1, 72-75, DOI: 10.1198/000313002753631402
  2. B. D. McCullough (1998), Assessing the Reliability of Statistical Software: Part I, The American Statistician, 52:4, 358-366
  3. B. D. McCullough (1999), Assessing the Reliability of Statistical Software: Part II, The American Statistician, 53:2, 149-159
  4. Kellie B. Keeling, Robert J. Pavur (2007), A comparative study of the reliability of nine statistical software packages, Computational Statistics & Data Analysis, Vol. 51, Issue 8, pp. 3811-3831
  5. Oluwarotimi O. Odeh, Allen M. Featherstone and Jason S. Bergtold (2010), Reliability of Statistical Software, American Journal of Agricultural Economics, doi: 10.1093/ajae/aaq068
  6. Jason S. Bergtold, Krishna Pokharel and Allen Featherstone (2015), Selected Paper prepared for presentation at the 2015 Agricultural & Applied Economics Association and Western Agricultural Economics Association Annual Meeting, San Francisco, CA, July 26-28

Free Webinar: Intro to SparkR

Are you interested in combining the power of R and Spark? An “Intro to SparkR”
webinar will take place on July 15, 2015 at 10 am California time. Everyone is welcome
to attend.

Agenda:
– What is SparkR?
– Recent improvements to SparkR
– SparkR Roadmap
– Live Demo
– Q & A

Speaker:
Shivaram Venkataraman, Co-author of SparkR

Duration: 45-60 minutes

Cost: $0

Location: Internet

Registration:
https://attendee.gotowebinar.com/register/4761879673365920770
