The RSelenium R Package (Free Webinar)

The Orange County R User Group (OC-RUG) will soon host a free webinar on the new “RSelenium” R Package which provides a set of R bindings for the Selenium 2.0 webdriver using the JsonWireProtocol. Using RSelenium, you can automate web browsers locally or remotely in order to test various web apps, such as Shiny applications – for example.

Webinar Format:
- Introduction to the RSelenium R package
- Live Demonstration
- Question and Answer period

Date:  May 21, 2014 at 10 am Pacific (California) time
Speaker:
John Harrison, RSelenium package author/maintainer

For more information on the RSelenium package, please visit this site:

http://cran.r-project.org/web/packages/RSelenium

Please note that in addition to attending from your laptop or desktop computer, you can also attend from a Wi-Fi connected iPhone, iPad, Android phone or Android tablet by installing the GoToMeeting App.

Registration is below:

https://www3.gotomeeting.com/register/724626654

Posted in R, Statistics | Leave a comment

How Valuable is a #1 Ranking for Analytics Software? Not as Much as You Might Think!

In my never-ending quest to study the Popularity of Data Analysis Software, I recently read the 2013 Edition of the Wisdom of Crowds Business Intelligence Market Study by Dresner Advisory Services, LLC. In it, I found the table below which displays the “Wisdom of Crowds”, or what most people would call survey results.

Dresner-2013-5-20-Tableau

Wisdom of Crowds report showing Tableau with highest score.

As you can see, among high growth business intelligence vendors, Tableau comes out on top with a mean score of 4.40. Not long afterwards, I saw this press release that claimed, “TIBCO Spotfire Named the Leader in ‘High Growth Business Intelligence’ Market Segment. Spotfire Achieves ‘Best in Class’ in Wisdom of Crowds SME Business Intelligence Study.” That can’t be right, can it? I downloaded the report, expecting it to be for 2014. However, it was from the same year as the previous one I had read, 2013, and it contained the following familiar-looking table.

Table 1.

Wisdom of Crowds report showing Spotfire in first place.

To my surprise, Spotfire was now in first place, with Tableau in 3rd. Then I belatedly noticed the “SME” in Tibco’s headline. It turns out that the second report was a based on a subset of the first, selecting responses only from Small and Mid-sized Enterprises (SMEs). While the reports looked identical to me at first, the subset report does include “Small and Mid-sized Enterprises” right in its title. So both company claims are correct once you figure out who is being surveyed.

But what does this mean for the value of a #1 ranking from Dresner? The reports break the 23 vendors down into 5 groups: Titans, Large Established Pure-Play, High Growth, Specialized, and Emerging (they mention Early Stage ones too, but don’t rank them). This approach results in four or five vendors per table. By doing the report twice, once for all respondents and again for SMEs, there are 10 opportunities for companies to be #1 in any given year. In the two Dresner reports from 2013, 7 of the 23 companies are #1. Companies are more likely to purchase distribution rights to reports when they come out looking good. Since Dresner makes a living selling reports, that gives a whole new meaning to the term Business Intelligence!

Posted in Analytics, Graphics, Statistics | Leave a comment

JMBayes R package (webinar)

A free webinar will provide an introduction to the “JMBayes” R package which provides methods for Joint Modeling of Longitudinal and Time-to-Event Data under a Bayesian Approach.

Webinar Format:

- Introduction to Joint Models and the JMBayes R package
- Live demonstration
- Question and Answer period

Speaker:

- Dimitris Rizopoulos, JMBayes Package Maintainer

For more information on the JMBayes package, please visit this site:

http://cran.r-project.org/package=JMbayes

Please note that in addition to attending from your laptop or desktop computer, you can also attend from a Wi-Fi connected iPhone, iPad, Android phone or Android tablet by installing the GoToMeeting App.

Registration:

https://www3.gotomeeting.com/register/187219462

This event is brought to you by The Orange County R User Group.

Posted in Analytics, R, Statistics | Leave a comment

R Continues Its Rapid Growth

I’ve just updated the section below from The Popularity of Data Analysis Software. Note that the overall article is still under construction and all the figure numbers have changed from previous versions.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.

Figure 8 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.998). The right-most point is for version 3.0.2, the last version released in 2013.

Fig_8_CRAN

Figure 8. Number of R packages plotted for each major release of R.

As rapid as this growth has been, these data represent only the main CRAN repository. R does have eight other software repositories, such as the one at http://www.bioconductor.org/ that are not included in this graph. A program run on 4/7/2014 counted 7,364 R packages at all major repositories, 5,323 of which were at CRAN. So the growth curve for the software at all repositories would be roughly 38% higher on the y-axis than the one shown in Figure 8. As with any analysis software, individuals also maintain their own separate collections typically available on their web sites.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version, 9.3, SAS contains around 1,200  commands that are roughly equivalent to R functions (procs, functions etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, QC). R packages contain a median of 5 functions (Rasmus Bååth, 12/2012 personal communication). Therefore R has approximately 36,820 functions compared to SAS’s 1,200. In fact, during 2013 alone, R added more functions/procs than SAS Institute has written in its entire history! That’s 835 packages, counting only CRAN, or around 4,175 functions. Of course these are not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do. However, R functions can nest inside one another, creating nearly infinite combinations. Also, SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would be happy to list it here. While the comparison is not perfect, it does provide an interesting perspective on the size and growth rate of R.

Posted in Analytics, R, SAS, Statistics | Leave a comment

Learn to Manage Data at useR! 2014 or Online April 25

Before you can analyze data, it must be in the right form. Join me on April 25th for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr, lubridate and sqldf. I’ll also be presenting a 3-hour version at the UseR! 2014 conference. Here’s a list of the topics covered:

  1. Transformation basics
  2. Conditional transformations
  3. Summarization of columns and rows
  4. Summarization by group
  5. Analysis by group
  6. Sorting data
  7. Selecting first or last observation per group
  8. Miscellaneous variable tools (rename, keep, drop)
  9. Stacking data frames
  10. Finding and removing duplicate observations
  11. Merging data frames
  12. Reshaping data frames
  13. Character string manipulations
  14. Date / time manipulations (not in shorter useR! presentation)
  15. Using SQL within R (not in shorter useR! presentation)

R--113Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review what we did later with full explanations, or to learn more about a particular subject by extending an example which you have already seen.

At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!

Posted in Analytics, Data Mangement, R, Statistics | Leave a comment

SAS, SPSS, Stata Users: Learn R from Home April 21

R--67

Has learning R been driving you a bit crazy? If so, it may be that you’re “lost in translation.” On April 21 and 23, I’ll be teaching a webinar, R for SAS, SPSS and Stata Users. With each R concept, I’ll introduce it using terminology that you already know, then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way if you need more explanation later or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology so you can use either to find what you need. A complete outline of the workshop plus a registration link is here.

Posted in Analytics, R, SAS, SPSS, Stata, Statistics | 2 Comments

Analytics Software Popularity Update: Counting Blogs, Simplifying Job Searches

My latest update to The Popularity of Data Analysis Software is an attempt to use blog counts to estimate the popularity of analytics software. While I was able to greatly broaden the coverage of packages when studying job data, I made very little progress on the blog measure, adding new coverage for only Python and updating the previous counts for Stata. I post the results here mostly as a request for input from people who may know of more sources of blog lists than I have found so far.

I’ve also updated the jobs data slightly both in the main article and in the background one, How to Search for Analytics Jobs. While the changes to the search algorithm are greatly simplified, but worth reading only by people who are doing their own searches. Rather than jump through hoops to estimate total jobs for each software, I only count those for the main set of search terms.  The relative results from the new search algorithm are nearly identical to the previous, more complex one (r = .99).

Here’s the update on blogs:

Blogs

On Internet blogs, people write about software that interests them, showing how to solve problems and interpreting events in the field. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. Therefore, the number of bloggers writing about analytics software has potential as a measure of popularity or market share. Unfortunately, counting the number of relevant blogs is often a difficult task. General purpose software such as Java, Python, the C language variants and MATLAB have many more bloggers writing about general programming topics than just analytics. But separating them out isn’t easy. The name of a blog and the title of its latest post may not give you a clue that it routinely includes articles on analytics.

Another problem arises from the fact that what some companies would write up as a newsletter, others would do as a set of blogs, where several people in the company each contribute their own blog, but they’re also combined into a single company blog. Statsoft and Minitab offer examples of this. What’s really interesting is not company employees who are assigned to write blogs, but rather volunteers who freely provide their time.

In a few lucky cases, lists of such blogs are maintained, usually by blog consolidators, who combine many blogs into large “metablogs.” All I have to do is find such lists and count the blogs. I don’t attempt to extract the few vendor employees that I know are blended into such lists. However, I skip those lists that are exclusively employee-based (or very close to it). The results are shown in Table 1.

         Number
Software of Blogs Source
R         452     R-Bloggers.com
Python     60     SciPy.org
SAS        40     PROC-X.com, sasCommunity.org Planet
Stata      11     Stata-Bloggers.com

Table 1. Number of blogs devoted to each software package on March 5, 2014, and the source of the data.

R’s 452 blogs is quite an impressive number. For Python, I could only find that list of 60 that were devoted to the SciPy subroutine library. Some of those are likely cover topics besides analytics, but to determine which never cover the topic would be quite time consuming. The 40 blogs about SAS is still an impressive figure given that Stata was the only other software that even garnered a list anywhere. That list is at the vendor itself, StataCorp, but it consists of non-employees except for one.

While searching for lists of blogs on other software, I did find individual blogs that at least occasionally covered a particular topic. However, keeping this list up to date is far too time consuming given the relative ease with which other popularity measures are collected.

If you know of other lists of relevant blogs, please let me know and I’ll add them. If you’re a software vendor CEO reading this, and your company does not build a metablog or at least maintain a list of your bloggers, I recommend taking advantage of this important source of free publicity.

Posted in Analytics, R, SAS, Stata, Statistics | 3 Comments

r4stats.com 2013 in review

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 150,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 6 days for that many people to see it.

Click here to see the complete report.

Posted in Uncategorized | Leave a comment

Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!

I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…

Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution R Enterprise, or SAS vs. the World Programming System (WPS) or Carolina, except when it is particularly easy such as comparing the company Pagerank figures.

These packages are all included in the first section on jobs, but later sections are older (each contains a date) and do not cover an as extensive set of software. I’ll add those as I can and announce the changes on Twitter where you can follow me as @BobMuenchen.

Introduction

When choosing a tool for data analysis, now more broadly referred to as analytics, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste into a word processor vs. LaTeX integration)? Does it handle large enough data sets?  Do your colleagues use it so you can easily share data and programs? Can you afford it?

There are many ways to measure popularity or market share and each has its advantages and disadvantages. Here they are, in approximate order of usefulness:

  • Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
  • Published Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
  • Books – the number of books that include a software’s name in its title is a particularly useful information since it requires a significant effort to write one and publishers do their own study of market share before taking the risk of publishing. However, it can be difficult to do searches to find books that use general-purpose languages which also focus only on analytics.
  • Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. What makes this measure particularly easy to gather is that consolidators like Tal Galili have created blog consolidation sites like R-Bloggers.com which make it easy to count the blogs. Previously that had been a difficult task.
  • Web Site Popularity – how does Google provide the most popular search results at the top of its response to your queries? A major component of that answer comes from the total number of web pages that point to any given web site. That’s known as a site’s PageRank. This is objective data, and for sites that clearly focus on analytics, it’s unbiased. However, for general-purpose software like Java, many sites that discuss programming point to http://www.java.com, and probably fewer that discuss analytics point to it as well. But it may be impractical to tell which is which.
  • Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling” in which the survey taker tries to widely distribute the link and then vendors vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the information is often low, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires but a few mouse clicks, rather than the commitment required to place an advertisement or publish an article.
  • Programming Activity – some software development is focused into repositories such as GitHub. That allows people to count the number lines of programming code done for each project in a given time period. This is an excellent measure of popularity since writing programs or changing them requires substantial commitment.
  • Discussion Forums – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others.
  • Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
  • IT Research Firms – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software is now and where they’re headed.
  • Sales or Download Measures – the commercial analytics field has undergone a major merger and acquisition phase so that now it is hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role and even the few packages that offer download figures are dicey at best.
  • Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t track this measure and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.

Job Advertisements

One of the best ways to measure the popularity or market share of software for analytics is to count the number of job advertisements for each. Indeed.com is the biggest job site in the U.S. making its sample the most representative of the current job market. As their  CEO and co-founder Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, Careerbuilder, Hotjobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” To demonstrate just how dominant its lead is, a search for SPSS (on 2/19/14) showed more than ten times as many jobs on Indeed.com as on its well-known competitor, Monster.com. Indeed.com also has superb search capabilities and it even includes a tool for tracking long-term trends.

Searching for analytics jobs using Indeed.com can be easy, but it can also be very tricky. For many of the analytics software that required only a simple search on its name. However, for software that’s hard to locate (e.g. R) or that is general purpose (e.g. Java) it required complex searches and/or some rather tricky calculations which are described here. All of the graphs in this section use those procedures to make the required queries.

Figure 1a shows that Java and SAS are in a league of their own, with around 50% more analytics jobs than Python or C, C++/C# and twice as much as R. (The three aforementioned C variants are combined in a single search since job advertisements usually seek any of them). Python and C/C++/C#  come next at an almost identical level of popularity. That’s not too surprising as many advertisements for analytics jobs that use programming mention both together.

Figure 1a. The number of analytics jobs for the more popular software (2/2014).

Figure 1a. The number of analytics jobs for the more popular software (2/2014).

R resides in an interestingly large gap between the other domain-specific languages, SAS and SPSS. This is the first estimate I’ve done that shows that the job market for R has not only caught up with SPSS, but surpassed it by close to double the number of job postings. I knew my previous estimates for R jobs was low, but I had not yet thought of a better way to estimate the total. From SPSS on down, there’s a smooth decline. Enterprise Miner is the only data-mining-specific software to make the cutoff of at least 100 jobs. If I plotted all the software below that point, they would all pile up on the y-axis, appearing to have almost no jobs. Relatively speaking, they don’t!

Software that did not make that cut and are not displayed on the graph are: Alteryx (68), Statistica (67), RapidMiner (38), SPSS Modeler (36), KXEN (28), KNIME (26), Julia (15), Statgraphics (11), Systat (10), BMDP (8), Angos (6), Lavastorm (5), NCSS (4), Salford SPM etc. (3), Teradata Miner (2) and Oracle Data Mining (2).

It’s important to note that the values shown in Figure 1a are single points in time. The number of jobs for the more popular software do not change much from day to day, but each software has an overall trend that shows how the demand for jobs changes across the years. You can plot such trends using Indeed.com’s Job Trends tool. However, as before, focusing just on analytics jobs requires carefully constructed queries, and when comparing two trends at a time means they both have to fit in the same query limit allowed by Indeed.com. Those details are described here.

I’m particularly interested in trends involving R, so let’s look at a couple of comparisons. Figure 1b compares the number of analytics jobs available for R and SPSS across time. Analytics jobs for SPSS have not changed much over the years, while those for R have been steadily increasing. The jobs for R finally crossed over and exceeded those for SPSS toward the middle of 2012.

Fig_1b_RvSPSS_2014-2-22

Figure 1b. Analytics job trends for R and SPSS. Note that the legend labels are truncated due to the very long size of the query.

We know from Figure 1a that SAS is still far ahead of R in analytics job postings. How far does R have to go to catch up with SAS? Figure 1c provides one perspective. It would be nice to have the data to forecast when R’s growth curve will catch up with SAS’s, but Indeed.com does not provide the raw data. However, we can use the approximate slope of each line to get a rough estimate. If jobs for SAS stay level and those for R continue to grow linearly as they have since January 2010, then R will catch up in 3.35 years. If instead the demand for SAS jobs that started in January of 2012 continues, then R will catch up in 1.87 years.

Figure 1c. The trend in analytics jobs for R and SAS.

Figure 1c. Analytics job trends for R and SAS. Legend labels are truncated due to long query length.

A debate has been taking place on the Internet regarding the relative place of Python and R. Ironically, this debate about software to do data analytics has involved very little actual data. However it is possible now to at least study the job trends. Figure 1a showed us that Python is well out in front of R, at least on that single day the searches were run. What has the data looked like over time? The answer is in Figure 1d.

Figure 1c. Jobs trends for R and Python (2/22/14).

Figure 1d. Jobs trends for R and Python. Legend labels are truncated due to long query length.

Note that in this graph, Python appears to have a relatively slight advantage while in Figure 1a it had a huge one. The final point on the trend graph was done only two days after the queries used in Figure 1a, and that data changed very little in the meantime. The difference is due to the fact that Indeed.com has a limit on query length. Here is the query used for Figure 1c, and the analytic terms it contains were fewer than the one used for Figure 1a.

R 
and ("big data"
or "statistical analysis"
or "data mining"
or "data analytics"
or "machine learning"
or "quantitative analysis"
or "business analytics"
or "statistical software"
or "predictive modeling")
!"R D" !"A R" !"H R" !"R N" 
!toys !kids !" R Walgreen" !walmart
!"HVAC R" !"R Bard" 
,
python
and ("big data"
or "statistical analysis"
or "data mining"
or "data analytics"
or "machine learning"
or "quantitative analysis"
or "business analytics"
or "statistical software"
or "predictive modeling")

The detailed description regarding the construction of all the queries used in Figures 1a through 1c is located here.

==============================================================

At this point, the rest of The Popularity of Data Analysis Software will continue, offering many additional perspectives on measuring analytics market share. However, until I update those sections in the coming months, they will not cover as broad a range of software. Stay tuned on Twitter, by following @BobMuenchen.

If you know SAS, SPSS or Stata and have not yet learned R, you can join me for this web-based workshop aimed at translating your knowledge into R. The next workshop begins on April 21. If you do know R and would like to learn more, you might enjoy taking Managing Data with R. The next time I’m offering that is on April 25.

Posted in Analytics, R, SAS, SPSS, Stata, Statistics, Uncategorized | 16 Comments

Learn R and/or Data Management from Home January or April

R--67

If you want to learn R, or improve your current R skills, join me for two workshops that I’m offering through Revolution Analytics in January and April.

If you already know another analytics package, the workshop, Intro to R for SAS, SPSS and Stata Users may be for you. With each R concept, I’ll introduce it using terminology that you already know,  then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way if you need more explanation later, or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology so you can use either to find what you need. You can see a complete out line and register for the workshop starting January 13 (click here) or April 21 (click here).

If you already know R, but want to learn more about how you can use R to get your data ready to analyze, the workshop Managing Data with R will demonstrate how to use the 15 most widely used data management tasks. The course outline and registration is available here for January and here for April.

If you have questions about any of these courses, drop me a line a muenchen.bob@gmail.com. I’m always available to answer questions regarding any of my books or workshops.

Posted in Analytics, Data Mangement, R, SAS, SPSS, Statistics | Tagged , | 2 Comments