Gartner’s 2018 Take on Data Science Tools

I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2018 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the new section:

IT Research Firms

IT research firms study software products and corporate strategies, survey customers regarding their satisfaction with the products and services, and then provide their analysis of each in reports they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. While these reports focus on companies, they often also describe how their commercial tools integrate open source tools such as R, Python, H2O, TensorFlow, and others.

While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such free copies.

Gartner, Inc. is one of the companies that provides such reports.  Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue, or lower revenue combined with high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Hereafter, I refer to these as simply vision and ability. Figure 3a shows the resulting “Magic Quadrant” plot for 2018, and 3b shows the plot for the previous year.

The Leader’s Quadrant is the place for companies that have a future direction in line with their customers’ needs and the resources to execute that vision. The further to the upper-right corner, the better the combined score. KNIME is in the prime position, with H2O.ai showing greater vision but lower ability to execute. This year KNIME gained the ability to run H2O.ai algorithms, so these two may be viewed as complementary tools rather than outright competitors.

Alteryx and SAS have nearly the same combined scores, but note that Gartner studied only SAS Enterprise Miner and SAS Visual Analytics. The latter includes Visual Statistics, and Visual Data Mining and Machine Learning. Excluded was the SAS System itself since Gartner focuses on tools that are integrated. This lack of integration may explain SAS’ decline in vision from last year.

KNIME and RapidMiner are quite similar tools, as both are driven by an easy-to-use, reproducible workflow interface. Both offer free and open source versions, but the companies differ quite a lot in how committed they are to the open source concept. KNIME’s desktop version is free and open source, and the company says it will always be so. On the other hand, RapidMiner’s free version is limited by a cap on the amount of data that it can analyze (10,000 cases), and new features usually arrive only via a commercial license. In the previous year’s Magic Quadrant, RapidMiner was slightly ahead, but now KNIME is in the lead.

Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms
Figure 3b. Gartner Magic Quadrant for Data Science Platforms 2017.

The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision. Of these, IBM took a big hit by landing here after being in the Leader’s Quadrant for several years. Now they’re in a near-tie with Microsoft and Domino. Domino shot up from the bottom of that quadrant toward the top. They integrate many different open source and commercial software packages (e.g. SAS, MATLAB) into their Domino Data Science Platform. Databricks and Dataiku offer cloud-based analytics similar to Domino, though lacking in access to commercial tools.

Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision. Mathworks, the maker of MATLAB, continues to “stay the course” with its proprietary tools while most of the competition offers much better integration into the ever-expanding universe of open source tools. Tibco replaces Quest in this quadrant due to their purchase of Statistica. Whatever will become of the red-headed stepchild of data science? Statistica has been owned by four companies in four years (Statsoft, Dell, Quest, Tibco)! Users of the software have got to be considering other options. Tibco also purchased Alpine Data in 2017, accounting for its disappearance from Figure 3b to 3a.

Members of the Niche Players quadrant offer tools that are not as broadly applicable. Anaconda is new to Gartner coverage this year. It offers in-depth support for Python. SAP has a toolchain that Gartner calls “fragmented and ambiguous.”  Angoss was recently purchased by Datawatch. Gartner points out that after 20 years in business, Angoss has only 300 loyal customers. With competition fierce in the data science arena, one can’t help but wonder how long they’ll be around. Speaking of deathwatches, once the king of Big Data, Teradata has been hammered by competition from open source tools such as Hadoop and Spark. Teradata’s net income was higher in 2008 than it is today.

As of 2/26/2018, RapidMiner is giving away copies of the Gartner report here.

jamovi for R: Easy but Controversial

[An updated version of this post is located here.]

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally,  since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternately, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts it immediately back to numeric. I missed some features such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it offsets the confidence intervals for each workshop automatically to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which (oddly enough) adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections. However, the settings in the dialog boxes are not currently saved from session to session. Since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains: the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions would look like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

By contrast, the same comparison using the jamovi GUI, or its jmv package, would look like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

library("jmv")
ttestIS(
 data = mydata100,
 vars = c("pretest", "posttest"),
 group = "gender",
 mann = TRUE,
 meanDiff = TRUE)

In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
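For comparison, here is a minimal sketch of R’s piecemeal style for a linear regression. It uses the built-in mtcars data, which is my choice for illustration and not a data set from this post. Each of the five functions named above handles one step:

```r
# Piecemeal linear regression in base R, using the built-in mtcars data
# (an illustrative data set, not one used elsewhere in this post).
fit <- lm(mpg ~ wt, data = mtcars)   # 1. estimate the equation parameters

fit_summary <- summary(fit)          # 2. t-tests on the parameters
print(fit_summary)

fit_anova <- anova(fit)              # 3. ANOVA table
print(fit_anova)

predicted <- predict(fit)            # 4. predicted values, one per row
print(head(predicted))

# 5. diagnostic plots, drawn four to a page on the current graphics device
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```

Each step returns its own object, which is what makes the piecemeal style flexible for further processing, and also what makes it harder to remember for occasional users.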

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus and R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), you can install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given their overall goal.

Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).
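In plain R, this kind of correction takes only a few lines. The sketch below is an illustration under my own assumptions, not a description of how jamovi or R Commander implement it: it runs Welch t-tests on several variables from the built-in mtcars data, grouped by the am variable, and then adjusts the resulting p-values with base R’s p.adjust() using Holm’s method.

```r
# Run the same t-test on several outcome variables, then correct the
# p-values for multiple testing. Data and variable names are illustrative.
outcomes <- c("mpg", "hp", "wt", "qsec")

raw_p <- sapply(outcomes, function(v) {
  # Build the formula "<outcome> ~ am" and run a Welch two-sample t-test
  t.test(reformulate("am", response = v), data = mtcars)$p.value
})

# Holm's step-down correction; other methods include "bonferroni" and "BH"
adjusted_p <- p.adjust(raw_p, method = "holm")

result <- data.frame(
  variable   = outcomes,
  raw_p      = round(raw_p, 4),
  adjusted_p = round(adjusted_p, 4),
  row.names  = NULL
)
print(result)
```

The adjusted values are never smaller than the raw ones, which makes the cost of testing several variables at once visible in the output.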

Though jamovi is only at version 0.8.1.2.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready” and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combine the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!

Acknowledgements

Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.

Data Science Tool Market Share Leading Indicator: Scholarly Articles

Below is the latest update to The Popularity of Data Science Software. It contains an analysis of the tools used in the most recent complete year of scholarly articles. The section is also integrated into the main paper itself.

New software covered includes: Amazon Machine Learning, Apache Mahout, Apache MXNet, Caffe, Dataiku, DataRobot, Domino Data Labs, GraphPad Prism, IBM Watson, Pentaho, and Google’s TensorFlow.

Software dropped includes: Infocentricity (acquired by FICO), SAP KXEN (tiny usage), Tableau, and Tibco. The latter two didn’t fit in with the others due to their limited selection of advanced analytic methods.

Scholarly Articles

Scholarly articles provide a rich source of information about data science tools. Their creation requires significant amounts of effort, much more than is required to respond to a survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even an object of study.

Since graduate students do the great majority of analysis in such articles, the software used can be a leading indicator of where things are headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. Searching through concise job requirements (see previous section) is easier than searching through scholarly articles; however only software that has advanced analytical capabilities can be studied using this approach. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.  Since Google regularly improves its search algorithm, each year I re-collect the data for the previous years.

Figure 2a shows the number of articles found for the more popular software packages (those with at least 750 articles) in the most recent complete year, 2016. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 6/8/2017.

SPSS is by far the most dominant package, as it has been for over 15 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. SAS is in third place, still maintaining a substantial lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied. This is the first year that I’ve tracked Prism, a package that emphasizes graphics but also includes statistical analysis capabilities. It is particularly popular in the medical research community where it is appreciated for its ease of use. However, it offers far fewer analytic methods than the other software at this level of popularity.

Note that the general-purpose languages: C, C++, C#, FORTRAN, MATLAB, Java, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.

Figure 2a. Number of scholarly articles found in the most recent complete year (2016) for the more popular data science software. To be included, software must be used in at least 750 scholarly articles.

The next group of packages goes from Apache Hadoop through Python, Statistica, Java, and Minitab, slowly declining as they go.

Both Systat and JMP are packages that have been on the market for many years, but which have never made it into the “big leagues.”

From C through KNIME, the counts appear to be near zero, but keep in mind that each is used in at least 750 journal articles. However, compared to the 86,500 that used SPSS, they’re a drop in the bucket.

Toward the bottom of Fig. 2a are two similar packages, the open source Caffe and Google’s TensorFlow. These two focus on “deep learning” algorithms, an area that is fairly new (at least the term is) and growing rapidly.

The last two packages in Fig. 2a are RapidMiner and KNIME. It has been quite interesting to watch the competition between them unfold for the past several years. They are both workflow-driven tools with very similar capabilities. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, SPSS and SAS. Given that SPSS has roughly 75 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newcomers is growing, while use of the older packages is shrinking quite rapidly. This plot shows RapidMiner with nearly twice the usage of KNIME, despite the fact that KNIME has a much more open source model.

Figure 2b shows the results for software used in fewer than 750 articles in 2016. This change in scale allows room for the “bars” to spread out, letting us make comparisons more effectively. This plot contains some fairly new software whose use is low but growing rapidly, such as Alteryx, Azure Machine Learning, H2O, Apache MXNet, Amazon Machine Learning, Scala, and Julia. It also contains some software that has either declined from one-time greatness, such as BMDP, or is stagnating at the bottom, such as Lavastorm, Megaputer, NCSS, SAS Enterprise Miner, and SPSS Modeler.

Figure 2b. The number of scholarly articles for the less popular data science software (those used by fewer than 750 scholarly articles in 2016).

While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time consuming. What I’ve done instead is collect data only for the past two complete years, 2015 and 2016. This provides the data needed to study year-over-year changes.

Figure 2c shows the percent change across those years, with the “hot” packages whose use is growing shown in red (right side); those whose use is declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 500 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth, but is still of little interest.
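The year-over-year figure is a simple percent change. As a small sketch (pct_change is a helper name I made up for illustration, and the second pair of counts is invented), the calculation also shows why tiny starting counts are misleading:

```r
# Percent change from one year's article count to the next.
# pct_change() is a hypothetical helper, not part of any package.
pct_change <- function(old, new) {
  100 * (new - old) / old
}

# A package going from 1 article to 5: a huge percentage, 4 more articles
tiny <- pct_change(1, 5)

# Made-up counts for a widely used package: a modest percentage,
# but 150 more articles
large <- pct_change(1000, 1150)

print(c(tiny = tiny, large = large))
```

Filtering on a minimum starting count, as done for Figure 2c, keeps the first kind of case from dominating the plot.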

 

Figure 2c. Change in the number of scholarly articles using each software in the most recent two complete years (2015 to 2016). Packages shown in red are “hot” and growing, while those shown in blue are “cooling down” or declining.

Caffe is the data science tool with the fastest growth, at just over 150%. This reflects the rapid growth in the use of deep learning models in the past few years. The similar products Apache MXNet and H2O also grew rapidly, but they were starting from a mere 12 and 31 articles respectively, and so are not shown.

IBM Watson grew 91%, which came as a surprise to me as I’m not quite sure what it does or how it does it, despite having read several of IBM’s descriptions about it. It’s awesome at Jeopardy though!

While R’s growth was a “mere” 14.7%, it was already so widely used that the percent translates into a very substantial count of 5,300 additional articles.

In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot we also see that it’s continuing to pull away from KNIME with quicker growth.

From Minitab on down, the software is losing market share, at least in academia. The variants of C and Java are probably losing out a bit to competition from several different types of software at once.

In just the past few years, Statistica was sold by Statsoft to Dell, then Quest Software, then Francisco Partners, then Tibco! Did its declining usage drive those sales? Did the game of musical chairs scare off potential users? If you’ve got an opinion, please comment below or send me an email.

The biggest losers are SPSS and SAS, both of which declined in use by 25% or more. Recall that Fig. 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use.

I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.

Figure 2d. The number of scholarly articles found in each year by Google Scholar. Only the top six “classic” statistics packages are shown.

As in Figure 2a, SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.

Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 46 out of over 100 data science tools. SQL and Microsoft Excel could be taking up some of the slack, but it is extremely difficult to focus Google Scholar’s search on articles that used either of those two specifically for data analysis.

Since SAS and SPSS dominate the vertical space in Figure 2d by such a wide margin, I removed those two curves, leaving only two points of SAS usage in 2015 and 2016. The result is shown in Figure 2e.

 

Figure 2e. The number of scholarly articles found in each year by Google Scholar for classic statistics packages after the curves for SPSS and SAS have been removed.

Freeing up so much space in the plot allows us to see that the growth in the use of R is quite rapid and is pulling away from the pack. If the current trends continue, R will overtake SPSS to become the #1 software for scholarly data science use by the end of 2018. Note however, that due to changes in Google’s search algorithm, the trend lines have shifted before as discussed here. Luckily, the overall trends on this plot have stayed fairly constant for many years.

The rapid growth in Stata use seems to be finally slowing down.  Minitab’s growth has also seemed to stall in 2016, as has Systat’s. JMP appears to have had a bit of a dip in 2015, from which it is recovering.

The discussion above has covered but one of many views of software popularity or market share. You can read my analysis of several other perspectives here.

Dueling Data Science Surveys: KDnuggets & Rexer Go Live

What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.

For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll.  Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.

Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use.  It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors.  You can take the Rexer Analytics survey here (use Access Code M7UY4).  Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.

As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).

 

 

Group-By Modeling in R Made Easy

There are several aspects of the R language that make it hard to learn, and repeating a model for groups in a data set used to be one of them. Here I briefly describe R’s built-in approach, show a much easier one, then refer you to a new approach described in the superb book,  R for Data Science, by Hadley Wickham and Garrett Grolemund.

For ease of comparison, I’ll use some of the same examples in that book. The gapminder data set contains a few measurements for countries around the world every five years from 1952 through 2007.

> library("gapminder")
> gapminder

# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952  28.801  8425333  779.4453
 2 Afghanistan Asia       1957  30.332  9240934  820.8530
 3 Afghanistan Asia       1962  31.997 10267083  853.1007
 4 Afghanistan Asia       1967  34.020 11537966  836.1971
 5 Afghanistan Asia       1972  36.088 13079460  739.9811
 6 Afghanistan Asia       1977  38.438 14880372  786.1134
 7 Afghanistan Asia       1982  39.854 12881816  978.0114
 8 Afghanistan Asia       1987  40.822 13867957  852.3959
 9 Afghanistan Asia       1992  41.674 16317921  649.3414
10 Afghanistan Asia       1997  41.763 22227415  635.3414
# ... with 1,694 more rows

Let’s create a simple regression model to predict life expectancy from year. We’ll start by looking at just New Zealand.

> library("tidyverse")
> nz <- filter(gapminder, 
+              country == "New Zealand")
> nz_model <- lm(lifeExp ~ year, data = nz)
> summary(nz_model)

Call:
lm(formula = lifeExp ~ year, data = nz)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.28745 -0.63700  0.06345  0.64442  0.91192 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -307.69963   26.63039  -11.55 4.17e-07 ***
year           0.19282    0.01345   14.33 5.41e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8043 on 10 degrees of freedom
Multiple R-squared: 0.9536, Adjusted R-squared: 0.9489 
F-statistic: 205.4 on 1 and 10 DF, p-value: 5.407e-08

If we had just a few countries, and we wanted to simply read the output (rather than processing it further) we could write a simple function and apply it using R’s built-in by() function. Here’s what that might look like:

my_lm <- function(df) {
  summary(lm(lifeExp ~ year, data = df))
}
by(gapminder, gapminder$country, my_lm)
...
----------------------------------------------- 
gapminder$country: Zimbabwe

Call:
lm(formula = lifeExp ~ year, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.581  -4.870  -0.882   5.567  10.386 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 236.79819  238.55797   0.993    0.344
year         -0.09302    0.12051  -0.772    0.458

Residual standard error: 7.205 on 10 degrees of freedom
Multiple R-squared: 0.05623, Adjusted R-squared: -0.03814 
F-statistic: 0.5958 on 1 and 10 DF, p-value: 0.458

Since we have so many countries, that wasn’t very helpful. Much of the output scrolled out of sight (I’m showing only the results for the last one, Zimbabwe). But in a simpler case, that might have done just what you needed. It’s a bit more complex than how SAS or SPSS would do it since it required the creation of a function, but it’s not too difficult.

In our case, it would be much more helpful to save the output to a file for further processing. That’s when things get messy. We could use the str() function to study the structure of the output, then write another function to extract the pieces we need, then apply that function, then continue to process the result until we finally end up with a useful data frame of results. Altogether, that is a lot of work! To make matters worse, what you learned from all that is unlikely to generalize to a different function. The output’s structure, parameter names, and so on, are often unique to each of R’s modeling functions.
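To give a feel for that manual route, here is a minimal sketch that uses only base R functions plus the gapminder data. The helper name extract_fit() and the choice of which pieces to keep are my own, purely for illustration; a different modeling function would generally need a different extractor.

```r
library("gapminder")

# Hypothetical helper: fit one model and pull out a few pieces by hand.
extract_fit <- function(df) {
  fit_summary <- summary(lm(lifeExp ~ year, data = df))
  data.frame(
    intercept = fit_summary$coefficients["(Intercept)", "Estimate"],
    slope     = fit_summary$coefficients["year", "Estimate"],
    r.squared = fit_summary$r.squared
  )
}

# Split the data by country, apply the helper, then stack the one-row pieces.
pieces  <- lapply(split(gapminder, gapminder$country), extract_fit)
results <- do.call(rbind, pieces)

head(results)
```

Even this small version required knowing where summary() stores the coefficients and the R-squared, which is exactly the kind of detail that differs from one modeling function to the next.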

Luckily, David Robinson made a package called broom that simplifies all that. It has three ways to “clean up” a model, each diving more deeply into its details. Let’s see what it does with our model for New Zealand.

> library("broom")
> glance(nz_model)

  r.squared adj.r.squared sigma statistic p.value df
1 0.9535846 0.9489431 0.8043472 205.4459 5.407324e-08 2

   logLik      AIC    BIC   deviance df.residual
1 -13.32064 32.64128 34.096 6.469743     10

The glance() function gives us information about the entire model, and it puts it into a data frame with just one line of output. As we head towards doing a model for each country, you can imagine this will be a very convenient format.

To get a bit more detail, we can use broom’s tidy() function to clean up the parameter-level view.

> tidy(nz_model)
        term     estimate  std.error statistic p.value
1 (Intercept) -307.699628 26.63038965 -11.55445 4.166460e-07
2 year           0.192821  0.01345258  14.33339 5.407324e-08

Now we have a data frame with two rows, one for each model parameter, but getting this result was just as simple to do as the previous example.

The greatest level of model detail is provided by broom’s augment() function. This function adds observation-level detail to the original model data:

> augment(nz_model)

 lifeExp year  .fitted   .se.fit     .resid
1 69.390 1952 68.68692 0.4367774 0.70307692
2 70.260 1957 69.65103 0.3814859 0.60897203
3 71.240 1962 70.61513 0.3306617 0.62486713
4 71.520 1967 71.57924 0.2866904 -0.05923776
5 71.890 1972 72.54334 0.2531683 -0.65334266
6 72.220 1977 73.50745 0.2346180 -1.28744755
7 73.840 1982 74.47155 0.2346180 -0.63155245
8 74.320 1987 75.43566 0.2531683 -1.11565734
9 76.330 1992 76.39976 0.2866904 -0.06976224
10 77.550 1997 77.36387 0.3306617 0.18613287
11 79.110 2002 78.32797 0.3814859 0.78202797
12 80.204 2007 79.29208 0.4367774 0.91192308

        .hat    .sigma      .cooksd .std.resid
1 0.29487179 0.8006048 0.2265612817 1.04093898
2 0.22494172 0.8159022 0.1073195744 0.85997661
3 0.16899767 0.8164883 0.0738472863 0.85220295
4 0.12703963 0.8475929 0.0004520957 -0.07882389
5 0.09906760 0.8162209 0.0402635628 -0.85575882
6 0.08508159 0.7194198 0.1302005662 -1.67338100
7 0.08508159 0.8187927 0.0313308824 -0.82087061
8 0.09906760 0.7519001 0.1174064153 -1.46130610
9 0.12703963 0.8474910 0.0006270092 -0.09282813
10 0.16899767 0.8451201 0.0065524741 0.25385073
11 0.22494172 0.7944728 0.1769818895 1.10436232
12 0.29487179 0.7666941 0.3811504335 1.35014569

Using those functions was easy. Let’s now get them to work repeatedly for each country in the data set. The dplyr package, by Hadley Wickham and Romain Francois, provides an excellent set of tools for group-by processing. The dplyr package was loaded into memory as part of the tidyverse package used above. First we prepare the gapminder data set by using the group_by() function and telling it what variable(s) make up our groups:

> by_country <- 
+   group_by(gapminder, country)

Now any other function in the dplyr package will understand that the by_country data set contains groups, and it will process the groups separately when appropriate. However, we want to use the lm() function, and that does not understand what a grouped data frame is. Luckily, the dplyr package has a do() function that takes care of that problem, feeding any function only one group at a time. It uses the period “.” to represent each data frame in turn. The do() function wants the function it’s doing to return a data frame, but that’s exactly what broom’s functions do.

Let’s repeat the three broom functions, this time by country. We’ll start with glance().

> do(by_country, 
+    glance( 
+       lm(lifeExp ~ year, data = .)))

Source: local data frame [142 x 12]
Groups: country [142]

      country r.squared adj.r.squared sigma
    <fctr>      <dbl>      <dbl>    <dbl>
1 Afghanistan 0.9477123     0.9424835 1.2227880
2 Albania     0.9105778     0.9016355 1.9830615
3 Algeria     0.9851172     0.9836289 1.3230064
4 Angola      0.8878146     0.8765961 1.4070091
5 Argentina   0.9955681     0.9951249 0.2923072
6 Australia   0.9796477     0.9776125 0.6206086
7 Austria     0.9921340     0.9913474 0.4074094
8 Bahrain     0.9667398     0.9634138 1.6395865
9 Bangladesh  0.9893609     0.9882970 0.9766908
10 Belgium    0.9945406     0.9939946 0.2929025

# ... with 132 more rows, and 8 more variables:
# statistic <dbl>, p.value <dbl>, df <int>,
# logLik <dbl>, AIC <dbl>, BIC <dbl>,
# deviance <dbl>, df.residual <int>

Now rather than one row of output, we have a data frame with one row per country. Since it’s a data frame, we already know how to manage it. We could sort by R-squared, or correct the p-values for the number of models done using p.adjust(), and so on.
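As a minimal sketch of that kind of follow-up, the code below sorts the per-country results by R-squared and adds adjusted p-values. The Holm method and the column name p.holm are my own choices for illustration; any method that p.adjust() supports could be substituted.

```r
library("gapminder")
library("dplyr")
library("broom")

by_country <- group_by(gapminder, country)
model_fit  <- do(by_country, glance(lm(lifeExp ~ year, data = .)))

# Drop the grouping first; otherwise p.adjust() would only ever
# see one p-value at a time and the correction would do nothing.
model_fit <- ungroup(model_fit)
model_fit <- mutate(model_fit,
                    p.holm = p.adjust(p.value, method = "holm"))

# Put the worst-fitting models at the top.
worst_first <- arrange(model_fit, r.squared)
select(worst_first, country, r.squared, p.value, p.holm)
```

Note the ungroup() step: the result of do() is still grouped by country, so skipping it would quietly leave the p-values uncorrected.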

Next let’s look at the grouped parameter-level output that tidy() provides. This will be the same code as above, simply substituting tidy() where glance() had been.

> do(by_country, 
+    tidy( 
+      lm(lifeExp ~ year, data = .)))

Source: local data frame [284 x 6]
Groups: country [142]

       country        term      estimate    std.error
        <fctr>       <chr>         <dbl>        <dbl>
1  Afghanistan (Intercept)  -507.5342716 40.484161954
2  Afghanistan        year     0.2753287  0.020450934
3      Albania (Intercept)  -594.0725110 65.655359062
4      Albania        year     0.3346832  0.033166387
5      Algeria (Intercept) -1067.8590396 43.802200843
6      Algeria        year     0.5692797  0.022127070
7       Angola (Intercept)  -376.5047531 46.583370599
8       Angola        year     0.2093399  0.023532003
9    Argentina (Intercept)  -389.6063445  9.677729641
10   Argentina        year     0.2317084  0.004888791

# ... with 274 more rows, and 2 more variables:
# statistic <dbl>, p.value <dbl>

Again, this is a simple data frame allowing us to do whatever we need without learning anything new. We can easily search for models that contain a specific parameter that is significant. In our organization, we search through salary models that contain many parameters to see if gender is an important predictor (hoping to find none, of course).
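Here is a small sketch of such a search using the gapminder models. It keeps only the slope for year and lists the countries where that slope is not significant. The 0.05 cutoff and the object name flat_trend are assumptions made for the example.

```r
library("gapminder")
library("dplyr")
library("broom")

by_country <- group_by(gapminder, country)
params     <- do(by_country, tidy(lm(lifeExp ~ year, data = .)))

# Keep only the slope rows whose p-value misses the (assumed) 0.05 cutoff.
flat_trend <- filter(ungroup(params),
                     term == "year",
                     p.value >= 0.05)
flat_trend
```

Flipping the comparison to p.value < 0.05 would instead list the models in which the parameter is significant.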

Finally, let’s augment the original model data by adding predicted values, residuals and so on. As you might expect, it’s the same code, this time with augment() replacing the tidy() function.

> do(by_country, 
+ augment( 
+ lm(lifeExp ~ year, data = .)))
Source: local data frame [1,704 x 10]
Groups: country [142]

   country   lifeExp year .fitted  .se.fit
    <fctr>     <dbl> <int> <dbl>    <dbl>
1 Afghanistan 28.801 1952 29.90729 0.6639995
2 Afghanistan 30.332 1957 31.28394 0.5799442
3 Afghanistan 31.997 1962 32.66058 0.5026799
4 Afghanistan 34.020 1967 34.03722 0.4358337
5 Afghanistan 36.088 1972 35.41387 0.3848726
6 Afghanistan 38.438 1977 36.79051 0.3566719
7 Afghanistan 39.854 1982 38.16716 0.3566719
8 Afghanistan 40.822 1987 39.54380 0.3848726
9 Afghanistan 41.674 1992 40.92044 0.4358337
10 Afghanistan 41.763 1997 42.29709 0.5026799

# ... with 1,694 more rows, and 5 more variables:
# .resid <dbl>, .hat <dbl>, .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>

If we were to pull out just the results for New Zealand, we would see that we got exactly the same answer in the group_by result as we did when we analyzed that country by itself.
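One way to check that claim is sketched below: it refits the New Zealand model on its own, pulls the New Zealand rows out of the grouped result, and compares the fitted values and residuals. The object names are mine, and as.numeric() is used only to strip any names before comparing.

```r
library("gapminder")
library("dplyr")
library("broom")

# The single-country analysis.
nz       <- filter(gapminder, country == "New Zealand")
nz_model <- lm(lifeExp ~ year, data = nz)
nz_alone <- augment(nz_model)

# The same model fit for every country, then New Zealand pulled back out.
all_countries <- do(group_by(gapminder, country),
                    augment(lm(lifeExp ~ year, data = .)))
nz_grouped <- filter(ungroup(all_countries), country == "New Zealand")

same_fit   <- all.equal(as.numeric(nz_grouped$.fitted),
                        as.numeric(nz_alone$.fitted))
same_resid <- all.equal(as.numeric(nz_grouped$.resid),
                        as.numeric(nz_alone$.resid))
same_fit
same_resid
```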

We can save that augmented data to a file to reproduce one of the residual plots from R for Data Science.

> gapminder_augmented <-
+ do(by_country, 
+   augment( 
+     lm(lifeExp ~ year, data = .)))
> ggplot(gapminder_augmented, aes(year, .resid)) +
+   geom_line(aes(group = country), alpha = 1 / 3) + 
+   geom_smooth(se = FALSE)

`geom_smooth()` using method = 'gam'

This plots the residuals of each country’s model by year, drawing one line per country by setting “group=country”, and then adds a smoothed fit (geom_smooth) across all countries (blue line) by leaving out “group=country”. That’s a clever approach that I hadn’t thought of before!

The broom package has done several very helpful things. As we have seen, it contains all the smarts needed to extract the important parts of models at three different levels of detail. It doesn’t just do this for linear regression though. R’s methods() function will show you what types of models broom’s functions are currently capable of handling:

> methods(tidy) 
 [1] tidy.aareg* 
 [2] tidy.acf* 
 [3] tidy.anova* 
 [4] tidy.aov* 
 [5] tidy.aovlist* 
 [6] tidy.Arima* 
 [7] tidy.betareg* 
 [8] tidy.biglm* 
 [9] tidy.binDesign* 
[10] tidy.binWidth* 
[11] tidy.boot* 
[12] tidy.brmsfit* 
[13] tidy.btergm* 
[14] tidy.cch* 
[15] tidy.character* 
[16] tidy.cld* 
[17] tidy.coeftest* 
[18] tidy.confint.glht* 
[19] tidy.coxph* 
[20] tidy.cv.glmnet* 
[21] tidy.data.frame* 
[22] tidy.default* 
[23] tidy.density* 
[24] tidy.dgCMatrix* 
[25] tidy.dgTMatrix* 
[26] tidy.dist* 
[27] tidy.ergm* 
[28] tidy.felm* 
[29] tidy.fitdistr* 
[30] tidy.ftable* 
[31] tidy.gam* 
[32] tidy.gamlss* 
[33] tidy.geeglm* 
[34] tidy.glht* 
[35] tidy.glmnet* 
[36] tidy.glmRob* 
[37] tidy.gmm* 
[38] tidy.htest* 
[39] tidy.kappa* 
[40] tidy.kde* 
[41] tidy.kmeans* 
[42] tidy.Line* 
[43] tidy.Lines* 
[44] tidy.list* 
[45] tidy.lm* 
[46] tidy.lme* 
[47] tidy.lmodel2* 
[48] tidy.lmRob* 
[49] tidy.logical* 
[50] tidy.lsmobj* 
[51] tidy.manova* 
[52] tidy.map* 
[53] tidy.matrix* 
[54] tidy.Mclust* 
[55] tidy.merMod* 
[56] tidy.mle2* 
[57] tidy.multinom* 
[58] tidy.nlrq* 
[59] tidy.nls* 
[60] tidy.NULL* 
[61] tidy.numeric* 
[62] tidy.orcutt* 
[63] tidy.pairwise.htest* 
[64] tidy.plm* 
[65] tidy.poLCA* 
[66] tidy.Polygon* 
[67] tidy.Polygons* 
[68] tidy.power.htest* 
[69] tidy.prcomp* 
[70] tidy.pyears* 
[71] tidy.rcorr* 
[72] tidy.ref.grid* 
[73] tidy.ridgelm* 
[74] tidy.rjags* 
[75] tidy.roc* 
[76] tidy.rowwise_df* 
[77] tidy.rq* 
[78] tidy.rqs* 
[79] tidy.sparseMatrix* 
[80] tidy.SpatialLinesDataFrame* 
[81] tidy.SpatialPolygons* 
[82] tidy.SpatialPolygonsDataFrame*
[83] tidy.spec* 
[84] tidy.stanfit* 
[85] tidy.stanreg* 
[86] tidy.summary.glht* 
[87] tidy.summary.lm* 
[88] tidy.summaryDefault* 
[89] tidy.survexp* 
[90] tidy.survfit* 
[91] tidy.survreg* 
[92] tidy.table* 
[93] tidy.tbl_df* 
[94] tidy.ts* 
[95] tidy.TukeyHSD* 
[96] tidy.zoo* 
see '?methods' for accessing help and source code

Each of those models contains similar information, but it is often stored in a completely different data structure and named slightly different things, even when they’re nearly identical. While that covers a lot of model types, R has hundreds more. David Robinson, the package’s developer, encourages people to request adding additional ones by opening an issue here.

I hope I’ve made a good case that doing group-by analyses in R can be done easily through the combination of dplyr’s do() function and broom’s three functions. That approach handles the great majority of group-by problems that I’ve seen in my 35-year career. However, if your needs are not met by this approach, then I encourage you to read Chapter 25 of R for Data Science (update: in the printed version of the book, it’s Chapter 20, Many Models with purrr and broom.) But as the chapter warns, it will “stretch your brain!”
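For readers curious about what that chapter’s style looks like, here is a rough sketch of a nested-data version of the glance() example. It is my own condensed illustration using tidyr’s nest() and unnest() and purrr’s map(), not code taken from the book, and the column names model and fit are arbitrary.

```r
library("gapminder")
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

country_fits <- gapminder %>%
  group_by(country) %>%
  nest() %>%                  # one row per country; its data sits in a list-column
  mutate(
    model = map(data, ~ lm(lifeExp ~ year, data = .x)),  # one lm() per country
    fit   = map(model, glance)                           # one-row summary per model
  ) %>%
  ungroup() %>%
  select(country, fit) %>%
  unnest(fit)                 # back to an ordinary data frame

country_fits
```

The appeal of this style is that the fitted models themselves are kept in a column, so they can be reused for tidy() or augment() without refitting.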

If your organization is interested in a hands-on workshop that covers many similar topics, please drop me a line. Have fun with your data analyses!

Keeping Up with Your Data Science Options

The field of data science is changing so rapidly that it’s quite hard to keep up with it all. When I first started tracking The Popularity of Data Science Software in 2010, I followed only ten packages, all of them classic statistics software. The term data science hadn’t caught on yet, and data mining was still a new thing. One of my recent blog posts covered 53 packages, and choosing them from a list of around 100 was a tough decision!

To keep up with the rapidly changing field, you can read the information on a package’s web site, see what people are saying on blog aggregators such as R-Bloggers.com or StatsBlogs.com, and if it sounds good, download a copy and try it out. What’s much harder to do is figure out how they all relate to one another. A helpful source of information on that front is the book Disruptive Analytics, by Thomas Dinsmore.

I was lucky enough to be the technical reviewer for the book, during which time I ended up reading it twice. I still refer to it regularly as it covers quite a lot of material. In a mere 262 pages, Dinsmore manages to describe each of the following packages, how they relate to one another, and how they fit into the big picture of data science:

  • Alluxio
  • Alpine Data
  • Alteryx
  • APAMA
  • Apex
  • Arrow
  • Caffe
  • Cloudera
  • Deeplearning4J
  • Drill
  • Flink
  • Giraph
  • Hadoop
  • HAWQ
  • Hive
  • IBM SPSS Modeler
  • Ignite
  • Impala
  • Kafka
  • KNIME Analytics Platform
  • Kylin
  • MADLib
  • Mahout
  • MapR
  • Microsoft R Server
  • Phoenix
  • Pig
  • Python
  • R
  • RapidMiner
  • Samza
  • SAS
  • SINGA
  • Skytree Server
  • Spark
  • Storm
  • Tajo
  • TensorFlow
  • Tez
  • Theano
  • Trafodion

As you can tell from the title, a major theme of the book is how open source software is disrupting the data science marketplace. Dinsmore’s blog, ML/DL: Machine Learning, Deep Learning, extends the book’s coverage as data science software changes from week to week.

I highly recommend both the book and the blog. Have fun keeping up with the field!

Python and R Vie for Top Spot in Kaggle Competitions

I’ve just updated the Competition Use section of The Popularity of Data Science Software. Here’s just that section for your convenience.

Competition Use

Kaggle.com is a web site that sponsors data science contests. People post problems there along with the amount of money they are willing to pay the person or team who solves their problem the best. Both money and the competitors’ reputations are on the line, so there’s strong motivation to use the best possible tools. Figure 9 compares the usage of the top two tools chosen by the data scientists working on the problems. From April 2015 through July 2016, we see the usage of both R and Python growing at a similar rate. At the most recent time point Python has pulled ahead slightly. Much more detail is available here.

Figure 9. Software used in data science competitions on Kaggle.com in 2015 and 2016.

The Tidyverse Curse

I’ve just finished a major overhaul of my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants in my workshops, and how those complaints can often be mitigated. Here’s the only new section:

The Tidyverse Curse

There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (e.g. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
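As a small illustration of the grouped-tibble caveat, the sketch below runs the same summarise() call on the built-in mtcars data twice, once grouped and once after ungroup(). The object names are mine; the point is only that identical code gives different answers depending on grouping that you cannot see in the call itself.

```r
library("dplyr")

by_cyl <- group_by(mtcars, cyl)

# Same summarise() call, different results depending on the grouping.
per_group <- summarise(by_cyl, mean_mpg = mean(mpg))           # one row per cyl value
overall   <- summarise(ungroup(by_cyl), mean_mpg = mean(mpg))  # a single row

per_group
overall
```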

A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example that prints the built-in mtcars data set with R’s built-in print function:

> print(mtcars)
                  mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0 6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0 6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8 4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4 6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1 6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method.  It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
   mpg   cyl  disp    hp  drat    wt  qsec    vs   am   gear  carb
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 21.0   6   160.0  110  3.90 2.620 16.46    0     1     4     4
 2 21.0   6   160.0  110  3.90 2.875 17.02    0     1     4     4
 3 22.8   4   108.0   93  3.85 2.320 18.61    1     1     4     1
 4 21.4   6   258.0  110  3.08 3.215 19.44    1     0     3     1
 5 18.7   8   360.0  175  3.15 3.440 17.02    0     0     3     2
 6 18.1   6   225.0  105  2.76 3.460 20.22    1     0     3     1
 7 14.3   8   360.0  245  3.21 3.570 15.84    0     0     3     4
 8 24.4   4   146.7   62  3.69 3.190 20.00    1     0     4     2
 9 22.8   4   140.8   95  3.92 3.150 22.90    1     0     4     2
10 19.2   6   167.6  123  3.92 3.440 18.30    1     0     4     4
# ... with 22 more rows

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) : 
 could not find function "rownames_to_column"

Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone.)

> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
 rowname            mpg   cyl disp    hp   drat   wt   qsec   vs    am   gear carb
  <chr>            <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4         21.0   6   160.0  110  3.90  2.620 16.46   0     1     4     4
2 Mazda RX4 Wag     21.0   6   160.0  110  3.90  2.875 17.02   0     1     4     4
3 Datsun 710        22.8   4   108.0   93  3.85  2.320 18.61   1     1     4     1
4 Hornet 4 Drive    21.4   6   258.0  110  3.08  3.215 19.44   1     0     3     1
5 Hornet Sportabout 18.7   8   360.0  175  3.15  3.440 17.02   0     0     3     2
6 Valiant           18.1   6   225.0  105  2.76  3.460 20.22   1     0     3     1
7 Duster 360        14.3   8   360.0  245  3.21  3.570 15.84   0     0     3     4
8 Merc 240D         24.4   4   146.7   62  3.69  3.190 20.00   1     0     4     2
9 Merc 230          22.8   4   140.8   95  3.92  3.150 22.90   1     0     4     2
10 Merc 280         19.2   6   167.6  123  3.92  3.440 18.30   1     0     4     4
# ... with 22 more rows

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.

In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them because they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
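Here is a minimal sketch of that round trip using the tibble package’s functions. To keep it short it starts from the plain mtcars data frame rather than a tibble, and the object names are my own.

```r
library("tibble")

# Move the row names into an ordinary column (named "rowname" by default).
cars_with_names <- rownames_to_column(mtcars)
names(cars_with_names)[1]

# Later, put them back into the row names position.
restored <- column_to_rownames(cars_with_names, var = "rowname")
head(rownames(restored))
```

After the second step the rowname column is gone again, so anything that relies on it must be run before restoring the row names.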

Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’ “ID rowname;” statement. That’s less to learn.

This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
 lyrics
 <chr>
1 a long long time ago i can still remember how that music used

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
 ... <truncated>
1 a long long time ago i can still remember how that music used 
to make me smile and i knew if i had my chance that i could make 
those people dance and maybe theyd be happy for a while but 
february made me shiver with every paper id deliver bad news on 
the doorstep i couldnt take one more step i cant remember if i cried 
...

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.

Forrester’s 2017 Take on Tools for Data Science

In my ongoing quest to track The Popularity of Data Science Software, I’ve updated the discussion of the annual report from Forrester, which I repeat here to save you from having to read through the entire document. If your organization is looking for training in the R language, you might consider my books, R for SAS and SPSS Users or R for Stata Users, or my on-site workshops.

Forrester Research, Inc. is a company that provides reports which analyze the competitive position of tools for data science. The conclusions from their 2017 report, Forrester Wave: Predictive Analytics and Machine Learning Solutions, are summarized in Figure 3b. On the x-axis they list the strength of each company’s strategy, while the y-axis measures the strength of their current offering. The size and shading of the circles around each data point indicate the strength of each vendor in the marketplace (70% vendor size, 30% ISV and service partners).

As with the Gartner 2017 report discussed above, IBM, SAS, KNIME, and RapidMiner are considered leaders. However, Forrester sees several more companies in this category: Angoss, FICO, and SAP. This is quite different from the Gartner analysis, which places Angoss and SAP in the middle of the pack, while FICO is considered a niche player.

Figure 3b. Forrester Wave plot of predictive analytics and machine learning software.

In their Strong Performers category, they have H2O.ai, Microsoft, Statistica, Alpine Data, Dataiku, and, just barely, Domino Data Lab. Gartner rates Dataiku quite a bit higher, but they generally agree on the others. The exception is that Gartner dropped coverage of Alpine Data in 2017. Finally, Salford Systems is in the Contenders section. Salford was recently purchased by Minitab, a company that has never been rated by either Gartner or Forrester before, as it focused on being a statistics package rather than expanding into machine learning or artificial intelligence tools as most other statistics packages have (another notable exception: Stata). It will be interesting to see how they’re covered in future reports.

Compared to last year’s Forrester report, KNIME shot up from barely being a Strong Performer into the Leader’s segment. RapidMiner and FICO moved from the middle of the Strong Performers segment to join the Leaders. The only other major move was a lateral one for Statistica, whose score on Strategy went down while its score on Current Offering went up (last year Statistica belonged to Dell, this year it’s part of Quest Software.)

The size of the “market presence” circle for RapidMiner indicates that Forrester views its position in the marketplace to be as strong as that of IBM and SAS. I find that perspective quite a stretch indeed!

Alteryx, Oracle, and Predixion were all dropped from this year’s Forrester report. They mention Alteryx and Oracle as having “capabilities embedded in other tools,” implying that that is not the focus of this report. No mention was made of why Predixion was dropped, but considering that Gartner also dropped coverage of them in 2017, it doesn’t bode well for the company.

For a much more detailed analysis, see Thomas Dinsmore’s  blog.

Jobs for “Data Science” Up 7-fold, for “Statistician” Down by Half

The Bureau of Labor Statistics projects that jobs for statisticians will grow by 34% between 2014 and 2024. However, according to the nation’s largest job web site, the number of companies looking for “statisticians” is actually in sharp decline. Those jobs are likely being replaced by postings for “data scientists.”

I regularly monitor the Popularity of Data Science Software, and as an offshoot of that project, I collected data that helps us understand how the term “data science” is defined. I began by finding jobs that required expertise in software used for data science such as R or SPSS. I then examined the tasks that the jobs entailed, such as “analyze data,” and looked up jobs based only on one task at a time. I switched back and forth between searching for software and for the terms used to describe the jobs, until I had a comprehensive list of both.  In the end, I had searched for over 50 software packages and over 40 descriptive terms or tasks. I had also skimmed thousands of job advertisements. (Additional details are here).

 

Search Terms 2/26/2017 2/17/2014 Ratio
Big Data 20,646            10,378 1.99
Data analytics 15,774              6,209 2.54
Machine learning 12,499              3,658 3.42
Statistical analysis 11,397              9,719 1.17
Data mining   9,757              7,776 1.25
Data Science 6,873                  973 7.06
Quantitative analysis 4,095              3,365 1.22
Business analytics  4,043              2,867 1.41
Advanced Analytics 3,479              1,497 2.32
Data Scientist 3,272                 974 3.36
Statistical software 2,835              2,102 1.35
Predictive analytics                 2,411              1,497 1.61
Artificial intelligence  2,404                 794 3.03
Predictive modeling 2,264              1,804 1.25
Statistical modeling 2,040              1,462 1.40
Quantitative research                 1,837              1,380 1.33
Research analyst                 1,756              1,722 1.02
Statistical tools                 1,414              1,121 1.26
Statistician 904              1,711 0.53
Statistical packages                    784                 559 1.40
Survey research 440                 559 0.79
Quantitative modeling                    352                  322 1.09
Statistical research 208                  174 1.20
Statistical computing                    153                  108 1.42
Research computing                    133                    97 1.37
Statistical analyst                    125                  141 0.89
Data miner 34                    19 1.79

Many terms were used outside the realm of data science. Other terms were used both in data science jobs and in jobs that require little analytic skill. Terms that could not be used to specifically find data science jobs were: analytics, data visualization, graphics, data graphics, statistics, statistical, survey, research associate, and business intelligence. One term, econometric(s), required deep analytical skills, but was too focused on one field.

The search terms that were well focused on data science, but not overly focused on a single field, are listed in the table above. The table is sorted by the number of jobs found on Indeed.com on February 26, 2017. While each column displays counts taken on a single day, the large size of Indeed.com’s database of jobs keeps its counts stable. The correlation between the logs of the two counts is quite strong, r = .95, p = 4.7e-14.
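For readers who want to check the correlation themselves, here is a minimal Python sketch. It assumes the 27 pairs of counts are copied by hand from the table, and it uses a small hand-written Pearson function (the name `pearson` is mine, not something from the original analysis). It does not attempt to reproduce the p-value.

```python
import math

# (2/26/2017 count, 2/17/2014 count) pairs, copied from the table.
counts = [
    (20646, 10378), (15774, 6209), (12499, 3658), (11397, 9719),
    (9757, 7776), (6873, 973), (4095, 3365), (4043, 2867),
    (3479, 1497), (3272, 974), (2835, 2102), (2411, 1497),
    (2404, 794), (2264, 1804), (2040, 1462), (1837, 1380),
    (1756, 1722), (1414, 1121), (904, 1711), (784, 559),
    (440, 559), (352, 322), (208, 174), (153, 108),
    (133, 97), (125, 141), (34, 19),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Take logs first: the raw counts are heavily skewed toward a few big terms.
log_2017 = [math.log(c2017) for c2017, _ in counts]
log_2014 = [math.log(c2014) for _, c2014 in counts]

r = pearson(log_2017, log_2014)
print(f"r = {r:.2f}")
```

The choice of log base does not matter here, since changing the base only rescales both variables and leaves the correlation unchanged. The result should land close to the r = .95 reported above.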

During this three-year period, the overall unemployment rate dropped from 6.7% to 4.7%, indicating a period of job growth for most fields. Three terms grew very rapidly indeed, with “data science” growing 7-fold, and both “data scientist” and “artificial intelligence” tripling in size. The biggest surprise was that the use of the term “statistician” took a huge hit, dropping to only 53% of its former value.
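The growth figures quoted above are simply the 2017 count divided by the 2014 count. As a small illustrative sketch in Python (the dictionary holds only the four terms mentioned in this paragraph, copied from the table, and the variable names are my own):

```python
# (2/26/2017 count, 2/17/2014 count) for the terms discussed above.
selected = {
    "Data Science": (6873, 973),
    "Data Scientist": (3272, 974),
    "Artificial intelligence": (2404, 794),
    "Statistician": (904, 1711),
}

# Ratio > 1 means the term appeared in more job ads in 2017 than in 2014.
ratios = {
    term: round(count_2017 / count_2014, 2)
    for term, (count_2017, count_2014) in selected.items()
}

for term, ratio in ratios.items():
    print(f"{term:<25} {ratio:.2f}")
```

A ratio of 7.06 for “data science” is the 7-fold growth, and 0.53 for “statistician” is the drop to 53% of its former value.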

That table covers a wide range of terms, but only on two dates. What does the long-term trend look like? Indeed.com has a trend-tracking page that lets us answer that question. The figure below shows the solid growth in the percentage of advertisements that used the term “data scientist” (blue, top right), while those using the term “statistician” (yellow, lower right) are steadily declining.

The plot on the company’s site is interactive (the one shown here is not), allowing me to see that the most recent data points were recorded on December 27, 2016. On that date, the percentage of jobs for data scientist was 474% of that for statistician.

As an accredited professional statistician, am I worried about this trend? Not at all. Statistical analysis software has broadened its scope to include many new capabilities, including machine learning, artificial intelligence, Structured Query Language, advanced visualization techniques, and interfaces to Python, R, and Apache Spark. The software has changed because the job known as “statistician” has changed. Statisticians aren’t going away; their jobs are evolving into what we now know as data science. And that field is growing quite nicely!