I’ve updated The Popularity of Data Science Software’s market share estimates based on scholarly articles. I’ve posted the new section below, so you don’t have to sift through the main article to read it.
Scholarly Articles
Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool or even as an object of study.
Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.
Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 4,500 articles) in the most recent complete year, 2022.
Figure 2a. The number of scholarly articles found on Google Scholar for data science software. Only those with more than 4,500 citations are shown.
SPSS is the most popular package, as it has been for over 20 years. This may be due to its balance between power and its graphical user interface’s (GUI) ease of use. R is in second place with around two-thirds as many articles. It offers extreme power, but as with all languages, it requires memorizing and typing code. GraphPad Prism, another GUI-driven package, is in third place. The packages from MATLAB through TensorFlow are roughly at the same level. Next come Python and Scikit Learn. The latter is a library for Python, so there is likely much overlap between those two. Note that the general-purpose languages (C, C++, C#, FORTRAN, Java, MATLAB, and Python) are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest. Old stalwart FORTRAN appears last in this plot. While its count seems close to zero, that’s due to the wide range of this scale, and its count is just over the 4,500-article cutoff for this plot.
Continuing on this scale would make the remaining packages appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 4,500 rather than the 110,000 used in Figure 2a. I chose that cutoff value because it allows us to see two related sets of tools on the same plot: workflow tools and GUIs for the R language that make it work much like SPSS.
Figure 2b. Number of scholarly articles using each data science software found using Google Scholar. Only those with fewer than 4,500 citations are shown.
JASP and jamovi are both front-ends to the R language and are way out front in this category. The next R GUI is R Commander, with half as many citations. Still, that’s far more than the rest of the R GUIs: BlueSky Statistics, Rattle, RKWard, R-Instat, and R AnalyticFlow. While many of these have low counts, we’ll soon see that the use of nearly all is rapidly growing.
Workflow tools are controlled by drawing 2-dimensional flowcharts that direct the flow of data and models through the analysis process. That approach is slightly more complex to learn than SPSS’ simple menus and dialog boxes, but it gets closer to the complete flexibility of code. In order of citation count, these include RapidMiner, KNIME, Orange Data Mining, IBM SPSS Modeler, SAS Enterprise Miner, Alteryx, and R AnalyticFlow. From RapidMiner to KNIME, to SPSS Modeler, the citation rate approximately cuts in half each time. Orange Data Mining comes next, at around 30% less. KNIME, Orange, and R AnalyticFlow are all free and open-source.
While Figures 2a and 2b help study market share now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each software, but collecting that much data is too time-consuming. Instead, I’ve collected data only for the years 2019 and 2022. This provides the data needed to study growth over that period.
Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side) and the declining or “cooling” ones shown in blue (left side).
Figure 2c. Change in Google Scholar citation rate from 2019 to the most recent complete year, 2022. BlueSky (2,960%) and jamovi (452%) growth figures were shrunk to make the plot more legible.
Seven of the 14 fastest-growing packages are GUI front-ends that make R easy to use. BlueSky’s actual percent growth was 2,960%, which I recoded as 220% as the original value made the rest of the plot unreadable. In 2022 the company released a Mac version, and the Mayo Clinic announced its migration from JMP to BlueSky; both likely had an impact. Similarly, jamovi’s actual growth was 452%, which I recoded to 200%. One of the reasons the R GUIs were able to obtain such high percentages of change is that they were all starting from low numbers compared to most of the other software. So be sure to look at Figure 2b to see the raw counts for all the R GUIs.
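The recoding described above can be illustrated with a minimal Python sketch. The actual and plotted percentages are the ones reported in the text; the dictionary layout and the function name `display_growth` are assumptions made for illustration, not the procedure actually used to build Figure 2c.

```python
# Illustrative sketch only: substitute a smaller plotted value for extreme
# growth figures so they don't compress the rest of the chart.
# The numbers come from the text; the structure and names are hypothetical.

ACTUAL_GROWTH = {"BlueSky Statistics": 2960.0, "jamovi": 452.0}

# Manual overrides used for display (actual values stay in the caption).
DISPLAY_OVERRIDES = {"BlueSky Statistics": 220.0, "jamovi": 200.0}


def display_growth(package: str, actual_pct: float) -> float:
    """Return the value to plot: the manual override if one exists,
    otherwise the actual percent change."""
    return DISPLAY_OVERRIDES.get(package, actual_pct)


for name, pct in ACTUAL_GROWTH.items():
    shown = display_growth(name, pct)
    print(f"{name}: actual {pct:,.0f}%, plotted as {shown:.0f}%")
```

Keeping the actual values alongside the plotted ones, as the figure caption does, lets readers see that the bars were shortened for legibility only.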
The most impressive point on this plot is the one for PyTorch. Back in Figure 2a we see that PyTorch was the fifth most popular tool for data science. Here we see it’s also the third fastest growing. Being big and growing fast is quite an achievement!
Of the workflow-based tools, Orange Data Mining is growing the fastest. There is a good chance that the next time I collect this data Orange will surpass SPSS Modeler.
The big losers in Figure 2c are the expensive proprietary tools: SPSS, GraphPad Prism, SAS, BMDP, Stata, Statistica, and Systat. However, open-source R is also declining, perhaps a victim of Python’s rising popularity.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d, I have plotted the same scholarly-use data for 1995 through 2016.
Figure 2d. The number of Google Scholar citations for each classic statistics package per year from 1995 through 2016.
SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009, and its use is in sharp decline. SAS never came close to SPSS’s level of dominance, and its usage peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.
In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.
Figure 2e. The number of Google Scholar citations for each classic statistics package from 1995 through 2016, with SPSS removed and SAS included only in 2014 and 2015. Removing SPSS and most of the SAS data expands the scale, making it easier to see the rapid growth of the less popular packages.
Figure 2e shows that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. The decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this graph.
These results apply to scholarly articles in general. The results in specific fields or journals are likely to differ.
You can read the entire Popularity of Data Science Software here; the above discussion is just one section.
I have recently updated my extensive analysis of the popularity of data science software. This update covers perhaps the most important section, the one that measures popularity based on the number of job advertisements. I repeat it here as a blog post, so you don’t have to read the entire article.
Job Advertisements
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what will become more popular in the future.
Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As their co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g., scikit-learn, Apache Spark), while others are used in data science jobs and, more broadly, in report-writing jobs (e.g., SAS, Tableau). General-purpose languages (e.g., Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that require them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the results in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on October 5, 2022. To measure percent change, I compare that to data collected on May 27, 2019. One might think that a sample taken on a single day would not be very stable, but such samples are. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002. I occasionally double-check some counts a month or so later and always get similar figures.
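For readers who want to run a similar stability check on their own counts, here is a minimal sketch of a Pearson correlation in plain Python. The two lists of counts are made-up placeholders, not the data behind the r=.94 figure, and the function name `pearson_r` is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical job counts for the same five packages on two collection dates.
counts_first = [100, 250, 400, 900, 1500]
counts_second = [110, 240, 430, 880, 1550]

print(f"r = {pearson_r(counts_first, counts_second):.3f}")
```

A value near 1 means the packages keep roughly the same relative standing across the two dates, which is the sense of stability described above.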
The number of jobs covers a very wide range from zero to 164,996, with a mean of 11,653.9 and a median of 845.0. The distribution is so skewed that placing them all on the same graph makes reading values difficult. Therefore, I split the graph into three, each with a different scale. A single plot with a logarithmic scale would be an alternative, but when I asked some mathematically astute people how various packages compared on such a plot, they were so far off that I dropped that approach.
Figure 1a shows the most popular tools, those with at least 10,000 jobs. SQL is in the lead with 164,996 jobs, followed by Python with 150,992 and Java with 113,944. Next comes a set from C++/C# at 48,555, slowly declining to Microsoft’s Power BI at 38,125. Tableau, one of Power BI’s major competitors, is in that set. Next comes R and SAS, both around 24K jobs, with R slightly in the lead. Finally, we see a set slowly declining from MATLAB at 17,736 to Scala at 11,473.
Figure 1a. Number of data science jobs for the more popular software (>= 10,000 jobs).
Figure 1b covers tools for which there are between 250 and 10,000 jobs. Alteryx and Apache Hive are at the top, both with around 8,400 jobs. There is quite a jump down to Databricks at 6,117 then much smaller drops from there to Minitab at 3,874. Then we see another big drop down to JMP at 2,693 after which things slowly decline until MLlib at 274.
Figure 1b. Number of jobs for less popular data science software tools, those with between 250 and 10,000 jobs.
The least popular set of software, those with fewer than 250 jobs, is displayed in Figure 1c. It begins with DataRobot and SAS’ Enterprise Miner, both near 182. That’s followed by Apache Mahout with 160, WEKA with 131, and Theano at 110. From RapidMiner on down, there is a slow decline until we finally hit zero at WPS Analytics. The latter is a version of the SAS language, so advertisements are likely to always list SAS as the required skill.
Figure 1c. Number of jobs for software having fewer than 250 advertisements.
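The three-way split used for Figures 1a through 1c can be sketched in a few lines of Python. The thresholds and job counts are taken from the text above; the function name `panel_for` and the handling of counts that fall exactly on a boundary are assumptions.

```python
# Illustrative sketch: assign each package to one of the three plots
# based on its job count. Thresholds follow the text (at least 10,000;
# between 250 and 10,000; fewer than 250). Boundary handling is assumed.

def panel_for(jobs: int) -> str:
    """Return the figure in which a package with this many jobs would appear."""
    if jobs >= 10_000:
        return "Figure 1a"
    if jobs >= 250:
        return "Figure 1b"
    return "Figure 1c"


# A few job counts quoted in this section.
job_counts = {
    "SQL": 164_996,
    "Python": 150_992,
    "Scala": 11_473,
    "Databricks": 6_117,
    "JMP": 2_693,
    "MLlib": 274,
    "Apache Mahout": 160,
    "WPS Analytics": 0,
}

for package, jobs in job_counts.items():
    print(f"{package:>14}: {jobs:>7,} jobs -> {panel_for(jobs)}")
```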
Several tools use the powerful yet easy workflow interface: Alteryx, KNIME, Enterprise Miner, RapidMiner, and SPSS Modeler. The scale of their counts is too broad to make a decent graph, so I have compiled those values in Table 1. There we see Alteryx is extremely dominant, with 30 times as many jobs as its closest competitor, KNIME. The latter is around 50% greater than Enterprise Miner, while RapidMiner and SPSS Modeler are tiny by comparison.
Enterprise Miner
SPSS Modeler
Table 1. Job counts for workflow tools.
Let’s take a similar look at packages whose traditional focus was on statistical analysis. They have all added machine learning and artificial intelligence methods, but their reputation still lies mainly in statistics. We saw previously that when we consider the entire range of data science jobs, R was slightly ahead of SAS. Table 2 shows jobs with only the term “statistician” in their description. There we see that SAS comes out on top, though with such a tiny margin over R that you might see the reverse depending on the day you gather new data. Both are over five times as popular as Stata or SPSS, and ten times as popular as JMP. Minitab seems to be the only remaining contender in this arena.
Jobs only for “Statistician”
Table 2. Number of jobs for the search term “statistician” and each software.
Next, let’s look at the change in jobs from the 2019 data to now (October 2022), focusing on software that had at least 50 job listings back in 2019. Without such a limitation, software that increased from 1 job in 2019 to 5 jobs in 2022 would have a 400% increase but still would be of little interest. Percent change ranged from -64.0% to 2,479.9%, with a mean of 306.3 and a median of 213.6. There were two extreme outliers, IBM Watson, with apparent job growth of 2,479.9%, and Databricks, at 1,323%. Those two were so much greater than the rest that I left them off of Figure 1d to keep them from compressing the remaining values beyond legibility. The rapid growth of Databricks has been noted elsewhere. However, I would take IBM Watson’s figure with a grain of salt as its growth in revenue seems nowhere near what Indeed.com’s job figure seems to indicate.
The remaining software is shown in Figure 1d, where those whose job market is “heating up” or growing are shown in red, while those that are cooling down are shown in blue. The main takeaway from this figure is that nearly the entire data science software market has grown over the last 3.5 years. At the top, we see Alteryx, with a growth of 850.7%. Splunk (702.6%) and Julia (686.2%) follow. To my surprise, FORTRAN follows, having gone from 195 jobs to 1,318, yielding growth of 575.9%! My supercomputing colleagues assure me that FORTRAN is still important in their area, but HPC is certainly not growing at that rate. If any readers have ideas on why this could occur, please leave your thoughts in the comments section below.
Figure 1d. Percent change in job listings from May 2019 to October 2022. Only software that had at least 50 jobs in 2019 is shown. IBM (2,480%) and Databricks (1,323%) are excluded to maintain the legibility of the remaining values.
SQL and Java are both growing at around 537%. From Dataiku on down, the rate of growth slows steadily until we reach MLlib, which saw almost no change. Only two packages declined in job advertisements: WEKA at -29.9% and Theano at -64.1%.
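As a worked example of the percent-change figures in this section, the following Python sketch reproduces the FORTRAN number from its two job counts and applies the 50-job minimum for 2019. The function names and the choice to return `None` for excluded packages are assumptions for illustration.

```python
from typing import Optional

MIN_BASELINE_JOBS = 50  # packages below this 2019 count are left out


def percent_change(old: int, new: int) -> float:
    """Percent change from the old count to the new count."""
    return (new - old) / old * 100.0


def growth_if_eligible(old: int, new: int,
                       min_baseline: int = MIN_BASELINE_JOBS) -> Optional[float]:
    """Return the percent change, or None when the baseline is too small
    for the change to be meaningful."""
    if old < min_baseline:
        return None
    return percent_change(old, new)


# FORTRAN went from 195 jobs in 2019 to 1,318 in 2022.
fortran = growth_if_eligible(195, 1318)
print(f"FORTRAN growth: {fortran:.1f}%")  # prints "FORTRAN growth: 575.9%"

# A package with only a handful of jobs in the baseline year is skipped.
print(growth_if_eligible(3, 12))  # prints "None"
```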
This wraps up my analysis of software popularity based on jobs. You can read my ten other approaches to this task at https://r4stats.com/articles/popularity/. Many of those are based on older data, but I plan to update them in the first quarter of 2023, when much of the needed data will become available. To receive notice of such updates, subscribe to this blog, or follow me on Twitter: https://twitter.com/BobMuenchen.
One of us (Muenchen) has been tracking The Popularity of Data Science Software using a variety of different approaches. One approach is to use Google Scholar to count the number of scholarly articles found each year for each software. He chose Google Scholar since it searches “across many disciplines and sources: articles, theses, books, abstracts, and court opinions, from academic publishers, professional societies, online repositories, universities, and other web sites.” Figure 1 shows the results from 1995 through 2016. Data collected in 2018 showed that while SPSS use dropped 39% from 2017 to 2018, its use was still 66% higher than R in 2018.
Figure 1. Number of citations per year for each statistics package, found by Google Scholar, from 1995 to 2016.
We see in the plot that SPSS was extremely dominant for most of that time period. Even after its precipitous decline, it still beats the rest by more than a 2 to 1 margin. Over the years, several people questioned the accuracy of Figure 1. In a time when scholarly publications are proliferating, how could SPSS use be in such decline?
One hypothesis that has often been suggested revolves around one of the most bizarre product name changes in the history of marketing. As a result of a legal battle for control of the name “SPSS”, the SPSS company changed the name of the product to “PASW”, an acronym for Predictive Analytics Software. The change made about as much sense as the Coke people renaming Coke to “BSW”, for Bubbly Sugar Water. The battle was settled and in 2011 the product name reverted to SPSS.
Could that name change account for the apparent decline in its use? A search on Google Scholar from 2009 to 2012 on the string:
yielded 12,000 hits. That sounds like quite a few, but when “SPSS” was substituted for “PASW” in that search, we found 701,000 references. At first glance, it seems that the scholarly use of SPSS was undercounted by 1.7%. However, when searching a vast volume of documents, each string may have problems with over-counting. For example, PASW stands for “Plant Available Soil Water” which accounts for 138 of those 12,000 articles. There may be many other such abbreviations. That’s the type of analysis Muenchen did several years ago, before concluding that PASW was more trouble than it was worth (details are here). In 2018 that search yields only 361 hits, and the title of the very first article begins with, “Projections Analysis of Surface Waves (PASW)…”
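The 1.7% figure follows directly from the two hit counts. The short Python sketch below redoes that arithmetic and also shows how little the estimate moves when the 138 “Plant Available Soil Water” articles are removed. The variable and function names are illustrative assumptions.

```python
# Hit counts reported in the text for 2009 through 2012.
pasw_hits = 12_000      # search using "PASW"
spss_hits = 701_000     # same search using "SPSS"
soil_water_hits = 138   # "Plant Available Soil Water" false positives


def share_pct(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return part / whole * 100.0


raw = share_pct(pasw_hits, spss_hits)
adjusted = share_pct(pasw_hits - soil_water_hits, spss_hits)

print(f"PASW hits relative to SPSS hits: {raw:.1f}%")
print(f"After removing known false positives: {adjusted:.2f}%")
```

Either way, the PASW articles are a small fraction of the SPSS total, far too small to account for the decline seen in Figure 1.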
Muenchen’s hypothesis regarding the apparent decline of SPSS is that it was caused by competition. Back in 2002, SPSS shared the statistical software market with SAS and a couple of others. Its momentum carried it upward for a few more years, then the competition started chipping away at it. GraphPad Prism improved significantly with the release of its version 5 in 2007 and medical users of SPSS found an alternative that was as easy to use while focusing more on their needs. R added enough useful packages around the same time to become competitive. By now there are probably hundreds of packages that people can use to analyze data, only a few of which are shown in Figure 1.
Mackinnon remained skeptical of this hypothesis because the overall graph appears to show decreases in statistical software citation over time. This would seem to contradict evidence that the number of journal articles published has been increasing at about 3% per year over the last 3 centuries, and about 3.9% per year in the past decade (2018 STM Report, pg. 25). Thus, the total number of citations to statistical software as a collective group should be increasing concurrently with this overall increase.
Mackinnon gathered data from a different source: Scopus. According to Wikipedia, “Scopus covers nearly 36,377 titles from approximately 11,678 publishers, of which 34,346 are peer-reviewed journals in top-level subject fields: life sciences, social sciences, physical sciences, and health sciences.” Mackinnon limited the search to reference lists, reasoning that such citations are likely an indicator of using the software in the paper. Two search strings were used:
REF(“the R software” OR “the R project” OR “r-project.org” OR “R development core”)
These searches are being a bit generous to SPSS by including Modeler and AMOS, and very conservative for R by not including citations to common packages (e.g., ggplot2). The resulting data are plotted in Figure 2.
Figure 2. Number of citations per year for each statistics package, found by Scopus, from 2000 to 2018.
In Figure 2, we see that the citations of R in scholarly journals exceeded those of SPSS back in 2012. However, the scale of Figure 2 tops out at 30,000 while Figure 1’s scale peaks at 300,000. Google is finding a lot more documents! So, which of these software packages is used the most in scholarly work? Good question! We would like to hear your comments below, especially from readers who collect data from other sources.
In my ongoing quest to track The Popularity of Data Science Software, I’ve just updated my analysis of the job market. To save you from reading the entire tome, I’m reproducing that section here.
Job Advertisements
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what is likely to become more popular in the future.
Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As their co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities. It used to have a job trend plotter, but that tool has apparently been shut down.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g. SPSS, Apache Spark) while others are used in data science jobs and more broadly in report-writing jobs (e.g. SAS, Tableau). General-purpose languages (e.g. Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that use them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the graphs in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on May 27, 2019 and February 24, 2017. One might think that a sample taken on a single day might not be very stable, but the large number of job sources makes the counts in Indeed.com’s collection of jobs quite consistent. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002.
Figure 1a shows that Python is in the lead with 27,374 jobs, followed by SQL with 25,877. Java and Amazon’s Machine Learning (ML) tools are roughly 25% further below, with jobs in the 17,000s. R and the C variants come next with around 13,000. People frequently compare R and Python, but when it comes to getting a data science job, there are only half as many for R as for Python. That doesn’t mean they’re the same sort of job, of course. I still see more statisticians using R and machine learning people preferring Python, but Python is definitely on a roll! From Hadoop on down, there is a slow decline in jobs. R is also frequently compared to SAS, which has only 8,123 compared to R’s 13,800.
The scale of Figure 1a is so wide that the bottom package, H2O, appears to be zero, when in fact there are 257 jobs for it.
Figure 1a. Number of data science jobs for the more popular software.
To let us compare the less popular software, I plotted them separately in Figure 1b. Mathematica and Julia are the leaders of this set, with around 219 jobs each. The ancient FORTRAN language is still hanging on to life with 195 jobs. The open source WEKA software and IBM’s Watson are next, with around 185 each. From XGBOOST on down, there is a fairly steady slow decline.
There are several tools that use a workflow interface: Enterprise Miner, KNIME, RapidMiner, and SPSS Modeler. They’re all around the same area between 50 and 100 jobs. In many of the other measures of popularity, RapidMiner beats the very similar KNIME tool, but here there are 50% more jobs for the latter. Alteryx is also a workflow-based tool, however, it has pulled away from the pack, appearing back on Figure 1a with 901 jobs.
Figure 1b. Number of jobs for less popular data science software tools, those with fewer than 250 advertisements.
When interpreting the scale on Figure 1b, what looks like zero is indeed zero. From Systat on down, none of the packages have more than 10 job listings.
It’s important to note that the values shown in Figures 1a and 1b are single points in time. The number of jobs for the more popular software does not change much from day to day. Therefore, the relative rankings of the software shown in Figure 1a are unlikely to change much over the coming year or two. The less popular packages shown in Figure 1b have such low job counts that their ranking is more likely to shift from month to month, though their position relative to the major packages should remain more stable.
Next, let’s look at the change in jobs from the 2017 data to now (2019). Figure 1c shows the percent change for those packages that had at least 100 job listings back in 2017. Without such a limitation, software that goes from 1 job in 2017 to 5 jobs in 2019 would have a 400% increase, but still would be of little interest. Software whose job market is heating up, or growing, is shown in red, while those that are cooling down are shown in blue.
Figure 1c. Percent change in job listings from 2017 to 2019. Only software that had at least 100 jobs in 2017 is shown.
Tensorflow, the deep learning software from Google, is the fastest growing at 523%. Next is Apache Flink, a tool that analyzes streaming data, at 289%. H2O is next, with 150% growth. Caffe is another deep learning framework and its 123% growth reflects the popularity of artificial intelligence algorithms.
Python shows “only” 97% growth, but its popularity was already so high that the 13,471 jobs that it added surpass the total jobs of many of the other packages!
Tableau is showing a similar rate of growth, though it was a comparatively small number of additional jobs, at 4,784.
From the Julia language on down, we see a slowing decrease in growth. I’m surprised to see that jobs for SAS and SPSS are still growing, though barely at 6% and 1%, respectively.
If you enjoyed reading this article, you might be interested in my recent series of reviews on point-and-click front-ends for the R language. I invite you to subscribe to this blog, or follow me on Twitter.
In my never-ending quest to track The Popularity of Data Science Software, it’s time to update the section on Scholarly Articles. The rapid growth of R could not go on forever and, as you’ll see below, its use actually declined over the last year.
Scholarly Articles
Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant amounts of effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even as an object of study.
Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I collect data again for the previous years (with one exception noted below).
Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 1,700 articles) in the most recent complete year, 2018. To allow ample time for publication, insertion into online databases, and indexing, the data was collected on 3/28/2019.
Figure 2a. The number of scholarly articles found on Google Scholar, for data science software. Only those with more than 1,700 citations are shown.
SPSS is by far the most dominant package, as it has been for over 20 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. It offers extreme power, though with less ease of use. SAS is in third place, with a slight lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied.
Note that the general-purpose languages: C, C++, C#, FORTRAN, Java, MATLAB, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.
The next group of packages goes from Python through C, with usage declining slowly. The next set starts at Caffe, dropping nearly 50%, and continuing to IBM Watson with a slow decline.
The last two packages in Fig 2a are Weka and Theano, which are quite a drop from IBM Watson, though it’s getting harder to see as the lines shrink.
To continue on this scale would make the remaining packages all appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 1,700 rather than the 80,000 used on Figure 2a.
Figure 2b. Number of scholarly articles using each data science software found using Google Scholar. Only those with fewer than 1,700 citations are shown.
I chose to begin Figure 2b with software that has fewer than 1,700 articles because it allows us to see RapidMiner and KNIME on the same scale. They are both workflow-driven tools with very similar capabilities. This plot shows RapidMiner with 49% greater usage than KNIME. RapidMiner uses more marketing, while KNIME depends more on word-of-mouth recommendations and a more open source model. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, IBM’s SPSS and SAS. Given that SPSS has roughly 50 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newer packages is growing, while the use of the older ones is shrinking quite rapidly.
Figure 2b also lets us see IBM’s SPSS Modeler, SAS Enterprise Miner, and Alteryx on the same plot. These three are also workflow-driven tools which are quite expensive. None are doing as well here as RapidMiner or KNIME, tools that are much less expensive – or free – depending on how you use them (KNIME desktop is free but server is not; RapidMiner is free for analyzing fewer than 10,000 cases).
Another interesting comparison on Figure 2b is JASP and jamovi. Both are open-source tools that focus on statistics rather than machine learning or artificial intelligence. They both use graphical user interfaces (GUIs) in a style that is similar to SPSS. Both also use R behind the scenes to do their calculations. JASP emphasizes Bayesian Analysis and hides its R code; jamovi has a more frequentist orientation, it lets you see its R code, and it lets you execute your own R code directly from within it. JASP currently has nine times as many citations here, though jamovi’s use is growing much more rapidly.
Even newer on the GUI for R scene is BlueSky Statistics, which doesn’t appear on the plot at all since it has zero scholarly articles so far. It was created by a new company and only adopted an open source model a few months ago.
While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time-consuming. What I’ve done instead is collect data only for the past two complete years, 2017 and 2018. This provides the data needed to study year-over-year changes.
Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side); the declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 1,000 articles in 2015. A package that grows from 1 article to 5 shows 400% growth but is still of little interest.
Figure 2c. Change in Google Scholar citation rate in the most recent complete two years, 2017 and 2018.
The recent changes in data science software can be summarized succinctly: AI/ML up; statistics down. The software that is growing contains none of the packages that are associated more with statistical analysis. The software in decline is dominated by the classic packages of statistics: SPSS Statistics, SAS, GraphPad Prism, Stata, Statgraphics, R, Statistica, Systat, and Minitab. JMP is the only traditional statistics package whose scholarly usage is growing. Of the machine learning software that’s declining in usage, there are rough equivalents that are growing (e.g. Mahout down, Spark up).
Of course another summary is: cheap (or free) up; expensive down. Of the growing packages, 13 out of 17 are available in open source. Of those in decline, only 5 out of 13 are open source.
Statistics software has been around much longer than AI/ML software, having started back in the days before open source. Stat vendors have been adding AI/ML methods to their software, making them the more comprehensive solutions. The AI/ML vendors or projects are missing an opportunity to add more comprehensive statistics capabilities. Some, such as RapidMiner and KNIME, are expanding in this direction, but very slowly.
At the top of Figure 2c, we see that the deep learning packages Keras and TensorFlow are the fastest growing at nearly 150%. PyTorch is not shown here because it did not have enough usage in the previous year. However, its citation rate went from 616 to 4,670, a substantial 658% growth rate! There are other packages that are not shown here, including JASP with 223% growth, and jamovi with 720% growth. Despite such high growth, the latter still only has 108 citations in 2018. The rapid growth of JASP and jamovi lend credence to the perspective that the overall pattern of change shown in Figure 2c may be more of a result of free vs. expensive software. Neither of them offers any AI/ML features.
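For readers who want to check growth figures like these, here is a minimal sketch of the year-over-year calculation. The function names and the `ExampleTool` counts are hypothetical, made up for illustration; only the PyTorch counts (616 and 4,670) come from the text, and the 1,000-article cutoff follows the filter described for Figure 2c.

```python
def percent_change(previous: int, current: int) -> float:
    """Year-over-year change as a percentage of the earlier year's count."""
    if previous <= 0:
        raise ValueError("previous count must be positive")
    return (current - previous) / previous * 100.0


def yearly_changes(counts, min_previous=1000):
    """Return {package: percent change}, skipping packages below the cutoff.

    counts maps a package name to (previous_year_count, current_year_count).
    """
    return {
        name: percent_change(prev, cur)
        for name, (prev, cur) in counts.items()
        if prev >= min_previous
    }


# PyTorch's counts are the ones quoted in the text.
pytorch_growth = percent_change(616, 4670)
print(f"PyTorch: {pytorch_growth:.0f}%")  # PyTorch: 658%

# ExampleTool is a made-up package that clears the cutoff.
plotted = yearly_changes({"PyTorch": (616, 4670), "ExampleTool": (2000, 2500)})
print(plotted)  # {'ExampleTool': 25.0}
```

PyTorch drops out of the filtered result because its earlier count falls below the cutoff, which matches why it is missing from the plot despite its rapid growth.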
Scikit Learn, the Python machine learning library, was a fast grower with a 60% increase.
I was surprised to see IBM Watson growing a healthy 34% as much of the news about it has not been good. It’s awesome at Jeopardy though!
In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot, we see that KNIME is growing slightly (5.7%) while RapidMiner is declining slightly (1.8%).
The biggest losers in Figure 2c are SPSS, down 39%, and SAS, Prism, and Mahout, all down 24%. Even R is down 13%. Recall that Figure 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use, and R and SAS are still the #2 and #3 most widely used packages in this arena.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.
Figure 2d. The number of Google Scholar citations for each classic statistics package per year from 1995 through 2016.
SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.
In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.
Figure 2e. The number of Google Scholar citations for each classic statistics package from 1995 through 2016, this time with SPSS removed and SAS included only in 2014 and 2015. Removing SPSS and most of the SAS data expands the scale, making it easier to see the rapid growth of the less popular packages.
Figure 2e makes it easy to see that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 58 out of over 100 data science tools.
While Figures 2d and 2e show the historical trend that ended in 2016, Figure 2f shows a fresh set of data collected in March, 2019. Since Google’s algorithm changes, preventing the new data from matching exactly with the old, this new data starts at 2015 so the two sets overlap. SPSS is not shown on this graph because its dominance would compress the y-axis, making trends in the others harder to see. However, keep in mind that despite SPSS’ 39% drop from 2017 to 2018, its use is still 66% higher than R’s in 2018! Apparently people are willing to pay for ease of use.
Figure 2f. The number of Google Scholar citations for each classic statistics package per year from 2015 through 2018.
In Figure 2f we can see that the downward trends of SAS, Prism, and Statistica are continuing. We also see that the long and rapid growth of R and Stata has come to an end. Growth that rapid can’t go on forever. It will be interesting to see next year whether this is merely a flattening of usage or the beginning of a declining trend. As I pointed out in my book, R for Stata Users, there are many commonalities between R and Stata. As a result of this, and the fact that R is open source, I expect R use to stabilize at this level while use of Stata continues to slowly decline.
SPSS’ long-term rapid decline has to level out at some point. Its share has been chipped away by many competitors. However, until recently these competitors have either been free and code-based such as R, or menu-based and proprietary, such as Prism. With the fairly recent arrival of JASP, jamovi, and BlueSky Statistics, SPSS now faces software that is both free and menu-based. Previous projects to add menus to R, such as the R Commander and Deducer, were also free and open source, but they required installing R separately and then using R code to activate the menus.
These results apply to scholarly articles in general. The results in specific fields or journals are very likely to be different.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the job advertisements that list science software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
Update: an earlier version of this post included figures that I’ve removed at the request of Forrester, Inc.
In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I describe Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.
Forrester Research, Inc. is a leading global research and advisory firm that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.
Historically, Forrester has conducted their analyses similarly to Gartner’s. That approach compares point-and-click style software, like KNIME, to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports.
The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018 covers software that is controllable by various means such as menus, workflows, wizards, or code (as of 23/22/2019 available free here). Forrester plans to cover tools for automated modeling in a separate report, due out in 2019. Given that automation is now a widely adopted feature of the several companies covered in this report, that seems like an odd approach.
Forrester divides the vendors into four categories: Leaders, Strong Performers, Contenders, and Challengers.
In the Leaders category, they include IBM, while Gartner viewed them as a middle-of-the-pack Visionary. Forrester and Gartner both view SAS and RapidMiner as leaders.
The Strong Performers category includes KNIME, which Gartner considered a Leader. Datawatch and Tibco are tied in this segment while Gartner had them far apart, with Datawatch put in very last place by Gartner. Forrester has KNIME and SAP next to each other in this category, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a similar rating to Gartner.
The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. Fico is here too; Gartner did not evaluate them.
Forrester’s Challengers segment includes World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither were considered by Gartner.
Forrester rates some of the notebook-based vendors very differently than Gartner. Here Domino Data Labs is a Leader while Gartner had them at the extreme other end of their plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength in this market is minimal.
In the Strong Performers category are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this category; neither were reviewed by Gartner. Cloudera is here as well; it too was left out by Gartner.
Forrester’s Contenders category contains Google, in a similar position compared to Gartner’s analysis. Anaconda is here too, in a position quite a bit higher than in Gartner’s plot.
The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.
As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2019 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the updated section:
IT Research Firms
IT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services and provide their analysis in reports that they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. The reports exclude open source software that has no specific company backing, such as R, Python, or jamovi. Even open source projects that do have company backing, such as BlueSky Statistics, are excluded if they have yet to achieve sufficient market adoption. However, they do cover how company products integrate open source software into their proprietary ones.
While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal companies that are distributing them. On the date of this post, DataRobot is offering free copies.
Gartner, Inc. is one of the research firms that write such reports. Out of the roughly 100 companies selling data science software, Gartner selected 17 which offered “cohesive software.” That software performs a wide range of tasks including data importation, preparation, exploration, visualization, modeling, and deployment.
Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3a shows the resulting “Magic Quadrant” plot for 2019, and 3b shows the plot for the previous year. Here I provide some commentary on their choices, briefly summarize their take, and compare this year’s report to last year’s. The main reports from both years contain far more detail than I cover here.
Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms from their 2019 report (plot done in November 2018, report released in 2019).
The Leaders quadrant is the place for companies whose vision is aligned with their customers’ needs and who have the resources to execute that vision. The further toward the upper-right corner of the plot, the better the combined score.
RapidMiner and KNIME reside in the best part of the Leaders quadrant this year and last. This year RapidMiner has the edge in ability to execute, while KNIME offers more vision. Both offer free and open source versions, but the companies differ quite a lot on how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner is limited by a cap on the amount of data that it can analyze (10,000 cases) and as they add new features, they usually come only via a commercial license with “difficult-to-navigate pricing conditions.” These two offer very similar workflow-style user interfaces and have the ability to integrate many open source tools into their workflows, including R, Python, Spark, and H2O.
Tibco moved from the Challengers quadrant last year to the Leaders this year. This is due to a number of factors, including the successful integration of all the tools they’ve purchased over the years, including Jaspersoft, Spotfire, Alpine Data, Streambase Systems, and Statistica.
SAS declined from being solidly in the Leaders quadrant last year to barely being in it this year. This is due to a substantial decline in its ability to execute. Given SAS Institute’s billions in revenue, that certainly can’t be a financial limitation. It may be due to SAS’ more limited ability to integrate as wide a range of tools as other vendors have. The SAS language itself continues to be an important research tool among those doing complex mixed-effects linear models. Those models are among the very few that R often fails to solve.
The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision.
Mathworks moved forward substantially in this quadrant due to MATLAB’s ability to handle unconventional data sources such as images, video, and the Internet of Things (IoT). It has also opened up more to open source deep learning projects.
H2O.ai is also in the Visionaries quadrant. This is the company behind the open source H2O software, which is callable from many other packages or languages including R, Python, KNIME, and RapidMiner. While its own menu-based interface is primitive, its integration into KNIME and RapidMiner makes it easy to use for non-coders. H2O’s strength is in modeling but it is lacking in data access and preparation, as well as model management.
IBM dropped from the top of the Visionaries quadrant last year to the middle. The company has yet to fully integrate SPSS Statistics and SPSS Modeler into its Watson Studio. IBM has also had trouble getting Watson to deliver on its promises.
Databricks improved both its vision and its ability to execute, but not enough to move out of the Visionaries quadrant. It has done well with its integration of open-source tools into its Apache Spark-based system. However, it scored poorly in the predictability of costs.
DataRobot is new to the Gartner report this year. As its name indicates, its strength is in the automation of machine learning, which broadens its potential user base. The company’s policy of assigning a data scientist to each new client gets them up and running quickly.
Google’s position could be clarified by adding more dimensions to the plot. Its complex collection of a dozen products that work together is clearly aimed at software developers rather than data scientists or casual users. Simply figuring out what they all do and how they work together is a non-trivial task. In addition, the complete set runs only on Google’s cloud platform. Performance on big data is its forte, especially problems involving image or speech analysis/translation.
Microsoft offers several products, but only its cloud-only Azure Machine Learning (AML) was comprehensive enough to meet Gartner’s inclusion criteria. Gartner gives it high marks for ease-of-use, scalability, and strong partnerships. However, it is weak in automated modeling and AML’s relation to various other Microsoft components is overwhelming (same problem as Google’s toolset).
Figure 3b. Last year’s Gartner Magic Quadrant for Data Science and Machine Learning Platforms (January, 2018)
Those in the Challengers quadrant have ample resources but less customer confidence in their future plans, or vision.
Alteryx dropped slightly in vision from last year, just enough to drop it out of the Leaders quadrant. Its workflow-based user interface is very similar to that of KNIME and RapidMiner, and it too gets top marks in ease-of-use. It also offers very strong data management capabilities, especially those that involve geographic data, spatial modeling, and mapping. It comes with geo-coded datasets, saving its customers from having to buy them elsewhere and figure out how to import them. However, it has fallen behind in cutting edge modeling methods such as deep learning, auto-modeling, and the Internet of Things.
Dataiku strengthened its ability to execute significantly from last year. It added better scalability to its ease-of-use and teamwork collaboration. However, it is also perceived as expensive with a “cumbersome pricing structure.”
Members of the Niche Players quadrant offer tools that are not as broadly applicable. These include Anaconda, Datawatch (includes the former Angoss), Domino, and SAP.
Anaconda provides a useful distribution of Python and various data science libraries. They provide support and model management tools. The vast army of Python developers is its strength, but lack of stability in such a rapidly improving world can be frustrating to production-oriented organizations. This is a tool exclusively for experts in both programming and data science.
Datawatch offers the tools it acquired recently by purchasing Angoss, and its set of “Knowledge” tools continues to get high marks on ease-of-use and customer support. However, it’s weak in advanced methods and has yet to integrate the data management tools that Datawatch had before buying Angoss.
Domino Data Labs offers tools aimed only at expert programmers and data scientists. It gets high marks for openness and ability to integrate open source and proprietary tools, but low marks for data access and prep, integrating models into day-to-day operations, and customer support.
SAP’s machine learning tools integrate into its main SAP Enterprise Resource Planning system, but its fragmented toolset is weak, and its customer satisfaction ratings are low.
To see many other ways to rate this type of software, see my ongoing article, The Popularity of Data Science Software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
It has been just a few months since I reviewed five free and open-source point-and-click graphical user interfaces (GUIs) to the R language. I plan to keep those reviews up to date as new features are added. BlueSky’s interface would be immediately familiar to anyone who had used SPSS, as its developers model it on that popular software. The BlueSky developers’ goal is to help people use R without having to learn to write computer code.
While the previous version of BlueSky offered dozens of fairly advanced modeling methods such as generalized linear models, random forests, and support vector machines, it lacked some simpler features. Version 5.40 (correction: this was previously listed as version 6.04) adds a dialog for logistic regression, which is essentially its glm dialog simplified to do only logistic regression.
The Multi-Way ANOVA output has also been greatly enhanced, with the addition of a wide range of contrasts from the emmeans package, support for all three types of sums of squares, and plots for both post-hoc main-effects comparisons and interaction plots like this one:
Users of RStudio will be pleased that BlueSky’s program editor now submits lines of code like RStudio does, so you can step your way through a program line-by-line by clicking the Run button repeatedly.
The new version also has functions to convert strings to dates and vice versa, which led me to realize I had totally missed the string and date functions that it already had. In the “Data> Compute” dialog, the functions for Arithmetic, Logical, Math, and String(1) are visible. But if you click on the “>>” arrow on the right, you’ll also see String (2), Conversion, Statistical, Random Numbers, and four different menus of Date functions.
The complete set of new features and bug fixes is below. You can read my full review of BlueSky here, and you can download the software for free here. I plan to write about new features in other R GUIs, so stay tuned to my blog or follow me on twitter where I announce new posts. Happy computing!
New Features
1) Added support for weighted datasets. This option is available in Data -> Set Weights. Once you specify the weighting variable we create a new dataset with rows replicated as defined in the weights. This is similar to what SPSS does internally. When you run frequencies, independent sample t-tests, graphics commands, statistical tests etc on the new dataset (in BlueSky Statistics) with the rows replicated you will see results identical to those in SPSS.
2) Added the option to specify a weighting variable in Linear and Logistic Regression. This allows an optional vector of weights to be used during the fitting process (i.e. the weighted least-squares solution).
3) Added support for logistic regression under ‘Model Fitting’. Once the model is built, you can score the dataset, optionally obtain a confusion matrix, model statistics, and a ROC curve by selecting the model and clicking score. This is available on the top right-hand corner of the main application window.
4) We have updated the “Multi Way Anova” dialog with following capabilities:
display contrasts.
interaction plots.
support for type I, type II and type III tests.
pairwise comparison.
5) New reshape dialog with simplified R syntax has been added, using the tidyr package.
6) The “Multi variable one sample T-Test” and “Multi variable independent sample T-Test(with factor)” have been updated to allow you to specify the alternative hypothesis.
7) Added capabilities to support date manipulations.
The “String to date” dialog allows you to convert string to POSIXct date class.
The “Date to string” dialog allows you to convert the date (POSIXct and Date class) to string.
8) Simplified the syntax for frequencies and factor analysis.
9) If you launch a second instance of BlueSky Statistics the message that gets displayed has been improved. NOTE: You can only have one BlueSky Statistics instance running at a time.
10) We have improved the ability to browse the contents of the output window in BlueSky Statistics. This can be accessed from the menu (Layout > Show Navigation Tree).
11) Items in the output window can now be deleted. To delete an item from the output, just right click on the item(table/text/graphics) and choose the “Delete” option.
12) To make the output visually more appealing, we have introduced an option to hide the R syntax that gets displayed in the output window. This is controlled by an option in the configuration window ( Tools > Configuration Settings > Output tab ). By default we hide the R syntax in the output window.
13) When the option to show R syntax in the output is turned on ( Tools > Configuration Settings > Output tab ) and you resize the output window, we wrap the R syntax so that it is always visible.
14) To run any line of R syntax just place the cursor on that line and hit the RUN button, you don’t have to select the entire line.
15) If you want to run your R script line by line, place your cursor on the first line and hit RUN, the cursor will automatically move to the next line and you can hit the RUN button again.
This feature will work for simple R syntax which does not span multiple lines.
Example 1: 3 lines below work
Example 2: 4 lines below will not work
16) Added helpful hints to indicate you have reached the beginning and end of a dataset when scrolling wide datasets. This is available on the paging controls on the bottom right-hand corner of the screen.
17) Application launch is now faster.
18) In the BlueSky Statistics syntax editor, just like a comma, the pipe (%>%) can be used to break a long R code statement.
19) When the application launches, we open a new blank dataset that has been populated with zeros. You can right click on a row to delete a row or go into ‘Data -> Delete Variables’ to delete variables.
20) Clicking on a variable name in the data grid sorts in ascending order by that variable name. Clicking again sorts in descending order.
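The weighted-dataset feature in item 1 above works by replicating rows, and that idea is easy to sketch. This is only an illustration in Python, not BlueSky's actual code: the function name, the column names, and the choices to require whole-number weights and to drop the weight column afterwards are all my assumptions.

```python
def expand_by_weight(rows, weight_key):
    """Repeat each row as many times as its integer weight says."""
    expanded = []
    for row in rows:
        weight = int(row[weight_key])
        if weight < 0:
            raise ValueError("weights must be non-negative")
        # Assumption: the weight column is dropped once rows are replicated.
        case = {k: v for k, v in row.items() if k != weight_key}
        expanded.extend(dict(case) for _ in range(weight))
    return expanded


# Hypothetical summarized data: each row stands for `n` identical cases.
summary = [
    {"group": "a", "score": 1, "n": 3},
    {"group": "b", "score": 5, "n": 1},
]
cases = expand_by_weight(summary, "n")
mean_score = sum(c["score"] for c in cases) / len(cases)
print(len(cases), mean_score)  # 4 2.0
```

Once the rows are replicated, ordinary unweighted statistics on the expanded data give the weighted answer, which is why frequencies and t-tests on the new dataset can match weighted results from other software.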
Bug Fixes
1) Fixed an issue with factor analysis when saving scores using the regression method.
2) Fixed an issue when re-editing factor levels that were previously changed in the variable grid. This would result in the dialog not functioning correctly and incorrect levels being added.
3) Fixed an issue when you closed the empty dataset that gets created when BlueSky Statistics is launched and then attempted to save open datasets- those datasets would not get saved correctly.
4) Fixed an issue that was limiting the number of variables that Factor Analysis can be run across.
5) Within block commands that use ‘local()’, cat(“\n”) can now be printed in the output to leave some extra spaces.
6) When you added a new factor variable and renamed the variable using the user interface and then tried to add new levels – this did not work and has been fixed.
7) Add new factor variable. Click in the cell where the new variable name is shown. Cell goes in edit mode. Now add new factor levels to this new variable. Switch to the data grid and select a different level. Switch back to ‘Variables’ tab and the application crashes. This has been fixed.
8) When any existing factor level name is modified using the user interface, a blank level automatically gets added. If you try to modify the level, it does not take effect. This has been addressed.
9) Disable data grid navigation buttons (on the lower right-hand side of the data grid) if there are less than 16 columns in the dataset.
10) Changing factor levels was not working because one or more levels had a single quote in the level name.
11) Aggregate control fixed: Text above the drop-down (that contains mean, median etc.) was getting chopped off. Similar issue with the label text above a textbox (which was almost at the bottom of the dialog).
12) Fixed significance codes for:
One Sample T Test and Independent Sample T Test
Multivariable one sample t-test and Multivariable Independent one sample t-test (with factor)
13) Fixed a defect: Select some syntax and hit RUN. After execution, the cursor goes to the top of the script. Now it moves to the next line.
14) Data grid navigation buttons are disabled if either end of the datagrid is reached. If there are no more columns on the right, the right navigation button is disabled. If there are no more columns on the left then left navigation button is disabled.
15) Left navigation tree in the output window is fixed for look and feel. It now has a cleaner look. To access left navigation, go to Layout -> Show navigation tree
16) In the R syntax editor, we now ignore square, curly and round brackets that appear inside single or double quotes. See example below:
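To illustrate the kind of check that last fix involves, here is a rough Python sketch of a bracket matcher that skips anything inside single or double quotes. It is not BlueSky's implementation; the function name is hypothetical, and it ignores R comments and other details a real editor would need to handle.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(code: str) -> bool:
    """Check bracket balance, ignoring brackets inside quoted strings."""
    stack = []
    quote = None      # the quote character that opened the current string
    escaped = False   # True right after a backslash inside a string
    for ch in code:
        if quote is not None:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
            continue  # everything inside a string is skipped
        if ch in ("'", '"'):
            quote = ch
        elif ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack and quote is None


print(brackets_balanced('paste("(", x)'))  # True
print(brackets_balanced('paste("(", x'))   # False
```

The first call succeeds because the parenthesis inside the quotes is never counted; the second fails because the real opening parenthesis is never closed.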
Most data sets are easy to enter using the following rules.
All your data should be in a single spreadsheet of a single file (for an exception to this rule, see Relational Data Sets below.)
Enter variable names in the first row of the spreadsheet.
Consider the length of your variable names. If you know for sure what software you will use, follow its rules for how many characters names can contain. When in doubt, use variable names that are no longer than 8 characters, beginning with a letter. Those short names can be used by any software.
Variable names should not contain spaces, but may use the underscore character.
No other text rows such as titles should be in the spreadsheet.
No blank rows should appear in the data.
Always include an ID variable on your original data collection form and in the spreadsheet to help you find the case again if you need to correct errors. You may need to sort the data later, after which the row number in Excel would then apply to a different subject or sampling unit, making it hard to find.
Position the ID variable in the left-most column for easy reference.
If you have multiple groups, put them in the same spreadsheet along with a variable that indicates group membership (see Gender example below).
Many statistics packages don’t work well with alphabetic characters representing categorical values. For example to enter political party, you might enter 1 instead of Democrat, 2 instead of Republican and 3 instead of Other.
Avoid the use of special characters in numeric columns. Currency signs ($, €, etc.) can cause trouble in some programs.
If your group has only two levels, coding them 0 and 1 makes some analyses (e.g. linear regression) much easier to do. If the data are logical, use 0 for false, and 1 for true.
If the data represent gender, it’s common to use 0 for female, 1 for male.
For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, some software (like R) will read the column as character data so you will not be able to, for example, calculate the mean of a column without taking action to address the situation.
You can enter dates with slashes (8/31/2018) and times with colons (12:15 AM). Note that dates are recorded differently across countries, so make sure you are using a format that matches your locale.
For text analysis, you can enter up to 32K of text, or about 8 pages, in a single cell. However, if you cut & paste it from elsewhere, remove carriage returns first as they will cause it to jump to a new cell.
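The rule about missing values can be checked with a minimal sketch in R. The two tiny CSV texts below are made up for illustration; they stand in for a worksheet saved as CSV, once with a period typed into a cell and once with that cell left blank.

```r
# Hypothetical data: one score typed as a period, as SPSS or SAS would display it
csv_with_period <- "id,score
1,10
2,.
3,30"

# The same data with the missing cell left blank
csv_with_blank <- "id,score
1,10
2,
3,30"

period_data <- read.csv(text = csv_with_period)
blank_data  <- read.csv(text = csv_with_blank)

# The period keeps the column from being read as numbers
# (character in current R, a factor in versions before 4.0)
class(period_data$score)

# The blank cell became a true missing value (NA) in a numeric column
class(blank_data$score)
mean(blank_data$score, na.rm = TRUE)

# If periods are already in the file, declare them as missing when reading
fixed_data <- read.csv(text = csv_with_period, na.strings = ".")
mean(fixed_data$score, na.rm = TRUE)
```

Leaving the cell blank avoids the extra na.strings step entirely, which is why the rule recommends it.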
Relational Data Sets
Some data sets contain observations that are related in some way. They may be people who all live in the same home, or samples that all came from the same site. There may be higher levels of relations, such as students within classrooms, then classrooms within schools. Data that contains such relations (a.k.a. nesting) may be stored in a “relational” database, but those are harder to learn than spreadsheet software. Relational data can easily be entered as two or more spreadsheets and combined later during data analysis. This saves quite a lot of data entry as the higher level data (e.g. family house value, socio-economic status, etc.) only needs to be entered once, instead of on several lines (e.g. for each family member).
If you have such data, make sure that each data set contains a “key” variable that acts as a common ID number for family, site, school, etc. You can later read two files at a time and combine them matching on that key variable. R calls this combination a join or merge; SAS calls it a merge; and SPSS calls it Add Variables.
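As a rough sketch of that combination step in R, the example below uses two small made-up data frames in place of the two spreadsheets. The variable names (family_id, house_value, and so on) are assumptions chosen for illustration only.

```r
# Hypothetical higher-level data: one row per family
families <- data.frame(
  family_id   = c(1, 2),
  house_value = c(250000, 180000)
)

# Hypothetical lower-level data: one row per family member
members <- data.frame(
  id        = c(101, 102, 103),
  family_id = c(1, 1, 2),
  age       = c(34, 36, 52)
)

# Combine the two, matching on the key variable family_id.
# Each member's row picks up the house value entered once for the family.
combined <- merge(members, families, by = "family_id")
combined
```

The house value was typed once per family, yet it ends up on every family member's row after the merge.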
Example of a Good Data Structure
This data set follows all the rules for simple data sets above. Any statistics software can read it easily.
Example of a Bad Data Structure
This is the same data shown above, but it violates the rules for simple data sets in several ways: there is no column for gender, the income values contain dollar signs and commas, variable names appear on more than one line, variable names are not even consistent (income vs. salary), and there is a blank line in the middle. This would not be easy to read!
Data for Female Subjects
Data for Male Subjects
Excel Tips for Data Entry
You can make sure your variable names are always visible at the top of your Excel spreadsheet by choosing View> Freeze Panes> Freeze Top Row. This helps you enter data in the proper columns.
Avoid using Excel to sort your data. It’s too easy to sort one column independent of the others, which destroys your data! Statistics packages can sort data and they understand the importance of keeping all the values in each row locked together.
If you need to enter a pattern of consecutive values such as an ID number with values such as 1,2,3 or 1001,1002,1003, enter the first two, select those cells, then drag the tiny square in the lower right corner as far downward as you wish. Excel will see the pattern of the first two entries and extend it as far as you drag your selection. This works for days of the week and dates too. You can create your own lists in Options>Lists, if you use a certain pattern often.
To help prevent typos, you can set minimum and maximum values, or create a list of valid values. Select a column or set of similar columns, then go to the Data tab, then the Data Tools group, and choose Validation. To set minimum and maximum values, choose Allow: Whole Number or Decimals and then fill in the values in the Minimum and Maximum boxes. To create a list of valid values, choose Allow: List and then fill in the numeric or character values separated by commas in the Source box. Note that these rules only operate as you enter data; they will not help you find improper values that you have already entered.
The gold standard for data accuracy is the dual entry method. With this method you actually enter all the data twice. Only this method can catch errors that are within the normal range of values, but still wrong. Excel can show you where the values differ. Enter the data first in Sheet1. Then enter it again using the exact same layout in Sheet2. Finally, in Sheet1 select all cells using CTRL-A. Then choose Conditional Formatting> New Rule. Choose “Use a formula to determine which cells to format,” enter a formula that compares each cell with its counterpart in Sheet2, for example =A1<>Sheet2!A1,
then click the Format button, make sure the Fill tab is selected, and choose a color. Then click OK twice. The inconsistencies between the two sheets will then be highlighted in Sheet1. You then check to see which entry was wrong and fix it. When you read the data into a statistics package, you will only need to read the data in Sheet1.
When looking for data errors, it can be very helpful to display only a subset of values. To do this, select all the columns you wish to scan for errors, then click the Filter icon on the Data tab. A downward-pointing triangle will appear at the top of each column selected. Clicking it displays a list of the values contained in that column. If you have entered values that are supposed to be, for example, between 1 and 5 and you see 6 on this list, choosing it will show you only those rows in which you made that error. Then you can fix them. You can also click on Number Filters to use simple logic to find, for example, all rows with values greater than 5. When you are finished, click on the filter icon again to turn it off.
Save your data frequently and make backup copies often. Don’t leave all your backup copies connected to a computer which would leave them vulnerable to attack by viruses. Don’t store them all in the same building or you risk losing all your hard work in a fire or theft. Get a free account at http://drive.google.com, http://dropbox.com, or http://onedrive.live.com and save copies there.
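The dual entry comparison described in the tips above can also be done after the fact in R. This is only a sketch under stated assumptions: the two data frames are made up, they stand in for the two sheets after being read into R, and both must have the same layout with more than one row. The helper name cells_agree is hypothetical.

```r
# Hypothetical first and second entry of the same three cases
entry1 <- data.frame(id = 1:3, age = c(34, 36, 52), income = c(41000, 38500, 60250))
entry2 <- data.frame(id = 1:3, age = c(34, 63, 52), income = c(41000, 38500, 60250))

# Both entries must share the same layout for a cell-by-cell comparison
stopifnot(identical(dim(entry1), dim(entry2)),
          identical(names(entry1), names(entry2)))

# Two cells agree if both are blank (NA) or both hold the same value
cells_agree <- function(a, b) {
  (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
}

# Compare column by column; the result is a TRUE/FALSE matrix of disagreements
differs <- !mapply(cells_agree, entry1, entry2)

# List the row number and variable name of every disagreement
mismatches <- which(differs, arr.ind = TRUE)
data.frame(
  row      = mismatches[, "row"],
  variable = colnames(differs)[mismatches[, "col"]]
)
```

Here the transposed digits in age (36 versus 63) are both plausible values, so a range check would miss them, but the comparison flags row 2.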
Steps for Reading Excel Data Into R
There are several ways to read an Excel file into R. Perhaps the easiest method uses the following commands. They read an Excel file named mydata.xlsx into an R data frame called mydata. For examples on how to read many other file formats into R, see: http://r4stats.com/examples/data-import/.
# Do this once to install:
install.packages("readxl")
# Each time you read a file, follow these steps:
library("readxl")
mydata <- read_excel("mydata.xlsx")
Steps for Reading Excel Data Into SPSS
In SPSS, choose File> Open> Data.
Change the “Files of type” box to “Excel (*.xlsx)”.
When the Read Excel File box appears, select the Worksheet name and check the box for Read variable names from the first row of data, then click OK.
When the data appears in the SPSS data editor spreadsheet, choose File> Save As and leave the Save as type box set to SPSS (*.sav).
Enter the name of the file without the .sav extension and then click Save to save the file in SPSS format.
Next time, open the .sav version and you won’t need to convert the file again.
If you create variable or value labels in the SPSS file and then need to read your data from Excel again you can copy them into the new file. First, make sure you use the same variable names. Next, after opening the file in SPSS, use Copy Data Properties from the Data menu. Simply name the SPSS file that has properties (such as labels) that you want to copy, check off the things you want to copy and click OK.
Steps for Reading Excel Data Into SAS
The code below will read an Excel file called mydata.xlsx and store it as a permanent SAS dataset called sasuser.mydata. If your organization is considering migrating from SAS to R, I offer some tips here: http://r4stats.com/articles/migrate-to-r/.
Steps for Reading Excel Data Into jamovi
At the moment, jamovi can open CSV, JASP, SAS, SPSS, and Stata files, but not Excel. So you must open the data in Excel and Save As a comma-separated values (CSV) file. The ability to read Excel files should be added to a release in the near future. For more information about the free and open source jamovi software, see my review here: http://r4stats.com/2018/02/13/jamovi-for-r-easy-but-controversial/.
More to Come
If you found this post useful, I invite you to check out many more on my website or follow me on Twitter where I announce my blog posts.
[An updated version of this post is located here.]
jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.
The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally, since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.
Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).
jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.
Alternately, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts it immediately back to numeric. I missed some features such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.
To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.
The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:
You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:
Interaction plot from jamovi using the “Hadley” style. Note how it offsets the confidence intervals for each workshop automatically to make them easier to read when they overlap.
jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which, oddly enough, adds a grid at the major tick points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.
At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.
The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.
jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections. However, the settings in the dialog boxes are not currently saved from session to session. Since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains: the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).
While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer reusability in a graphical form.
As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.
Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.
Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.
If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions would look like this:
> t.test(pretest ~ gender, data = mydata100)
Welch Two Sample t-test
data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.810931 1.403879
sample estimates:
mean in group Female mean in group Male
74.60417 75.30769
> wilcox.test(pretest ~ gender, data = mydata100)
Wilcoxon rank sum test with continuity correction
data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0
> t.test(posttest ~ gender, data = mydata100)
Welch Two Sample t-test
data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.365939 1.853119
sample estimates:
mean in group Female mean in group Male
81.66667 82.42308
> wilcox.test(posttest ~ gender, data = mydata100)
Wilcoxon rank sum test with continuity correction
data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0
The same comparison using the jamovi GUI, or its jmv package, would look like this:
Output from jamovi or its jmv package.
Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:
jmv::ttestIS(
    data = mydata100,
    vars = c("pretest", "posttest"),
    group = "gender",
    mann = TRUE,
    meanDiff = TRUE)
In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
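For readers unfamiliar with R’s piecemeal style, here is a minimal sketch of those five steps in base R. It uses the mtcars data set that ships with R rather than any data from this post, and the model itself (mpg predicted from wt and hp) is just an assumption chosen for illustration.

```r
# mtcars ships with R; the model here is purely illustrative
fit <- lm(mpg ~ wt + hp, data = mtcars)  # 1. estimate the equation parameters

summary(fit)                             # 2. t-tests on the parameters
anova(fit)                               # 3. anova table

predicted <- predict(fit)                # 4. predicted values for each car
head(predicted)

par(mfrow = c(2, 2))                     # arrange the four diagnostic plots on one page
plot(fit)                                # 5. diagnostic plots
```

Each step is a separate call on the saved fit object, which is flexible for programmers but gives menu designers many separate pieces to expose.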
The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.
jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.
jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.
As we have seen, jamovi’s approach has simplified its menus, and R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), you can install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given their overall goal.
Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.
A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.
Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.
Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one I contributed to it) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).
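To make the idea of correcting p-values concrete, here is a small sketch using base R’s p.adjust function. The four p-values are made up, standing in for t-tests run on four variables from a single dialog box, and the Holm method is my choice for the example rather than a statement of what any particular GUI uses.

```r
# Hypothetical unadjusted p-values from t-tests on four variables
p_values <- c(var1 = 0.012, var2 = 0.030, var3 = 0.041, var4 = 0.200)

# Holm's method controls the chance of any false positive across all four tests
adjusted <- p.adjust(p_values, method = "holm")

# Show the raw and adjusted values side by side
data.frame(raw = p_values, holm = adjusted)

# Count how many tests stay below 0.05 before and after adjustment
sum(p_values < 0.05)
sum(adjusted < 0.05)
```

Three of the raw p-values fall below 0.05, but only one survives the adjustment, which is the kind of difference such a feature would bring to users' attention.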
Though jamovi has not yet reached version 1.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug I described above when discussing the export of graphics. The developers consider jamovi to be “production ready” and a number of universities are already using it in their undergraduate statistics programs.
In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combine the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!
Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.