Are you attending this year’s Joint Statistical Meetings in Toronto? If so, stop by booth 404 to see the latest features of BlueSky Statistics. A menu-based graphical user interface for the R language, BlueSky lets people access the power of R without having to learn to program. Programmers can easily add code to BlueSky’s menus, sharing their expertise with non-programmers. My detailed review of BlueSky is here, a brief comparison to other R GUIs is here, and the BlueSky User Guide is here. I hope to see you in Toronto!
I’ve updated The Popularity of Data Science Software‘s market share estimates based on scholarly articles. I posted it below, so you don’t have to sift through the main article to read the new section.
Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool or even as an object of study.
Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles.
Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 4,500 articles) in the most recent complete year, 2022.
SPSS is the most popular package, as it has been for over 20 years. This may be due to its balance between power and its graphical user interface’s (GUI) ease of use. R is in second place with around two-thirds as many articles. It offers extreme power, but as with all languages, it requires memorizing and typing code. GraphPad Prism, another GUI-driven package, is in third place. The packages from MATLAB through TensorFlow are roughly at the same level. Next come Python and Scikit Learn. The latter is a library for Python, so there is likely much overlap between those two. Note that the general-purpose languages (C, C++, C#, FORTRAN, Java, MATLAB, and Python) are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest. Old stalwart FORTRAN appears last in this plot. While its count seems close to zero, that’s due to the wide range of this scale; its count is just over the 4,500-article cutoff for this plot.
Continuing on this scale would make the remaining packages appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 4,500 rather than the 110,000 used in Figure 2a. I chose that cutoff value because it allows us to see two related sets of tools on the same plot: workflow tools and GUIs for the R language that make it work much like SPSS.
JASP and jamovi are both front-ends to the R language and are way out front in this category. The next R GUI is R Commander, with half as many citations. Still, that’s far more than the rest of the R GUIs: BlueSky Statistics, Rattle, RKWard, R-Instat, and R AnalyticFlow. While many of these have low counts, we’ll soon see that the use of nearly all is rapidly growing.
Workflow tools are controlled by drawing 2-dimensional flowcharts that direct the flow of data and models through the analysis process. That approach is slightly more complex to learn than SPSS’ simple menus and dialog boxes, but it gets closer to the complete flexibility of code. In order of citation count, these include RapidMiner, KNIME, Orange Data Mining, IBM SPSS Modeler, SAS Enterprise Miner, Alteryx, and R AnalyticFlow. From RapidMiner to KNIME, to SPSS Modeler, the citation rate approximately cuts in half each time. Orange Data Mining comes next, at around 30% less. KNIME, Orange, and R AnalyticFlow are all free and open-source.
While Figures 2a and 2b help study market share now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each software, but collecting that much data is too time-consuming. Instead, I’ve collected data only for the years 2019 and 2022. This provides the data needed to study growth over that period.
Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side) and the declining or “cooling” ones shown in blue (left side).
Seven of the 14 fastest-growing packages are GUI front-ends that make R easy to use. BlueSky’s actual percent growth was 2,960%, which I recoded as 220% because the original value made the rest of the plot unreadable. In 2022 the company released a Mac version, and the Mayo Clinic announced its migration from JMP to BlueSky; both likely had an impact. Similarly, jamovi’s actual growth was 452%, which I recoded to 200%. One reason the R GUIs were able to achieve such high percentages of change is that they all started from low numbers compared to most of the other software. So be sure to check Figure 2b for the raw counts of all the R GUIs.
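To make the recoding step concrete, here is a minimal Python sketch of the idea. It uses only the two growth figures quoted above; the function name `recode_for_display` and the dictionary layout are illustrative assumptions, not the code used to build Figure 2c.

```python
# Actual percent growth values quoted in the text.
actual_growth = {"BlueSky Statistics": 2960.0, "jamovi": 452.0}

# Display values chosen by hand so the outliers still fit on the plot.
display_overrides = {"BlueSky Statistics": 220.0, "jamovi": 200.0}


def recode_for_display(actual, overrides):
    """Return plot-ready values, swapping in an override where one exists."""
    return {name: overrides.get(name, value) for name, value in actual.items()}


display_growth = recode_for_display(actual_growth, display_overrides)

# The recoding shrinks the gaps but should leave the ranking unchanged.
rank_actual = sorted(actual_growth, key=actual_growth.get, reverse=True)
rank_display = sorted(display_growth, key=display_growth.get, reverse=True)
assert rank_actual == rank_display

for name in rank_display:
    print(f"{name}: plotted at {display_growth[name]:.0f}%, "
          f"actual {actual_growth[name]:,.0f}%")
```

Keeping the actual values alongside the display values means the true figures can still be reported in the text, as they are here.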
The most impressive point on this plot is the one for PyTorch. Back in Figure 2a, we saw that PyTorch was the fifth most popular tool for data science. Here we see it’s also the third fastest growing. Being big and growing fast is quite an achievement!
Of the workflow-based tools, Orange Data Mining is growing the fastest. There is a good chance that the next time I collect this data Orange will surpass SPSS Modeler.
The big losers in Figure 2c are the expensive proprietary tools: SPSS, GraphPad Prism, SAS, BMDP, Stata, Statistica, and Systat. However, open-source R is also declining, perhaps a victim of Python’s rising popularity.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d, I have plotted the same scholarly-use data for 1995 through 2016.
SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009, and its use is in sharp decline. SAS never came close to SPSS’s level of dominance, and its usage peaked around 2010. GraphPad Prism followed a similar pattern, though it peaked a bit later, around 2013.
In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.
Figure 2e shows that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. The decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this graph.
These results apply to scholarly articles in general. The results in specific fields or journals are likely to differ.
You can read the entire Popularity of Data Science Software here; the above discussion is just one section.
I have recently updated my extensive analysis of the popularity of data science software. This update covers perhaps the most important section, the one that measures popularity based on the number of job advertisements. I repeat it here as a blog post, so you don’t have to read the entire article.
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what will become more popular in the future.
Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As their co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g., scikit-learn, Apache Spark), while others are used in data science jobs and, more broadly, in report-writing jobs (e.g., SAS, Tableau). General-purpose languages (e.g., Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that require them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the results in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on October 5, 2022. To measure percent change, I compare that to data collected on May 27, 2019. One might think that a sample taken on a single day would not be very stable, but such samples are. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002. I occasionally double-check some counts a month or so later and always get similar figures.
The number of jobs covers a very wide range from zero to 164,996, with a mean of 11,653.9 and a median of 845.0. The distribution is so skewed that placing them all on the same graph makes reading values difficult. Therefore, I split the graph into three, each with a different scale. A single plot with a logarithmic scale would be an alternative, but when I asked some mathematically astute people how various packages compared on such a plot, they were so far off that I dropped that approach.
Figure 1a shows the most popular tools, those with at least 10,000 jobs. SQL is in the lead with 164,996 jobs, followed by Python with 150,992 and Java with 113,944. Next comes a set from C++/C# at 48,555, slowly declining to Microsoft’s Power BI at 38,125. Tableau, one of Power BI’s major competitors, is in that set. Next come R and SAS, both around 24K jobs, with R slightly in the lead. Finally, we see a set slowly declining from MATLAB at 17,736 to Scala at 11,473.
Figure 1b covers tools for which there are between 250 and 10,000 jobs. Alteryx and Apache Hive are at the top, both with around 8,400 jobs. There is quite a jump down to Databricks at 6,117 then much smaller drops from there to Minitab at 3,874. Then we see another big drop down to JMP at 2,693 after which things slowly decline until MLlib at 274.
The least popular set of software, those with fewer than 250 jobs, is displayed in Figure 1c. It begins with DataRobot and SAS’ Enterprise Miner, both near 182. That’s followed by Apache Mahout with 160, WEKA with 131, and Theano at 110. From RapidMiner on down, there is a slow decline until we finally hit zero at WPS Analytics. The latter is a version of the SAS language, so advertisements are likely to always list SAS as the required skill.
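The three-way split described above can be sketched in a few lines of Python. The job counts are the ones quoted in this section. The function name `assign_figure` is an illustrative assumption, and so is the choice to place a tool with exactly 250 jobs in Figure 1b.

```python
# Job counts quoted in the text (collected October 5, 2022).
job_counts = {
    "SQL": 164_996, "Python": 150_992, "Java": 113_944,
    "MATLAB": 17_736, "Scala": 11_473,
    "Databricks": 6_117, "Minitab": 3_874, "JMP": 2_693, "MLlib": 274,
    "Apache Mahout": 160, "WEKA": 131, "Theano": 110, "WPS Analytics": 0,
}


def assign_figure(jobs):
    """Pick the plot whose scale suits this job count."""
    if jobs >= 10_000:
        return "Figure 1a"   # at least 10,000 jobs
    if jobs >= 250:
        return "Figure 1b"   # 250 up to just under 10,000 jobs
    return "Figure 1c"       # fewer than 250 jobs


# Group the tools by figure, keeping the order in which they were listed.
panels = {}
for tool, jobs in job_counts.items():
    panels.setdefault(assign_figure(jobs), []).append(tool)

for figure, tools in panels.items():
    print(figure, "->", ", ".join(tools))
```

Fixed thresholds keep each panel’s scale readable, at the cost of making comparisons across panels harder than they would be on a single plot.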
Several tools use the powerful yet easy workflow interface: Alteryx, KNIME, Enterprise Miner, RapidMiner, and SPSS Modeler. The scale of their counts is too broad to make a decent graph, so I have compiled those values in Table 1. There we see Alteryx is extremely dominant, with 30 times as many jobs as its closest competitor, KNIME. The latter is around 50% greater than Enterprise Miner, while RapidMiner and SPSS Modeler are tiny by comparison.
Let’s take a similar look at packages whose traditional focus was on statistical analysis. They have all added machine learning and artificial intelligence methods, but their reputation still lies mainly in statistics. We saw previously that when we consider the entire range of data science jobs, R was slightly ahead of SAS. Table 2 shows jobs with only the term “statistician” in their description. There we see that SAS comes out on top, though with such a tiny margin over R that you might see the reverse depending on the day you gather new data. Both are over five times as popular as Stata or SPSS, and ten times as popular as JMP. Minitab seems to be the only remaining contender in this arena.
|Software||Jobs only for “Statistician”|
Next, let’s look at the change in jobs from the 2019 data to now (October 2022), focusing on software that had at least 50 job listings back in 2019. Without such a limitation, software that increased from 1 job in 2019 to 5 jobs in 2022 would have a 400% increase but still would be of little interest. Percent change ranged from -64.0% to 2,479.9%, with a mean of 306.3 and a median of 213.6. There were two extreme outliers, IBM Watson, with apparent job growth of 2,479.9%, and Databricks, at 1,323%. Those two were so much greater than the rest that I left them off of Figure 1d to keep them from compressing the remaining values beyond legibility. The rapid growth of Databricks has been noted elsewhere. However, I would take IBM Watson’s figure with a grain of salt, as its growth in revenue seems nowhere near what the Indeed.com job figure seems to indicate.
The remaining software is shown in Figure 1d, where those whose job market is “heating up” or growing are shown in red, while those that are cooling down are shown in blue. The main takeaway from this figure is that nearly the entire data science software market has grown over the last 3.5 years. At the top, we see Alteryx, with a growth of 850.7%. Splunk (702.6%) and Julia (686.2%) follow. To my surprise, FORTRAN follows, having gone from 195 jobs to 1,318, yielding growth of 575.9%! My supercomputing colleagues assure me that FORTRAN is still important in their area, but HPC is certainly not growing at that rate. If any readers have ideas on why this could occur, please leave your thoughts in the comments section below.
SQL and Java are both growing at around 537%. From Dataiku on down, the rate of growth slows steadily until we reach MLlib, which saw almost no change. Only two packages declined in job advertisements: WEKA at -29.9% and Theano at -64.1%.
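The percent-change calculation and the 50-job baseline filter used above can be sketched as follows. The FORTRAN counts are the ones quoted in the text; the function names and the `MIN_BASELINE_JOBS` constant are illustrative assumptions, not the original analysis code.

```python
# Minimum number of 2019 job listings required for a tool to be included.
MIN_BASELINE_JOBS = 50


def percent_change(old, new):
    """Percent change from old to new, relative to old."""
    if old <= 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (new - old) / old * 100


def growth_if_eligible(jobs_2019, jobs_2022, min_baseline=MIN_BASELINE_JOBS):
    """Return percent growth, or None when the 2019 baseline is too small."""
    if jobs_2019 < min_baseline:
        return None
    return percent_change(jobs_2019, jobs_2022)


# FORTRAN went from 195 jobs to 1,318 jobs.
fortran_growth = growth_if_eligible(195, 1_318)
print(f"FORTRAN: {fortran_growth:.1f}%")  # FORTRAN: 575.9%

# A tool with a tiny 2019 baseline is filtered out rather than plotted.
print(growth_if_eligible(1, 5))  # None
```

Returning `None` for small baselines, rather than a number, makes it harder to plot a misleadingly large percentage by accident.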
This wraps up my analysis of software popularity based on jobs. You can read my ten other approaches to this task at https://r4stats.com/articles/popularity/. Many of those are based on older data, but I plan to update them in the first quarter of 2023, when much of the needed data will become available. To receive notice of such updates, subscribe to this blog, or follow me on Twitter: https://twitter.com/BobMuenchen.
Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.
For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.
The book is available in print and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be possible through traditional publishers.
Below you will find a detailed table of contents and a list of the textbook authors.
Table of Contents
OVERVIEW OF BIOMEDICAL DATA SCIENCE
- Background and history
- Conflicting perspectives
- the statistician’s perspective
- the machine learner’s perspective
- the database administrator’s perspective
- the data visualizer’s perspective
- Data analytical processes
- raw data
- data pre-processing
- exploratory data analysis (EDA)
- predictive modeling approaches
- types of models
- types of software
- Major types of analytics
- descriptive analytics
- diagnostic analytics
- predictive analytics (modeling)
- prescriptive analytics
- putting it all together
- Biomedical data science tools
- Biomedical data science education
- Biomedical data science careers
- Importance of soft skills in data science
- Biomedical data science resources
- Biomedical data science challenges
- Future trends
SPREADSHEET TOOLS AND TIPS
- basic spreadsheet functions
- download the sample spreadsheet
- Navigating the worksheet
- Clinical application of spreadsheets
- formulas and functions
- sorting data
- freezing panes
- conditional formatting
- pivot tables
- data analysis
- Tips and tricks
- Microsoft Excel shortcuts – Windows users
- Google Sheets tips and tricks
- Measures of central tendency & dispersion
- the normal and log-normal distributions
- Descriptive and inferential statistics
- Categorical data analysis
- Diagnostic tests
- Bayes’ theorem
- Types of research studies
- observational studies
- interventional studies
- Linear regression
- Comparing two groups
- the independent-samples t-test
- the Wilcoxon-Mann-Whitney test
- Comparing more than two groups
- Other types of tests
- generalized tests
- exact or permutation tests
- bootstrap or resampling tests
- Stats packages and online calculators
- commercial packages
- non-commercial or open source packages
- online calculators
- Future trends
- historical data visualizations
- visualization frameworks
- Visualization basics
- Data visualization software
- Microsoft Excel
- Google Sheets
- R programming language
- other visualization programs
- Visualization options
- visualizing categorical data
- visualizing continuous data
- Geographic maps
INTRODUCTION TO DATABASES
- A brief history of database models
- hierarchical model
- network model
- relational model
- Relational database structure
- Clinical data warehouses (CDWs)
- Structured query language (SQL)
- Learning SQL
- The seven v’s of big data related to health care data
- Technical background
- Future trends
BIOINFORMATICS and PRECISION MEDICINE
- Biological data analysis – from data to discovery
- Biological data types
- bioinformatics data in public repositories
- biomedical cancer data portals
- Tools for analyzing bioinformatics data
- command line tools
- web-based tools
- Genomic data analysis
- Genomic data analysis workflow
- variant calling pipeline for whole exome sequencing data
- quality check
- variant calling
- variant filtering and annotation
- downstream analysis
- reporting and visualization
- Precision medicine – from big data to patient care
- Examples of precision medicine
- Future trends
- Useful resources
PROGRAMMING LANGUAGES FOR DATA ANALYSIS
- R language
- installing R & RStudio
- an example R program
- getting help in R
- user interfaces for R
- R’s default user interface: RGui
- menu & dialog GUIs
- some popular R GUIs
- R graphical user interface comparison
- R resources
- Python language
- installing Python
- an example Python program
- getting help in Python
- user interfaces for Python
- R vs. Python
- Future trends
- Brief history
- data refresher
- training vs test data
- bias and variance
- supervised and unsupervised learning
- Common machine learning algorithms
- Supervised learning
- Unsupervised learning
- dimensionality reduction
- reinforcement learning
- semi-supervised learning
- Evaluation of predictive analytical performance
- classification model evaluation
- regression model evaluation
- Machine learning software
- RapidMiner Studio
- Google TensorFlow
- honorable mention
- Programming languages and machine learning
- Machine learning challenges
- Machine learning examples
- example 1 classification
- example 2 regression
- example 3 clustering
- example 4 association rules
- AI architectures
- Deep learning
- Image analysis (computer vision)
- Wearable devices
- Image libraries and packages
- Natural language processing
- NLP libraries and packages
- Text mining and medicine
- Speech recognition
- Electronic health record data and AI
- Genomic analysis
- AI platforms
- deep learning platforms and programs
- Artificial intelligence challenges
- Data issues
- Socio economic and legal
- Adverse unintended consequences
- Need for more ML and AI education
- Future trends
Robert Hoyt MD, FACP, ABPM-CI, FAMIA
Associate Clinical Professor
Department of Internal Medicine
Virginia Commonwealth University
David Hurwitz MD, FACP, ABPM-CI
Allscripts Healthcare Solutions
Madhurima Kaushal MS
Washington University at St. Louis, School of Medicine
St. Louis, MO
Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA
New York Medical College
Department of Emergency Medicine
Karen A. Monsen PhD, RN, FAMIA, FAAN
School of Nursing
University of Minnesota
Robert Muenchen MS, PSTAT
Manager, Research Computing Support
University of Tennessee
Dallas Snider PhD
Chair, Department of Information Technology
University of West Florida
A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.
The WPS Analytics’ version of the SAS language is now available in a Community Edition. This edition allows you to run SAS code on datasets of any size for free. Purchasing a commercial license will get you tech support and the ability to run it from the command line, instead of just interactively. The software license details are listed in this table.
While the WPS version of the SAS language doesn’t do everything the version from SAS Institute offers, it does do quite a lot. The complete list of features is available here.
Back in 2009, the SAS Institute filed a lawsuit against the creators of WPS Analytics, World Programming Limited (WPL), in the High Court of England and Wales. SAS Institute lost the case on the grounds that copyright law applies to software source code, not to its functionality. WPL never had access to SAS Institute’s source code, but they did use a SAS educational license to study how it works. SAS Institute lost another software copyright battle in North Carolina courts, but won over the use of their educational license. SAS Institute is suing a third time, hoping to do better by carefully choosing a pro-patent court in East Texas.
Although I prefer using R, I’m a big fan of the SAS language, as well as SAS Institute, which offers superb technical support. However, I agree with the first two court findings. Copyright law should not apply to a computer language, only to a particular set of source code that creates the language.
by Bob Muenchen & Sean MacKinnon
One of us (Muenchen) has been tracking The Popularity of Data Science Software using a variety of different approaches. One approach is to use Google Scholar to count the number of scholarly articles found each year for each software. He chose Google Scholar since it searches “across many disciplines and sources: articles, theses, books, abstracts, and court opinions, from academic publishers, professional societies, online repositories, universities, and other web sites.” Figure 1 shows the results from 1995 through 2016. Data collected in 2018 showed that while SPSS use dropped 39% from 2017 to 2018, its use was still 66% higher than that of R in 2018.
We see in the plot that SPSS was extremely dominant for most of that time period. Even after its precipitous decline, it still beats the rest by more than a 2 to 1 margin. Over the years, several people questioned the accuracy of Figure 1. In a time when scholarly publications are proliferating, how could SPSS use be in such decline?
One hypothesis that has often been suggested revolves around one of the most bizarre product name changes in the history of marketing. As a result of a legal battle for control of the name “SPSS”, the SPSS company changed the name of the product to “PASW”, an acronym for Predictive Analytics Software. The change made about as much sense as the Coke people renaming Coke to “BSW”, for Bubbly Sugar Water. The battle was settled in 2011, and the product name reverted to SPSS.
Could that name change account for the apparent decline in its use? A search on Google Scholar from 2009 to 2012 on the string:
“PASW” -“SPSS” -“Amos”
yielded 12,000 hits. That sounds like quite a few, but when “SPSS” was substituted for “PASW” in that search, we found 701,000 references. At first glance, it seems that the scholarly use of SPSS was undercounted by 1.7%. However, when searching a vast volume of documents, each string may have problems with over-counting. For example, PASW stands for “Plant Available Soil Water” which accounts for 138 of those 12,000 articles. There may be many other such abbreviations. That’s the type of analysis Muenchen did several years ago, before concluding that PASW was more trouble than it was worth (details are here). In 2018 that search yields only 361 hits, and the title of the very first article begins with, “Projections Analysis of Surface Waves (PASW)…”
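The 1.7% figure follows directly from the two hit counts. A minimal Python sketch of that arithmetic, using only the numbers quoted above (the variable names are illustrative):

```python
# Google Scholar hit counts quoted in the text for 2009 through 2012.
pasw_hits = 12_000    # "PASW" -"SPSS" -"Amos"
spss_hits = 701_000   # same search with "SPSS" substituted for "PASW"

# Share of SPSS use that would be missed by ignoring the PASW name.
undercount_pct = pasw_hits / spss_hits * 100
print(f"Apparent undercount: {undercount_pct:.1f}%")

# Removing the 138 "Plant Available Soil Water" articles barely moves it.
soil_water_hits = 138
adjusted_pct = (pasw_hits - soil_water_hits) / spss_hits * 100
print(f"After removing soil-water articles: {adjusted_pct:.2f}%")
```

Even before other stray abbreviations are removed, the PASW hits are too few to explain a decline of the size shown in Figure 1.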
Muenchen’s hypothesis regarding the apparent decline of SPSS is that it was caused by competition. Back in 2002, SPSS shared the statistical software market with SAS and a couple of others. Its momentum carried it upward for a few more years, then the competition started chipping away at it. GraphPad Prism improved significantly with the release of its version 5 in 2007 and medical users of SPSS found an alternative that was as easy to use while focusing more on their needs. R added enough useful packages around the same time to become competitive. By now there are probably hundreds of packages that people can use to analyze data, only a few of which are shown in Figure 1.
MacKinnon remained skeptical of this hypothesis because the overall graph appears to show decreases in statistical software citation over time. This would seem to contradict evidence that the number of journal articles published has been increasing at about 3% per year over the last 3 centuries, and about 3.9% per year in the past decade (2018 STM Report, pg. 25). Thus, the total number of citations to statistical software as a collective group should be increasing concurrently with this overall increase.
MacKinnon gathered data from a different source: Scopus. According to Wikipedia, “Scopus covers nearly 36,377 titles from approximately 11,678 publishers, of which 34,346 are peer-reviewed journals in top-level subject fields: life sciences, social sciences, physical sciences, and health sciences.” MacKinnon limited the search to reference lists, reasoning that such citations are likely an indicator of using the software in the paper. Two search strings were used:
REF(“the R software” OR “the R project” OR “r-project.org” OR “R development core”)
These searches are being a bit generous to SPSS by including Modeler and AMOS, and very conservative for R by not including citations to common packages (e.g., ggplot2). The resulting data are plotted in Figure 2.
Above we see that the citations of R in scholarly journals exceeded that of SPSS back in 2012. However, the scale of Figure 2 tops out at 30,000 while Figure 1’s scale peaks at 300,000. Google is finding a lot more documents! So, which of these software packages is used the most in scholarly work? Good question! We would like to hear your comments below, especially from readers who collect data from other sources.
It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which I repeat below. It shows the total number of features in each package, including the latest versions of BlueSky Statistics, JASP, and jamovi. The reviews which initially appeared as blog posts are now regularly-updated pages.
New Features in JASP
Let’s take a look at some of the new features, starting with the version of JASP that was released three hours ago:
- Interface adjustments
- Data panel, analysis input panel and results panel can be manipulated much more intuitively with sliders and show/hide buttons
- Changed the analysis input panel to have an overview of all opened analyses and added the possibility to change titles, to show documentation, and remove analyses
- Enhanced the navigation through the file menu; it is now possible to use arrow keys or simply hover over the buttons
- Added possibility to scale the entire application with Ctrl +, Ctrl – and Ctrl 0
- Added MANOVA
- Added Confirmatory Factor Analysis
- Added Bayesian Multinomial Test
- Included additional menu preferences to customize JASP to your needs
- Added/updated help files for most analyses
- R engine updated from 3.4.4 to 3.5.2
- Added Šidák correction for post-hoc tests (AN(C)OVA)
New Features in jamovi
Two of the usability features added to jamovi recently are templates and multi-file input. Both are described in detail here.
Templates enable you to save all the steps in your work as a template file. Opening that file in jamovi then lets you open a new dataset and the template will recreate all the previous analyses and graphs using the new data. It provides reusability without having to depend on the R code that GUI users are trying to avoid using.
The multi-file input lets you select many CSV files at once and jamovi will open and stack them all (they must contain common variable names, of course).
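For readers curious about what “stacking” means here, the following Python sketch shows the general idea using only the standard library. It is not jamovi’s implementation. The function name `stack_csv_texts`, the sample data, and the strict requirement that every file have identical column names in the same order are all assumptions made for illustration.

```python
import csv
import io


def stack_csv_texts(csv_texts):
    """Stack rows from several CSV sources that share the same header."""
    header = None
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames          # first file sets the columns
        elif reader.fieldnames != header:
            raise ValueError(f"columns {reader.fieldnames} do not match {header}")
        rows.extend(reader)                     # append this file's rows
    return header, rows


# Hypothetical contents of two small files with common variable names.
site_a = "id,score\n1,10\n2,12\n"
site_b = "id,score\n3,9\n"

header, stacked = stack_csv_texts([site_a, site_b])
print(header)
for row in stacked:
    print(row["id"], row["score"])
```

In-memory strings stand in for files so the example runs anywhere; with real files, each one would be opened and passed to `csv.DictReader` in the same way.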
Other new analytic features have been added with a set of modeling modules. They’re described in detail here, and a list of some of their capability is below. You can read my full review of jamovi here, and you can download it for free here.
- OLS Regression (GLM)
- OLS ANOVA (GLM)
- OLS ANCOVA (GLM)
- Random coefficients regression (Mixed)
- Random coefficients ANOVA-ANCOVA (Mixed)
- Logistic regression (GZLM)
- Logistic ANOVA-like model (GZLM)
- Probit regression (GZLM)
- Probit ANOVA-like model (GZLM)
- Multinomial regression (GZLM)
- Multinomial ANOVA-like model (GZLM)
- Poisson regression (GZLM)
- Poisson ANOVA-like model (GZLM)
- Overdispersed Poisson regression (GZLM)
- Overdispersed Poisson ANOVA-like model (GZLM)
- Negative binomial regression (GZLM)
- Negative binomial ANOVA-like model (GZLM)
- Continuous and categorical independent variables
- Omnibus tests and parameter estimates
- Confidence intervals
- Simple slopes analysis
- Simple effects
- Post-hoc tests
- Plots for up to three-way interactions for both categorical and continuous independent variables.
- Automatic selection of best estimation methods and degrees of freedom selection
- Type III estimation
New Features in BlueSky Statistics
The BlueSky developers have been working on adding psychometric methods (for a book that is due out soon) and support for distributions. My full review is here and you can download BlueSky Statistics for free here.
- Model Fitting: IRT: Simple Rasch Model
- Model Fitting: IRT: Simple Rasch Model (Multi-Faceted)
- Model Fitting: IRT: Partial Credit Model
- Model Fitting: IRT: Partial Credit Model (Multi-Faceted)
- Model Fitting: IRT: Rating Scale Model
- Model Fitting: IRT: Rating Scale Model (Multi-Faceted)
- Model Statistics: IRT: ICC Plots
- Model Statistics: IRT: Item Fit
- Model Statistics: IRT: Plot PI Map
- Model Statistics: IRT: Item and Test Information
- Model Statistics: IRT: Likelihood Ratio and Beta plots
- Model Statistics: IRT: Personfit
- Distributions: Continuous: Beta Probabilities
- Distributions: Continuous: Beta Quantiles
- Distributions: Continuous: Plot Beta Distribution
- Distributions: Continuous: Sample from Beta Distribution
- Distributions: Continuous: Cauchy Probabilities
- Distributions: Continuous: Plot Cauchy Distribution
- Distributions: Continuous: Cauchy Quantiles
- Distributions: Continuous: Sample from Cauchy Distribution
- Distributions: Continuous: Chi-squared Probabilities
- Distributions: Continuous: Chi-squared Quantiles
- Distributions: Continuous: Plot Chi-squared Distribution
- Distributions: Continuous: Sample from Chi-squared Distribution
- Distributions: Continuous: Exponential Probabilities
- Distributions: Continuous: Exponential Quantiles
- Distributions: Continuous: Plot Exponential Distribution
- Distributions: Continuous: Sample from Exponential Distribution
- Distributions: Continuous: F Probabilities
- Distributions: Continuous: F Quantiles
- Distributions: Continuous: Plot F Distribution
- Distributions: Continuous: Sample from F Distribution
- Distributions: Continuous: Gamma Probabilities
- Distributions: Continuous: Gamma Quantiles
- Distributions: Continuous: Plot Gamma Distribution
- Distributions: Continuous: Sample from Gamma Distribution
- Distributions: Continuous: Gumbel Probabilities
- Distributions: Continuous: Gumbel Quantiles
- Distributions: Continuous: Plot Gumbel Distribution
- Distributions: Continuous: Sample from Gumbel Distribution
- Distributions: Continuous: Logistic Probabilities
- Distributions: Continuous: Logistic Quantiles
- Distributions: Continuous: Plot Logistic Distribution
- Distributions: Continuous: Sample from Logistic Distribution
- Distributions: Continuous: Lognormal Probabilities
- Distributions: Continuous: Lognormal Quantiles
- Distributions: Continuous: Plot Lognormal Distribution
- Distributions: Continuous: Sample from Lognormal Distribution
- Distributions: Continuous: Normal Probabilities
- Distributions: Continuous: Normal Quantiles
- Distributions: Continuous: Plot Normal Distribution
- Distributions: Continuous: Sample from Normal Distribution
- Distributions: Continuous: t Probabilities
- Distributions: Continuous: t Quantiles
- Distributions: Continuous: Plot t Distribution
- Distributions: Continuous: Sample from t Distribution
- Distributions: Continuous: Uniform Probabilities
- Distributions: Continuous: Uniform Quantiles
- Distributions: Continuous: Plot Uniform Distribution
- Distributions: Continuous: Sample from Uniform Distribution
- Distributions: Continuous: Weibull Probabilities
- Distributions: Continuous: Weibull Quantiles
- Distributions: Continuous: Plot Weibull Distribution
- Distributions: Continuous: Sample from Weibull Distribution
- Distributions: Discrete: Binomial Probabilities
- Distributions: Discrete: Binomial Quantiles
- Distributions: Discrete: Binomial Tail Probabilities
- Distributions: Discrete: Plot Binomial Distribution
- Distributions: Discrete: Sample from Binomial Distribution
- Distributions: Discrete: Geometric Probabilities
- Distributions: Discrete: Geometric Quantiles
- Distributions: Discrete: Geometric Tail Probabilities
- Distributions: Discrete: Plot Geometric Distribution
- Distributions: Discrete: Sample from Geometric Distribution
- Distributions: Discrete: Hypergeometric Probabilities
- Distributions: Discrete: Hypergeometric Quantiles
- Distributions: Discrete: Hypergeometric Tail Probabilities
- Distributions: Discrete: Plot Hypergeometric Distribution
- Distributions: Discrete: Sample from Hypergeometric Distribution
- Distributions: Discrete: Negative Binomial Probabilities
- Distributions: Discrete: Negative Binomial Quantiles
- Distributions: Discrete: Negative Binomial Tail Probabilities
- Distributions: Discrete: Plot Negative Binomial Distribution
- Distributions: Discrete: Sample from Negative Binomial Distribution
- Distributions: Discrete: Poisson Probabilities
- Distributions: Discrete: Poisson Quantiles
- Distributions: Discrete: Poisson Tail Probabilities
- Distributions: Discrete: Plot Poisson Distribution
- Distributions: Discrete: Sample from Poisson Distribution
In my ongoing quest to track The Popularity of Data Science Software, I’ve just updated my analysis of the job market. To save you from reading the entire tome, I’m reproducing that section here.
One of the best ways to measure the popularity or market share of software for data science is to count the number of job advertisements that highlight knowledge of each as a requirement. Job ads are rich in information and are backed by money, so they are perhaps the best measure of how popular each software is now. Plots of change in job demand give us a good idea of what is likely to become more popular in the future.
Indeed.com is the biggest job site in the U.S., making its collection of job ads the best around. As their co-founder and former CEO Paul Forster stated, Indeed.com includes “all the jobs from over 1,000 unique sources, comprising the major job boards – Monster, CareerBuilder, HotJobs, Craigslist – as well as hundreds of newspapers, associations, and company websites.” Indeed.com also has superb search capabilities. It used to have a job trend plotter, but that tool has apparently been shut down.
Searching for jobs using Indeed.com is easy, but searching for software in a way that ensures fair comparisons across packages is challenging. Some software is used only for data science (e.g. SPSS, Apache Spark) while others are used in data science jobs and more broadly in report-writing jobs (e.g. SAS, Tableau). General-purpose languages (e.g. Python, C, Java) are heavily used in data science jobs, but the vast majority of jobs that use them have nothing to do with data science. To level the playing field, I developed a protocol to focus the search for each software within only jobs for data scientists. The details of this protocol are described in a separate article, How to Search for Data Science Jobs. All of the graphs in this section use those procedures to make the required queries.
I collected the job counts discussed in this section on May 27, 2019 and February 24, 2017. One might think that a sample from a single day might not be very stable, but the large number of job sources makes the counts in Indeed.com’s collection of jobs quite consistent. Data collected in 2017 and 2014 using the same protocol correlated r=.94, p=.002.
Figure 1a shows that Python is in the lead with 27,374 jobs, followed by SQL with 25,877. Java and Amazon’s Machine Learning (ML) tools are roughly 25% further below, with jobs in the 17,000s. R and the C variants come next with around 13,000. People frequently compare R and Python, but when it comes to getting a data science job, there are only half as many for R as for Python. That doesn’t mean they’re the same sort of job, of course. I still see more statisticians using R and machine learning people preferring Python, but Python is definitely on a roll! From Hadoop on down, there is a slow decline in jobs. R is also frequently compared to SAS, which has only 8,123 compared to R’s 13,800.
The scale of Figure 1a is so wide that the bottom package, H2O, appears to be zero, when in fact there are 257 jobs for it.
To let us compare the less popular software, I plotted them separately in Figure 1b. Mathematica and Julia are the leaders of this set, with around 219 jobs each. The ancient FORTRAN language is still hanging on to life with 195 jobs. The open source WEKA software and IBM’s Watson are next, with around 185 each. From XGBOOST on down, there is a fairly steady slow decline.
There are several tools that use a workflow interface: Enterprise Miner, KNIME, RapidMiner, and SPSS Modeler. They’re all around the same area between 50 and 100 jobs. In many of the other measures of popularity, RapidMiner beats the very similar KNIME tool, but here there are 50% more jobs for the latter. Alteryx is also a workflow-based tool; however, it has pulled away from the pack, appearing back on Figure 1a with 901 jobs.
When interpreting the scale on Figure 1b, what looks like zero is indeed zero. From Systat on down, none of the packages have more than 10 job listings.
It’s important to note that the values shown in Figures 1a and 1b are single points in time. The number of jobs for the more popular software does not change much from day to day. Therefore, the relative rankings of the software shown in Figure 1a are unlikely to change much over the coming year or two. The less popular packages shown in Figure 1b have such low job counts that their ranking is more likely to shift from month to month, though their position relative to the major packages should remain more stable.
Next, let’s look at the change in jobs from the 2017 data to now (2019). Figure 1c shows the percent change for those packages that had at least 100 job listings back in 2017. Without such a limitation, software that goes from 1 job in 2017 to 5 jobs in 2019 would have a 400% increase, but still would be of little interest. Software whose job market is heating up, or growing, is shown in red, while those that are cooling down are shown in blue.
TensorFlow, the deep learning software from Google, is the fastest growing at 523%. Next is Apache Flink, a tool that analyzes streaming data, at 289%. H2O is next, with 150% growth. Caffe is another deep learning framework and its 123% growth reflects the popularity of artificial intelligence algorithms.
Python shows “only” 97% growth, but its popularity was already so high that the 13,471 jobs that it added surpass the total jobs of many of the other packages!
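The arithmetic behind a plot like Figure 1c can be sketched in a few lines of Python. This is only an illustration, not the code used to build the figure: the function names and the "TinyTool" entry are hypothetical, and the Python counts are taken from the numbers quoted in this section (27,374 jobs in 2019, of which 13,471 were added since 2017).

```python
def percent_change(old, new):
    """Percent change from the earlier count to the later one."""
    return (new - old) / old * 100.0


def growth_table(counts_old, counts_new, min_old=100):
    """Percent change per package, keeping only packages that had
    at least `min_old` listings in the earlier year."""
    return {
        name: percent_change(old, counts_new[name])
        for name, old in counts_old.items()
        if old >= min_old and name in counts_new
    }


# Python's 2017 count is implied by the figures quoted in the text.
python_2019 = 27374
python_2017 = python_2019 - 13471  # 13,903

# "TinyTool" is a made-up package used only to show the cutoff at work.
jobs_2017 = {"Python": python_2017, "TinyTool": 1}
jobs_2019 = {"Python": python_2019, "TinyTool": 5}

table = growth_table(jobs_2017, jobs_2019)
print({name: round(pct) for name, pct in table.items()})  # {'Python': 97}
print(round(percent_change(1, 5)))  # 400
```

The tiny package shows a far larger percentage than Python while adding only four jobs, which is why a minimum count in the base year is applied before plotting.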
Tableau is showing a similar rate of growth, though it was a comparatively small number of additional jobs, at 4,784.
From the Julia language on down, we see a slowing decrease in growth. I’m surprised to see that jobs for SAS and SPSS are still growing, though barely at 6% and 1%, respectively.
If you enjoyed reading this article, you might be interested in my recent series of reviews on point-and-click front-ends for the R language. I invite you to subscribe to this blog, or follow me on Twitter.
In my neverending quest to track The Popularity of Data Science Software, it’s time to update the section on Scholarly Articles. The rapid growth of R could not go on forever and, as you’ll see below, its use actually declined over the last year.
Scholarly articles provide a rich source of information about data science tools. Because publishing requires significant amounts of effort, analyzing the type of data science tools used in scholarly articles provides a better picture of their popularity than a simple survey of tool usage. The more popular a software package is, the more likely it will appear in scholarly publications as an analysis tool, or even as an object of study.
Since scholarly articles tend to use cutting-edge methods, the software used in them can be a leading indicator of where the overall market of data science software is headed. Google Scholar offers a way to measure such activity. However, no search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. The details of the search terms I used are complex enough to move to a companion article, How to Search For Data Science Articles. Since Google regularly improves its search algorithm, each year I collect data again for the previous years (with one exception noted below).
Figure 2a shows the number of articles found for the more popular software packages and languages (those with at least 1,700 articles) in the most recent complete year, 2018. To allow ample time for publication, insertion into online databases, and indexing, the data were collected on 3/28/2019.
SPSS is by far the most dominant package, as it has been for over 20 years. This may be due to its balance between power and ease-of-use. R is in second place with around half as many articles. It offers extreme power, though with less ease of use. SAS is in third place, with a slight lead over Stata, MATLAB, and GraphPad Prism, which are nearly tied.
Note that the general-purpose languages: C, C++, C#, FORTRAN, Java, MATLAB, and Python are included only when found in combination with data science terms, so view those counts as more of an approximation than the rest.
The next group of packages goes from Python through C, with usage declining slowly. The next set starts at Caffe, dropping nearly 50%, and continuing to IBM Watson with a slow decline.
The last two packages in Fig 2a are Weka and Theano, which are quite a drop from IBM Watson, though it’s getting harder to see as the lines shrink.
To continue on this scale would make the remaining packages all appear too close to the y-axis to read, so Figure 2b shows the remaining software on a much smaller scale, with the y-axis going to only 1,700 rather than the 80,000 used on Figure 2a.
I chose to begin Figure 2b with software that has fewer than 1,700 articles because it allows us to see RapidMiner and KNIME on the same scale. They are both workflow-driven tools with very similar capabilities. This plot shows RapidMiner with 49% greater usage than KNIME. RapidMiner uses more marketing, while KNIME depends more on word-of-mouth recommendations and a more open source model. The IT advisory firms Gartner and Forrester rate them as tools able to hold their own against the commercial titans, IBM’s SPSS and SAS. Given that SPSS has roughly 50 times the usage in academia, that seems like quite a stretch. However, as we will soon see, usage of these newer packages is growing, while the use of the older ones is shrinking quite rapidly.
Figure 2b also lets us see IBM’s SPSS Modeler, SAS Enterprise Miner, and Alteryx on the same plot. These three are also workflow-driven tools which are quite expensive. None are doing as well here as RapidMiner or KNIME, tools that are much less expensive – or free – depending on how you use them (KNIME desktop is free but
Another interesting comparison
Even newer on the GUI for R scene is BlueSky Statistics, which doesn’t appear on the plot at all since it has zero scholarly articles so far. It was created by a new company and only adopted an open source model a few months ago.
While Figures 2a and 2b are useful for studying market share as it stands now, they don’t show how things are changing. It would be ideal to have long-term growth trend graphs for each of the analytics packages, but collecting that much data annually is too time-consuming. What I’ve done instead is collect data only for the past two complete years, 2017 and 2018. This provides the data needed to study year-over-year changes.
Figure 2c shows the percent change across those years, with the growing “hot” packages shown in red (right side); the declining or “cooling” are shown in blue (left side). Since the number of articles tends to be in the thousands or tens of thousands, I have removed any software that had fewer than 1,000 articles in 2015. A package that grows from 1 article to 5 may demonstrate 400% growth but is still of little interest.
The recent changes in data science software can be summarized succinctly: AI/ML up; statistics down. The software that is growing contains none of the packages that are associated more with statistical analysis. The software in decline is dominated by the classic packages of statistics: SPSS Statistics, SAS, GraphPad Prism, Stata, Statgraphics, R, Statistica, Systat, and Minitab. JMP is the only traditional statistics package whose scholarly usage is growing. Of the machine learning software that’s declining in usage, there are rough equivalents that are growing (e.g. Mahout down, Spark up).
Statistics software has been around much longer than AI/ML software, having started back in the days before open source. Stat vendors have been adding AI/ML methods to their software, making them the more comprehensive solutions. The AI/ML vendors or projects are missing an opportunity to add more comprehensive statistics capabilities. Some, such as RapidMiner and KNIME, are indeed expanding in this direction, but very slowly.
At the top of Figure 2c, we see that the deep learning packages Keras and TensorFlow are the fastest growing at nearly 150%. PyTorch is not shown here because it did not have enough usage in the previous year. However, its citation rate went from 616 to 4,670, a substantial 658% growth rate! There are other packages that are not shown here, including JASP with 223% growth, and jamovi with 720% growth. Despite such high growth, the latter still only has 108 citations in 2018. The rapid growth of JASP and jamovi lends credence to the perspective that the overall pattern of change shown in Figure 2c may be more of a result of free vs. expensive software. Neither of them offers any AI/ML features.
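To put those growth rates in perspective, here is a small Python sketch that checks the PyTorch figure and backs out the approximate 2017 count implied by jamovi’s numbers. The function names are my own for illustration; only the counts and percentages quoted above come from the data, and the jamovi base is an estimate derived from them rather than a measured value.

```python
def growth_pct(old, new):
    """Year-over-year percent growth from `old` to `new`."""
    return (new - old) / old * 100.0


def implied_base(new, growth):
    """Earlier-year count implied by a later count and a percent growth."""
    return new / (1.0 + growth / 100.0)


# PyTorch: 616 articles in 2017, 4,670 in 2018.
pytorch_growth = growth_pct(616, 4670)
print(round(pytorch_growth))  # 658

# jamovi: 108 articles in 2018 after 720% growth.
jamovi_2017_estimate = implied_base(108, 720)
print(round(jamovi_2017_estimate, 1))  # 13.2
```

A base of roughly 13 articles shows how a very large percentage can come from a very small starting point, which is the reason such packages are left off the plot.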
Scikit Learn, the Python machine learning library, was a fast grower with a 60% increase.
In the RapidMiner vs. KNIME contest, we saw previously that RapidMiner was ahead. From this plot, we see that KNIME is growing slightly (5.7%) while RapidMiner is declining slightly (1.8%).
The biggest losers in Figure 2c are SPSS, down 39%, and SAS, Prism, and Mahout, all down 24%. Even R is down 13%. Recall that Figure 2a shows that despite recent years of decline, SPSS is still extremely dominant for scholarly use, and R and SAS are still the #2 and #3 most widely used packages in this arena.
I’m particularly interested in the long-term trends of the classic statistics packages. So in Figure 2d I have plotted the same scholarly-use data for 1995 through 2016.
SPSS has a clear lead overall, but now you can see that its dominance peaked in 2009 and its use is in sharp decline. SAS never came close to SPSS’ level of dominance, and its use peaked around 2010.
In Figure 2d, the extreme dominance of SPSS makes it hard to see long-term trends in the other software. To address this problem, I have removed SPSS and all the data from SAS except for 2014 and 2015. The result is shown in Figure 2e.
Figure 2e makes it easy to see that most of the remaining packages grew steadily across the time period shown. R and Stata grew especially fast, as did Prism until 2012. Note that the decline in the number of articles that used SPSS, SAS, or Prism is not balanced by the increase in the other software shown in this particular graph. Even adding up all the other software shown in Figures 2a and 2b doesn’t account for the overall decline. However, I’m looking at only 58 out of over 100 data science tools.
While Figures 2d and 2e show the historical trend that ended in 2016, Figure 2f shows a fresh set of data collected in
In Figure 2f we can see that the downward trends of SAS, Prism, and Statistica are continuing. We also see that the long and rapid growth of R and Stata has come to an end. Growth that rapid can’t go on forever. It will be interesting to see next year whether this is merely a flattening of usage or the beginning of a declining trend. As I pointed out in my book, R for Stata Users, there are many commonalities between R and Stata. As a result of this, and the fact that R is open source, I expect
SPSS’ long-term rapid decline has to level out at some point. Its lead has been chipped away at by many competitors. However, until recently these competitors have either been free and code-based such as R, or menu-based and proprietary, such as Prism. With the fairly recent arrival of JASP, jamovi, and BlueSky Statistics, SPSS now faces software that is both free and menu-based. Previous projects to add menus to R, such as the R Commander and Deducer, were also free and open source, but they required installing R separately and then using R code to activate the menus.
These results apply to scholarly articles in general. The results in specific fields or journals are very likely to be different.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the job advertisements that list data science software. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
Update: an earlier version of this post included figures that I’ve removed at the request of Forrester, Inc.
In my previous post, I discussed Gartner’s reviews of data science software companies. In this post, I describe Forrester’s coverage and discuss how radically different it is. As usual, this post is already integrated into my regularly-updated article, The Popularity of Data Science Software.
Forrester Research, Inc. is a leading global research and advisory firm that reviews data science software vendors. Studying their reports and comparing them to Gartner’s can provide a deeper understanding of the software these vendors provide.
Historically, Forrester has conducted their analyses similarly to Gartner’s. That approach compares point-and-click style software like KNIME to software that emphasizes coding, such as Anaconda. To make apples-to-apples comparisons, Forrester decided to split the two types of software into separate reports.
The Forrester Wave: Multimodal Predictive Analytics and Machine Learning Solutions, Q3, 2018 covers software that is controllable by various means such as menus, workflows, wizards, or code (as of 3/22/2019 available free here). Forrester plans to cover tools for automated modeling in a separate report, due out in 2019. Given that automation is now a widely adopted feature of several of the companies covered in this report, that seems like an odd approach.
Forrester divides the vendors into four categories: Leaders, Strong Performers, Contenders, and Challengers.
In the Leaders category, they include IBM, while Gartner viewed them as a middle-of-the-pack Visionary. Forrester and Gartner both view SAS and RapidMiner as leaders.
The Strong Performers category includes KNIME, which Gartner considered a Leader. Datawatch and Tibco are tied in this segment while Gartner had them far apart, with Datawatch put in very last place by Gartner. Forrester has KNIME and SAP next to each other in this category, while Gartner had them far apart, with KNIME a Leader and SAP a Niche Player. Dataiku is here too, with a similar rating to Gartner.
The Contenders segment contains Microsoft and Mathworks, in positions similar to Gartner’s. Fico is here too; Gartner did not evaluate them.
Forrester’s Challengers segment includes World Programming, which sells SAS-compatible software, and Minitab, which purchased Salford Systems. Neither was considered by Gartner.
The Forrester Wave: Notebook-Based Solutions, Q3, 2018 reviews software controlled by notebooks, which blend programming code and output in the same window (as of 3/22/2019 available here).
Forrester rates some of the notebook-based vendors very differently than Gartner. Here Domino Data Labs is a Leader while Gartner had them at the extreme other end of their plot, in the Niche Players quadrant. Oracle is also shown as a Leader, though its strength in this market is minimal.
In the Strong Performers category are Databricks and H2O.ai, in very similar positions compared to Gartner. Civis Analytics and OpenText are also in this category; neither was reviewed by Gartner. Cloudera is here as well; it too was left out by Gartner.
Forrester’s Contenders category contains Google, in a similar position compared to Gartner’s analysis. Anaconda is here too, in a position quite a bit higher than in Gartner’s plot.
The only two companies rated by Gartner but ignored by Forrester are Alteryx and DataRobot. The latter will no doubt be covered in Forrester’s report on automated modelers, due out this summer.
As with my coverage of Gartner’s report, my summary here barely scratches the surface of the two Forrester reports. Both provide insightful analyses of the vendors and the software they create. I recommend reading both (and learning more about open source software) before making any purchasing decisions.
To see many other ways to estimate the market share of this type of software, see my ongoing article, The Popularity of Data Science Software. My next post will update the scholarly use of data science software, a leading indicator. You may also be interested in my in-depth reviews of point-and-click user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!