Using Discussion Forum Activity to Estimate Analytics Software Market Share

I’m finally getting around to overhauling the Discussion Forum Activity section of The Popularity of Data Analysis Software. To save you the trouble of reading all 43 pages, I’m posting just this section below.

Discussion Forum Activity

Another way to measure software popularity is to see how many people are helping one another use each package or language. While such data is readily available, it too has its problems. Menu-driven software like SPSS or workflow-driven software such as KNIME is quite easy to use and tends to generate fewer questions. Software controlled by programming requires the memorization of many commands and so requires more support. Even within languages, some are harder to use than others, generating more questions (see Why R is Hard to Learn).

Another problem with this type of data is that there are many places to ask questions and each has its own focus. Some are interested in a classical statistics perspective while others have a broad view of software as general-purpose programming languages. In recent years, companies have set up support sites within their main corporate web site, further splintering the places you can go to get help. Usage data for such sites is not readily available.

Another problem is that it’s not as easy to use logic to focus in on specific types of questions as it was with the data from job advertisements and scholarly articles discussed earlier. It’s also not easy to get the data across time to allow us to study trends. Finally, what such sites measure varies: software group members (a.k.a. followers), individual topics (a.k.a. questions or threads), and total comments across all topics (a.k.a. total posts). This makes combining counts across sites problematic.
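
To make the metric problem concrete, here is a minimal Python sketch of one way to keep the three kinds of counts apart so that followers are never added to topics or posts. The `SiteCount` record, the site names, and every number are placeholders invented for illustration; none of them are collected data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCount:
    site: str      # hypothetical site name
    package: str   # software being discussed
    metric: str    # "followers", "topics", or "posts"
    count: int


def totals_by_metric(records):
    """Sum counts per package, but only within the same metric type."""
    totals = defaultdict(lambda: defaultdict(int))
    for r in records:
        totals[r.metric][r.package] += r.count
    # Convert the nested defaultdicts to plain dicts for printing and comparison.
    return {metric: dict(pkgs) for metric, pkgs in totals.items()}


# Placeholder numbers only; they are NOT real measurements.
records = [
    SiteCount("site_a", "R", "followers", 100),
    SiteCount("site_b", "R", "followers", 50),
    SiteCount("site_c", "R", "topics", 30),
    SiteCount("site_d", "R", "posts", 400),
]

print(totals_by_metric(records))
```

Grouping by metric first means two follower counts can be totaled, while a topic count and a post count for the same package stay in separate buckets.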

Two of the biggest sites used to discuss software are LinkedIn and Quora. They both display the number of people who follow each software topic, so combining their figures makes sense. However, since the sites lack any focus on analytics, I have not collected their data on general purpose languages like Java, MATLAB, Python or variants of C. The results of data collected on 10/17/2015 are shown here:

LinkedIn_Quora_2015

We see that R is the dominant software and that moving down through SAS, SPSS, and Stata results in a loss of roughly half the number of people in each step. Lavastorm follows Stata, but I find it odd that there was absolutely zero discussion of Lavastorm on Quora. The last bar that you can even see on this plot is the 62 people who follow Minitab. All the ones below that have tiny audiences of fewer than 10.
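
The "roughly half at each step" pattern can be checked with a few lines of Python. This is only a sketch: the function name `step_ratios` is hypothetical, and the totals below are placeholders chosen to show exact halving, not the figures behind the plot.

```python
def step_ratios(counts):
    """Return (higher, lower, ratio) for each step down the ranking."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (hi_name, lo_name, lo / hi)
        for (hi_name, hi), (lo_name, lo) in zip(ranked, ranked[1:])
        if hi > 0  # skip steps that would divide by zero
    ]


# Placeholder follower totals; NOT the collected LinkedIn/Quora data.
example = {"pkg_a": 8000, "pkg_b": 4000, "pkg_c": 2000, "pkg_d": 1000}

for hi_name, lo_name, ratio in step_ratios(example):
    print(f"{hi_name} -> {lo_name}: {ratio:.2f}")
```

A ratio near 0.50 at each step corresponds to losing about half the followers when moving down one rank.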

Next let’s examine two sites that focus only on statistical questions: Talk Stats and Cross Validated. They both report the number of questions (a.k.a. threads) for a given piece of software, allowing me to total their counts:

CrossValidated_TalkStats_2015

We see that R has a 4-to-1 lead over the next most popular package, SPSS. Stata comes in at 3rd place, followed by SAS. The fact that SAS is in fourth place here may be due to the fact that it is strong in data management and report writing, which are not the types of questions that these two sites focus on. Although MATLAB and Python are general purpose languages, I include them here because the questions on these sites are within the realm of analytics. Note that I collected data on as many packages as were shown in the previous graph, but those not shown have a count of zero. Julia appears to have a count of zero due to the scale of the graph, but it actually had 5 questions on Cross Validated.
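
Totaling question counts from two sites, and reading off the leader's margin, might look like the following Python sketch. The function names and the per-site numbers are assumptions made up for illustration; a package missing from one site is simply treated as contributing zero there.

```python
from collections import Counter


def combine_question_counts(site_one, site_two):
    """Add per-package question counts from two sites."""
    total = Counter(site_one)
    total.update(site_two)  # Counter.update adds counts rather than replacing them
    return total


def lead_over_runner_up(total):
    """Return (leader, runner_up, ratio); assumes at least two packages with nonzero counts."""
    (first, n1), (second, n2) = total.most_common(2)
    return first, second, n1 / n2


# Placeholder counts; NOT the Cross Validated or Talk Stats figures.
site_one = {"pkg_a": 300, "pkg_b": 60, "pkg_c": 5}
site_two = {"pkg_a": 100, "pkg_b": 40}

total = combine_question_counts(site_one, site_two)
print(total)
print(lead_over_runner_up(total))
```

With these placeholder inputs the leader has four times as many questions as the runner-up, which is the kind of ratio described above.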

If you found this interesting, you can read about the results of other surveys and several other ways to measure software popularity here.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

28 thoughts on “Using Discussion Forum Activity to Estimate Analytics Software Market Share”

  1. Maybe you should consider postings on Statalist.org and/or include the archives from the original Statalist (the listserv archives are still accessible online). It seems fairly unrealistic that there would be a greater number of SPSS users conversing (which supports your argument about the GUI-based usage), but in a little over 1 year there have been 52,587 postings on the Statalist website. It seems a bit biased not to include forums where these users typically exchange information (e.g., it would be no different than leaving out any of the R-based discussion groups/listservs).

    1. Hi William,

      I would love to have such data from at least the top 10 vendors. But since I can’t get it from all of them, it seemed more biased to talk about only a few. But I’m glad you posted the figure for Statalist. That at least gives people other data to consider. If other vendors provide their figures, I will most happily post them!

      Cheers,
      Bob

      1. This isn’t a vendor figure. The Statalist is a community forum. For the first 10-15 years or so it was a listserv hosted at Harvard. When the system admin who maintained the server retired, Statalist migrated to an online forum open to all (www.statalist.org). This is no different from the 51,619 postings on the Mplus discussion boards that can be found at http://statmodel.com/cgi-bin/discus/discus.cgi. There are also other venues where end users are more likely to post questions based on context. For example, I’d be much more likely to pose a question about Structural Equation Modeling to SEMNET than I would on LinkedIn purely based on the qualifications of the SEMNET participants. This isn’t to say that LinkedIn may not provide a useful forum, but that the probability of getting higher quality responses, in conjunction with the decreased need to further vet the response, would make SEMNET more appealing in that case. I don’t think a bit of webcrawling to identify these more specialized venues would be too extensive an effort, and it would surely yield a much higher quality metric, as you would gain a more precise measure of within-platform user engagement from which to make between-platform comparisons (you could also potentially normalize based on individual users who may be the major driving forces behind a given platform).

        1. Hi William,

          Thanks for clarifying that. When it moved, I thought Statcorp opened an internal support group as SAS Institute and IBM (for SPSS) have done. I definitely agree with you in principle. However, I think finding the best support group for each of the 27 pieces of software I’m following would be a challenge. Once found, each may display different metrics such as questions vs. total posts; some may display no metrics. If you’re game, give it a shot & I’ll add it! I do appreciate the comment & I’ll poke around at it when I get some time. I used to track only the major stat packages by year but stopped this year because the number of places for support had splintered so much.

          I’m back after some exploration & adding to this reply. Here are some examples of how difficult this topic is. The main SPSS support list is now here:
          https://developer.ibm.com/predictiveanalytics/forums/
          but there’s no way to see how many posts there are. Here’s one of the main SAS support sites:
          https://listserv.uga.edu/cgi-bin/wa?A0=SAS-L
          where you can count the posts, but they’re listed by week! So you have to open a week, manually count the number (e.g. paste into Excel & check the number of rows) and do that 52 times per year. I actually did that through 2012, but the numbers on all the main lists are dropping due, I think, to the proliferation of support sources. I did a search for “the” across a year, hoping I could do this all in one fell swoop, but the search only returns 50 results at a time.

          On SAS Institute’s main community support site (https://communities.sas.com/), they do say there are 220,101 total posts, and I know from previous work that there are 339,498 posts on SAS-L through 2012.

          So while your approach would be ideal, I don’t see a way to get the data. But I figured that by using none of the vendor-supported sites, it would at least level the playing field. I believe that Statalist is at least recommended by the company as the main place to get help.

          Cheers,
          Bob

  3. Bob,

    On LinkedIn, the number of members alone is a misleading statistic; the level of engagement matters. Gregory Piatetsky-Shapiro nailed it on KDnuggets:

    http://www.kdnuggets.com/2015/05/top-linkedin-groups-analytics-big-data-mining-activity-engagement.html

    Lavastorm is an ankle-biter in analytics, but they have a LinkedIn group and actively spam people to join the group. I suspect that the vast majority of those who join have never read, posted or contributed.

    Regards,

    Thomas

    1. Hi Thomas,

      I figured that’s what Lavastorm was doing. It was way too suspicious that they were so popular but only on one site. In no other measure did Lavastorm end up anywhere but near the bottom…except of course their own survey of users! Thanks for the reference to Gregory’s work on LinkedIn. I forgot to mention in the text that some software, notably SAS, had a half-dozen or so LinkedIn groups. I gave each the benefit of the doubt by choosing the one that had the greatest number of members.

      Cheers,
      Bob

      1. Bob,

        Another thing you can do with LinkedIn is use Advanced Search to count people with selected keywords. For example, there are 429,398 LinkedIn users with “SAS” as a keyword, and 948,803 users with “SPSS” as a keyword, and 3,213,601 with “R” as a keyword. Of course “R” is subject to the usual caveats :-), but when I looked through the hits they actually did look like R users.

        If you’re willing to pay LinkedIn for a higher membership level, you can query on membership in selected groups as well.

        Regards,

        Thomas

  5. I think it should be noted that SAS, SPSS and other commercial packages are systematically underestimated because their licenses include vendor tech support through channels that are not publicly visible (phone line, e-mail, closed forum). Virtually every user of these packages may contact the vendor and be sure to receive some kind of answer. In R, only a minority of users can go the same way.

    I don’t think that this factor is important enough to turn the order of packages around, but I guess it’s worth noting anyway.

    1. Hi Miroslaw,

      That’s a good point. I’ve used SAS Institute’s and SPSS’ tech support extensively and they’re both superb. You can purchase similar support for open source software, but as you point out, only a small proportion of users have done so.

      Cheers,
      Bob

  7. StataCorp (as it now is; note spelling) has had technical support staff since it started, and thus long before its move to Texas, if that is what you are alluding to. So, as you and others surmise, there is substantial traffic in Stata support questions, which remains private to the company and the users concerned.

    It is hard for anyone to keep up to speed even roughly on what is happening. Twitter, Facebook and Reddit carry everything from gossip, groans and opinionated advocacy or abuse of statistical software to serious technical questions about particular programs. You would, naturally, have a hard job at processing their websites to assess interest systematically and quantitatively.

    1. Hi Nick,

      I thought that Statalist was on a listserv at one location which was later moved to another. I could be totally off base though since I follow so many tools that it’s pretty crazy keeping track of it all. SAS and SPSS had both set up internal alternatives to the SAS-L and SPSSX-L listservs, and I thought StataCorp had done the same. Thanks for clarifying!

      Cheers,
      Bob

  9. Hi,

    CrossValidated is part of the StackExchange network, specific to statistics. To use their numbers you should also include other related SE sites:

    * StackOverflow is for programming questions and has tags for R, Python, Stata, Matlab and more (http://stackoverflow.com/)
    * They have one dedicated to Mathematica (http://mathematica.stackexchange.com/)

    Somewhat related, Cross-Validated has a meta post about which external websites could be used to ask language-specific questions: http://meta.stats.stackexchange.com/questions/793/internet-support-for-statistics-software
    The number of votes for each language indicates how often this information was considered ‘useful’ and R is the clear winner there as well.

    1. Hi Freek,

      Thanks for that very useful link! Ideally, I would like to compare the number of posts per year per software across all sites. It would even be good to plot that type of data for the “best” support site for each package. However, given my time constraints, all I could do was to pick a few that were used by many packages and whose data were easy to get.

      Thanks,
      Bob
