Learn to Manage Data at useR! 2014 or Online April 25

Before you can analyze data, it must be in the right form. Join me on April 25th for a 4-hour webinar that shows how to perform the most commonly used data management tasks in R. We will work through hands-on examples of R’s popular add-on packages such as plyr, reshape, stringr, lubridate and sqldf. I’ll also be presenting a 3-hour version at the useR! 2014 conference. Here’s a list of the topics covered:

  1. Transformation basics
  2. Conditional transformations
  3. Summarization of columns and rows
  4. Summarization by group
  5. Analysis by group
  6. Sorting data
  7. Selecting first or last observation per group
  8. Miscellaneous variable tools (rename, keep, drop)
  9. Stacking data frames
  10. Finding and removing duplicate observations
  11. Merging data frames
  12. Reshaping data frames
  13. Character string manipulations
  14. Date / time manipulations (not in shorter useR! presentation)
  15. Using SQL within R (not in shorter useR! presentation)

Many examples come from my books, R for SAS and SPSS Users and R for Stata Users. That makes it easy to review later what we did, with full explanations, or to learn more about a particular subject by extending an example that you have already seen.

At the end of the workshop, you will receive a set of practice exercises for you to do on your own time, as well as solutions to the problems. I will be available via email at any time in the future to address these problems or any other topics in my workshops or books. I hope to see you there!

SAS, SPSS, Stata Users: Learn R from Home April 21

Has learning R been driving you a bit crazy? If so, it may be that you’re “lost in translation.” On April 21 and 23, I’ll be teaching a webinar, R for SAS, SPSS and Stata Users. With each R concept, I’ll introduce it using terminology that you already know, then translate it into R’s very different view of the world. You’ll be following along, with hands-on practice, so that by the end of the workshop R’s fundamentals should be crystal clear. The examples we’ll do come right out of my books, R for SAS and SPSS Users and R for Stata Users. That way if you need more explanation later or want to dive in more deeply, the book of your choice will be very familiar. Plus, the table of contents and the index contain topics listed by SAS/SPSS/Stata terminology and R terminology so you can use either to find what you need. A complete outline of the workshop plus a registration link is here.

Analytics Software Popularity Update: Counting Blogs, Simplifying Job Searches

My latest update to The Popularity of Data Analysis Software is an attempt to use blog counts to estimate the popularity of analytics software. While I was able to greatly broaden the coverage of packages when studying job data, I made very little progress on the blog measure, adding new coverage for only Python and updating the previous counts for Stata. I post the results here mostly as a request for input from people who may know of more sources of blog lists than I have found so far.

I’ve also updated the jobs data slightly both in the main article and in the background one, How to Search for Data Science Jobs. The search algorithm is now greatly simplified, but the changes are worth reading only for people who are doing their own searches. Rather than jump through hoops to estimate total jobs for each software package, I only count those for the main set of search terms. The relative results from the new search algorithm are nearly identical to those from the previous, more complex one (r = .99).

Here’s the update on blogs:

Blogs

On Internet blogs, people write about software that interests them, showing how to solve problems and interpreting events in the field. Blog posts contain a great deal of information about their topic, and although a blog is not as time consuming to write as a book, maintaining one certainly requires effort. Therefore, the number of bloggers writing about analytics software has potential as a measure of popularity or market share. Unfortunately, counting the number of relevant blogs is often a difficult task. General purpose software such as Java, Python, the C language variants and MATLAB have many more bloggers writing about general programming topics than just analytics. But separating them out isn’t easy. The name of a blog and the title of its latest post may not give you a clue that it routinely includes articles on analytics.

Another problem is that what some companies would write up as a newsletter, others publish as a set of blogs: several people in the company each contribute their own blog, and these are also combined into a single company blog. StatSoft and Minitab offer examples of this. What’s really interesting is not company employees who are assigned to write blogs, but rather volunteers who freely provide their time.

In a few lucky cases, lists of such blogs are maintained, usually by blog consolidators, who combine many blogs into large “metablogs.” All I have to do is find such lists and count the blogs. I don’t attempt to extract the few vendor employees that I know are blended into such lists. However, I skip those lists that are exclusively employee-based (or very close to it). The results are shown in Table 1.

Software   Number of Blogs   Source
R          452               R-Bloggers.com
Python      60               SciPy.org
SAS         40               PROC-X.com, sasCommunity.org Planet
Stata       11               Stata-Bloggers.com

Table 1. Number of blogs devoted to each software package on March 5, 2014, and the source of the data.
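
To compare the counts in Table 1 on a common scale, here is a minimal sketch that converts them to each package's share of all the blogs counted. It is written in Python for illustration only; the counts are copied from Table 1, while the names blog_counts and blog_shares are my own and not part of the original analysis.

```python
# Blog counts copied from Table 1 (March 5, 2014).
blog_counts = {"R": 452, "Python": 60, "SAS": 40, "Stata": 11}


def blog_shares(counts):
    """Return each package's fraction of all the blogs counted.

    Hypothetical helper for illustration: it simply divides each
    count by the sum of all counts.
    """
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = blog_shares(blog_counts)

# Print the packages from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<8}{blog_counts[name]:>5}{share:>8.1%}")
```

These shares are relative only to the four lists I was able to find, so they say nothing about packages for which no blog list exists.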

R’s 452 blogs is quite an impressive number. For Python, I could only find that list of 60 that were devoted to the SciPy subroutine library. Some of those likely cover topics besides analytics, but to determine which never cover the topic would be quite time consuming. The 40 blogs about SAS is still an impressive figure given that Stata was the only other software that even garnered a list anywhere. That list is at the vendor itself, StataCorp, but it consists of non-employees except for one.

While searching for lists of blogs on other software, I did find individual blogs that at least occasionally covered a particular topic. However, keeping a list of such individual blogs up to date is far too time consuming given the relative ease with which other popularity measures are collected.

If you know of other lists of relevant blogs, please let me know and I’ll add them. If you’re a software vendor CEO reading this, and your company does not build a metablog or at least maintain a list of your bloggers, I recommend taking advantage of this important source of free publicity.