The field of data science is changing so rapidly that it’s quite hard to keep up with it all. When I first started tracking The Popularity of Data Science Software in 2010, I followed only ten packages, all of them classic statistics software. The term data science hadn’t caught on yet, and data mining was still a new thing. One of my recent blog posts covered 53 packages, and choosing them from a list of around 100 was a tough decision!
To keep up with the rapidly changing field, you can read the information on a package’s web site, see what people are saying on blog aggregators such as R-Bloggers.com or StatsBlogs.com, and if it sounds good, download a copy and try it out. What’s much harder to do is figure out how all these packages relate to one another. A helpful source of information on that front is the book Disruptive Analytics, by Thomas Dinsmore.
I was lucky enough to be the technical reviewer for the book, during which time I ended up reading it twice. I still refer to it regularly as it covers quite a lot of material. In a mere 262 pages, Dinsmore manages to describe each of the following packages, how they relate to one another, and how they fit into the big picture of data science:
- Alluxio
- Alpine Data
- Alteryx
- APAMA
- Apex
- Arrow
- Caffe
- Cloudera
- Deeplearning4J
- Drill
- Flink
- Giraph
- Hadoop
- HAWQ
- Hive
- IBM SPSS Modeler
- Ignite
- Impala
- Kafka
- KNIME Analytics Platform
- Kylin
- MADlib
- Mahout
- MapR
- Microsoft R Server
- Phoenix
- Pig
- Python
- R
- RapidMiner
- Samza
- SAS
- SINGA
- Skytree Server
- Spark
- Storm
- Tajo
- TensorFlow
- Tez
- Theano
- Trafodion
As you can tell from the title, a major theme of the book is how open source software is disrupting the data science marketplace. Dinsmore’s blog, ML/DL: Machine Learning, Deep Learning, extends the book’s coverage as data science software changes from week to week.
I highly recommend both the book and the blog. Have fun keeping up with the field!