R book review: Two approaches to learning data science
In 2012, the Harvard Business Review declared data scientist “the sexiest job of the 21st century”. Six years on, it is still “America’s Hottest Job”, according to Bloomberg. As a statistician or data analyst, you might be willing to give it a try. Of course, you need to know R. And both Modern Data Science with R (MDSR) and R for Data Science (R4DS) may help you find your way.
They are both awesome books, written clearly and with just a hint of humour, that teach you how to do data science using R and the “tidyverse”. Each has its own point of view and approach: MDSR is about data science done with R, while R4DS is about R and how to use it for data science applications.
The aim of R4DS is to teach R and, in particular, the tidyverse – a suite of powerful R packages created by one of the authors, Hadley Wickham, to make R more palatable. Although the book does not deal with every technique a data scientist might use, it covers the tasks all data scientists will need to perform: exploring data, working with data, creating models and communicating results. It does so by describing the tools in depth and showing example applications.
MDSR takes a wider view of data science, understood as a field that inherits from statistics and computer science. Its aim is to educate data scientists in a wide range of competencies, from data wrangling and visualization to using spatial data, text data and network science, including enticing applications such as scraping data from the internet or social media. Also, even though the book is not about “big data”, it provides a short introduction to the subject. Rather than making you proficient in a specific area, MDSR helps you start tinkering with different types of analyses on many interesting data sets (on baseball, US flights and airports, American elections, car models, baby names, etc.), a few of which are also used by R4DS. As expected from the title, the book mainly uses R, but unexpectedly it also includes an introduction to SQL, which is nonetheless essential for any data scientist.
Although R4DS is rather more about the tools than about statistics, I liked the intuitive and progressive way of explaining some statistical concepts. For example, it explains model fitting through a transition from Monte Carlo techniques to optimization to least squares. Still, MDSR delves deeper into statistical issues with a chapter on statistical foundations, including timeless hits such as outliers, confounding factors and the perils of p-values. In addition, while R4DS focuses on exploration and predictive models, MDSR goes further by also introducing supervised and unsupervised learning models, simulation and model evaluation.
Certainly, there are many similarities between the two books. They promote learning by running the example code, and through exercises that invite you to discover functionality not covered in the text. Both books, and most especially R4DS, defend the importance of writing clear code and of functional programming: you – and others – need to understand what the code is doing, and to avoid errors and unnecessary repetition. Also, both books stress the importance of communicating effectively through graphics and presentations. Still, the treatment is much more complete in R4DS, with several chapters dedicated to graphics and R Markdown, a framework for preparing reports and documents and for facilitating collaboration.
I cannot recommend one book over the other: they are complementary. R4DS describes the tools in such great detail that you start thinking you should also read Advanced R, another of Wickham’s books, to finally learn R properly. But MDSR has unique features, such as an outstanding chapter on professional ethics, and extensive references for further study where one can find all the extra details needed. They are both must-have titles for anyone working in the “sexiest job of the 21st century”.
About the books
Modern Data Science with R is by Benjamin S. Baumer, Daniel T. Kaplan and Nicholas J. Horton, published by Chapman and Hall/CRC.
R for Data Science is by Hadley Wickham and Garrett Grolemund, published by O’Reilly.
About the reviewer
Jordi Prats is an environmental modeller at Irstea, Aix-en-Provence, France.