Book review
Doing Data Science: Straight Talk from the Frontline By Cathy O’Neil, Rachel Schutt
Publisher: O’Reilly Media
Full disclosure: I received a free electronic copy of this book to review as part of O’Reilly’s Reader Review Program.
This book seems to be everywhere at the moment, largely, I imagine, because there’s a gaping hole in the market for such a book. Indeed, there’s a gaping hole in the market, it seems, for everything to do with data science, from books and courses to the actual people themselves. Famously, the world toiled away without even a Wikipedia entry for data science until the relatively recent past.
Data science is a vast and complex field, and this book is a lighthearted but serious attempt to describe and delimit the science, both by describing the methods of data science (web scraping, data munging, linear regression, machine learning) and by describing the actual work and practices of real data scientists.
After a couple of initial chapters that set the scene and describe some of the basic concepts, each chapter features a guest contributor who shares their particular work and the tools they employ. They’re quite a diverse bunch: people who have worked in startups, people from big tech companies, data artists, and Kaggle winners, amongst others. Despite the diversity of subject matter, the book manages to keep an even tone and the chapters are linked together in a coherent fashion. I congratulate the authors on this particularly because I imagine it was a very difficult task that will likely go under-appreciated. Individuals of a certain persuasion, i.e. anyone who agrees with Jeff Hammerbacher when he says “The best minds of my generation are thinking about how to make people click ads… That sucks.”, will perhaps be a little disappointed here, because a decent portion of the material relates to click-through, A/B testing of ads, and so on.
Nonetheless, speaking as somebody who thinks of data science more in terms of what it can offer the collective good of humanity than the pockets of shareholders, I found the material of interest, and it’s quite clear in most cases how these tools could be repurposed, for example in my own field of health. It goes with the territory a bit, really: a lot of the innovation in big data has come from millions of clicks, user profiles, and their ilk, and I suppose I could grudgingly concede that my own field has benefited from these methods and technologies.
Putting that to one side, it’s hard not to love this book, and it’s easy to recommend to anyone with an interest in this area. Everyone will approach the material with their own perspective, and it’s fascinating to read such a wide-ranging tour of the types of task and the methods used in modern data science. My own training is a quantitative PhD in psychology and I make a lot of use of regression models in my work. I confess, my background is strictly frequentist, and although I try to think like a Bayesian I certainly can’t analyse data like one. So I felt right at home in all the regression-based chapters and enjoyed reading about the other types of tasks for which regression can be used, but I was just a tourist in some of the other chapters: random forests, k-nearest neighbours, and so on. Data science does embody a particular worldview, and it was very interesting to read about everyday problems I deal with as an analyst (such as covariates) from a different perspective.
I must commend the book, also, on its refusal to patronise its audience in any way. The tone is occasionally a little light for my taste, but this is really very subjective and some readers will appreciate the lighthearted use of language. If the authors want to show you an equation, however, they show it to you. If you need to use some calculus, they’ll show you the calculus. Bash scripts, Python, R, even snippets of Go: all are just thrown in there wherever they’re appropriate. The authors are clearly confident that their readership will either make sense of it all or be motivated to find out, and it’s great to read. For my part, of course, I was totally at home with the R code and the statistical material, sort of okay reading Bash and Python, and, to be honest, a bit lost with some of the maths. But I found I could breeze through, because the fundamentals are all explained in natural English, and the whole book left me with a severe itch to do a calculus course.
Overall, this book is an important addition to the canon of work describing the methods and philosophy of data science and is a great mix of the how and the what of data science, all presented in an engaging and intellectually stimulating way from the perspective of some of its leading lights. It’s the kind of book that asks more questions than it answers, but you can hardly complain when it asks them in such an interesting way, and there can’t be too many books out there that infect their readers with the compulsion to study calculus.
A link to the product page is here.