Book review: Doing Data Science: Straight Talk from the Frontline

Doing Data Science: Straight Talk from the Frontline, by Cathy O’Neil and Rachel Schutt
Publisher: O’Reilly Media

Full disclosure: I received a free electronic copy of this book to review as part of O’Reilly’s Reader Review Program

This book seems to be everywhere at the moment, largely, I imagine, because there’s a gaping hole in the market for such a book. Indeed, there seems to be a gaping hole in the market for everything to do with data science, from books and courses to the actual people themselves. Famously, the world toiled away without even a Wikipedia entry for data science until the relatively recent past.

Data science is a vast and complex field and this book is a lighthearted but serious attempt to describe and delimit the science both by describing the methods of data science (web scraping, data munging, linear regression, machine learning) and also by describing the actual work and practices of real data scientists.

After a couple of initial chapters of scene-setting and basic concepts, each chapter features a guest contributor who shares their particular work and the tools they employ. They’re quite a diverse bunch: people who have worked in startups, people from big tech companies, data artists, and Kaggle winners, amongst others. Despite the diversity of subject matter, the book manages to keep an even tone, and the chapters are linked together in a coherent fashion. I congratulate the authors on this particularly, because I imagine it was a very difficult task that will likely go under-appreciated. Individuals of a certain persuasion, i.e. anyone who agrees with Jeff Hammerbacher when he says “The best minds of my generation are thinking about how to make people click ads… That sucks.”, will perhaps be a little disappointed here, because a decent portion of the material relates to click-through rates, A/B testing of ads, and so on.
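To make the A/B testing idea concrete: it usually boils down to comparing click-through rates between two ad variants. The sketch below is mine, not the book’s, and the numbers are invented; it shows a standard two-proportion z-test in plain Python.

```python
import math

def ab_ctr_z(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-statistic comparing click-through rates.

    A positive z means variant B's CTR is higher than variant A's.
    """
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled CTR under the null hypothesis that the variants are identical
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / se

# Invented numbers: variant B gets 60 clicks in 1,000 views vs A's 40 in 1,000
z = ab_ctr_z(40, 1000, 60, 1000)  # roughly 2.05, significant at the 5% level
```

A z-statistic beyond about 1.96 would be declared significant at the conventional 5% level, which is the kind of decision rule driving the ad experiments the book describes.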

Nonetheless, speaking as somebody who thinks of data science more in terms of what it can offer the collective good of humanity than the pockets of shareholders, I found the material of interest, and in most cases it’s quite clear how these tools could be repurposed, for example in my own field of health. It goes with the territory a bit, really: a lot of the innovation in big data has come from millions of clicks, user profiles, and their ilk, and I suppose I could grudgingly concede that my own field has benefited from these methods and technologies.

Putting that to one side, it’s hard not to love this book, and it’s easy to recommend to anyone with an interest in this area. Everyone will approach the material with their own perspective, and it’s fascinating to read such a wide-ranging tour of the types of task and the methods used in modern data science. My own training is a quantitatively focused PhD in psychology, and I make a lot of use of regression models in my work. I confess, my background is strictly frequentist, and although I try to think like a Bayesian I certainly can’t analyse data like one. So I felt right at home in all the regression-based chapters, enjoyed reading about the other types of task to which regression can be put, and was just a tourist in some of the other chapters: random forests, k-nearest neighbours, and so on. Data science does embody a particular worldview, and it was very interesting to read about everyday problems I deal with as an analyst (such as covariates) from a different perspective.
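For readers less at home in regression-land: the machinery underneath those chapters is ordinary least squares. As a minimal illustration of my own (not code from the book), fitting one predictor needs only a few lines.

```python
def ols(xs, ys):
    """Ordinary least-squares fit for a single predictor.

    Returns (intercept, slope) minimising the sum of squared residuals.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Noise-free toy data generated from y = 1 + 2x, so the fit recovers it exactly
intercept, slope = ols([1, 2, 3, 4], [3, 5, 7, 9])
```

Adding covariates turns the sums into matrix algebra, but the idea, minimising squared residuals, is the same one running through the book’s regression chapters.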

I must also commend the book on its refusal to patronise its audience in any way. The tone is occasionally a little light for my taste, but this is really very subjective, and some readers will appreciate the lighthearted use of language. If the authors want to show you an equation, however, they show it to you. If you need to use some calculus, they’ll show you the calculus. Bash scripts, Python, R, even snippets of Go: all are thrown in wherever they’re appropriate. The authors are clearly confident that their readership will either make sense of it all or be motivated to find out, and it’s a pleasure to read. For my part, of course, I was totally at home with the R code and the statistical material, sort of okay reading Bash and Python, and, to be honest, a bit lost with some of the maths. But I found I could breeze through: the fundamentals are all explained in natural English, and the whole book left me with a severe itching to do a calculus course.

Overall, this book is an important addition to the canon of work describing the methods and philosophy of data science: a great mix of the how and the what, all presented in an engaging and intellectually stimulating way from the perspective of some of its leading lights. It’s the kind of book that asks more questions than it answers, but you can hardly complain when it asks them in such an interesting way, and there can’t be too many books out there that infect their readers with the compulsion to study calculus.

A link to the product page is here.

Boss of Year of Code can’t code: do we all need to be able to code?

As many of you will be aware, the Twittersphere was recently ignited by Lottie Dexter (the Year of Code boss can’t code) and her car-crash performance on Newsnight. I don’t have much to add to what has already been said, apart from to note with sadness yet another typical example of a public servant being involved with technology of which they have no comprehension.

Jeremy Paxman expressed cynicism within the interview regarding the idea that all people need to be able to code, and by extension that all children need to be taught coding. I think this betrays a misunderstanding of what it means to “be able to code”. I would like to illustrate this with a couple of examples.

I recently finished Doing Data Science (my review is forthcoming on this blog), which is quite a fascinating tour not only of the field of data science, the domains in which it is used, and the uses to which it is put, but also of the people who call themselves data scientists. They’re a very diverse bunch, and their ability to code is highly variable. Some are hardcore computer scientists with years of experience optimising C++ implementations of machine learning algorithms. Some are reasonably competent, able to produce efficient and accurate Python scripts and munge a bit of data on a Linux command line. And some, by their own admission, are terrible programmers: they can hack together a few R scripts, and their skills lie elsewhere. But it’s pretty clear from reading what they do and how they use computers and code that all would have benefited from learning coding at school (no doubt, some of them did).

For my own part, I’m a terrible computer programmer. I make no secret of that. I have been teaching myself out of books for five years and have now reached a point where I’m a competent R programmer (but no more) and can write a bit of Java and JavaScript, as well as being just about okay at HTML and LaTeX. A lot of my skills lie within my psychology PhD (in which I made extensive use of mixed-effects models) and my long experience working within health care settings, which has taught me a lot about how to interact with staff and service users and how best to serve them with data. There’s no doubt in my mind whatsoever that I would have benefited from learning to code at school.

Even further down the coding food chain, I interact with a lot of people who use data and spreadsheets every day or every week who cannot code. They cannot write macros, they certainly can’t write Visual Basic scripts, and they do, in general, waste fantastically large amounts of time fiddling around with data when a macro or Visual Basic script (or R, or Python…) could do the heavy lifting for them in seconds.
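To make that concrete, here is the kind of “heavy lifting” I have in mind: totalling a column by category, a job I routinely see done by hand-filtering and retyping. The clinics and figures below are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# A toy "spreadsheet" as CSV text; in practice this would be read from a file
raw = """clinic,month,referrals
North,Jan,12
North,Feb,15
South,Jan,9
South,Feb,11
"""

# Total referrals per clinic, in four lines rather than an afternoon of fiddling
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["clinic"]] += int(row["referrals"])
```

With real data the same loop handles a hundred thousand rows as happily as four, which is exactly the point: the computer does the tedium, in seconds, every time.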

And even further down, right at the bottom, there are people who don’t use spreadsheets, don’t analyse data, and don’t really have occasion to use macros, scripts, markup languages, or anything else. Would they benefit from learning to code at school? I think they would. Jeremy Paxman suggests that we don’t all need to code, and that’s true. We don’t all need to write production-quality C++ code, develop Android applications, or design PHP/MySQL databases. However, we don’t all need to build bridges or write novels or prove the Riemann hypothesis either. We still learn physics and English and maths. Knowingly or unknowingly, all these individuals interact with code and data, and they make purchasing and HR decisions that relate to code and data. I’ve been to meetings full of individuals who are all desperate to make use of new technologies, to make purchasing decisions about them, and to hire people who can use them, but who have absolutely no understanding of how complex a given task is, how it would be achieved, or what the different options are (local or server-side implementation? Proprietary or open source? Database or spreadsheet?).

In the last five years I’ve learned to ask the same questions again and again of any data process: can a computer do this faster than a human being? Can a computer do this more accurately than a human being? Can a computer do this more cheaply than a human being? What are the tradeoffs between speed, accuracy, and cost if we assign different parts of the task to computers and human beings?

I ask these questions of any process and I teach the people around me the answers to these questions as well as encouraging them to begin to ask these questions themselves. In the NHS, at least, these questions are fundamental to the working lives of many of the staff that I interact with. They cannot begin to ask these questions without some basic understanding of the way that computers work.

So, to answer Jeremy Paxman’s question, we don’t all need to be able to code, just like we don’t all need to make cars or operate nuclear power stations. But a grounding in the fundamentals of computer science is essential in today’s workforce. Predictions about the future are always wrong, my favourite perhaps being:

“Where a calculator like the ENIAC today is equipped with 18,000 vacuum tubes and weighs 30 tons, computers in the future may have only 1,000 vacuum tubes and perhaps weigh only 1.5 tons” (Andrew Hamilton, 1949).

However, it seems sure that the foregoing will remain true for the next 30 years at least.