Data science for human beings

Someone just emailed me to ask me about getting into data science. They knew all the usual stuff, linear algebra, Python, all that stuff, so I thought I’d talk about the other side of data science. It’s all stuff I say whenever I talk about data science, but I’ve never written it down so I thought I may as well blog it.

There are three things that are probably harder to learn that will make you stand out at interview and be a better data scientist.

First. Understand your data. I work in healthcare. Some of the data I work with is inaccurate, or out of date, or missing, or misspelled, or just plain wrong. It’s my job to understand these processes. There’s a saying in healthcare, that goes something like “60-80% of apparent differences in healthcare providers are related to different practices with data”

Second. You work in a team. With other data scientists, and with the wider organisation. Don’t be a hero coder, go off to your bunker and write this piece of genius that nobody understands and nobody wants. Work agile. Get buy in as you go. Mentor people. Be humble. Listen to what people want. Don’t do analysis because it’s flashy and cool. Build what your users want. Get to know them. Understand UX and UI. There’s a saying I like that goes “We’re all smart. Distinguish yourself by being kind”. Be a team player. Share the glory, and the blame.

Third. Have an opinion. Don’t just learn every method going and apply them according to whatever posts are saying this month. Scan the horizon. Find new stuff. Dig out old stuff. Think critically about the work that you and others are doing. And sometimes, when you can, go in large. So far in my career I’ve bet large on Shiny and text mining, and they’ve both paid off for the organisation I work for and for me. My latest pick is {golem}. I think it’s going to be massive and I want to be near the front of the pack if and when it is. Trust yourself. It’s your job to support your organisation with their priorities, but it’s your job to know stuff they don’t know and to push your organisation along a bit. I’ve never done anything really significant in my career that somebody asked me to do. I’ve pitched all my really significant projects, although obviously I spend most of my time building stuff people want and ask for (see point two).

NHS data science and software licensing

I’m writing something about software licensing and IP in NHS data science projects at the moment. I don’t think I ever dreamed about doing this, but I’ve noticed that a lot of people working in data science and related fields are confused about some of the issues and I would like to produce a set of facts (and opinions) which are based on a thorough reading of the subject and share them with interested parties. It’s a big job but I thought I’d trail a bit of it here and there as I go. Here’s the summary at the end of the licences section.

The best software licence for a data science project will vary case by case, but there are some broad things to consider when choosing one. The most important decision to make is between permissive and copyleft licences. Permissive licences are useful when you want to maximise the impact of something and are not worried about what proprietary vendors might do with your code. Releasing code under, for example, an MIT licence allows everybody, including individuals using proprietary code, a chance to use the code under that licence.

Using a copyleft licence is useful when there is concern about what proprietary vendors might do with a piece of code. For example, vendors could use some functionality from an open source project to make their own product more appealing, and then use that functionality to sell their product to more customers, and then use that market leverage to help them acquire vendor lock in. Vendor lock in is the state in which using a company’s prodcuts ensures that you find it very difficult to move to another company’s products. An example might be using a proprietary statistical software package and saving data in its proprietary format, making it difficult to transfer that data to another piece of software. If proprietary software companies can use code to make the world worse in this way, then choosing a copyleft licence is an excellent way of sharing your code without allowing anybody to incorporate it into a proprietary codebase. Proprietary software companies are free to use copyleft licensed code, but are highly unlikely to do so since it means releasing all of the code that incorporates it.

The Great Survey Munge

As I mentioned on Twitter the other day, I have this rather ugly spreadsheet that comes from some online survey software that requires quite a lot of cleaning in order to upload it to the database. I had an old version written in base R but the survey has changed so I’ve updated it to tidyverse.

And this is where tidyverse absolutely shines. Using it for this job really made me realise how much help it gives you when you’ve got a big mess and you want to rapidly turn it into a proper dataset, renaming, recoding, and generally cleaning.

It must be half the keystrokes or even less than the base R script it replaces. There are some quite long strings in there, which come from the survey spreadsheet, but that’s all just cut and paste, I didn’t write anything for them.

I love it profoundly, and I bet if I was more experienced I could cut it down even more. Anyway, here it is for your idle curiosity, it is obviously of no use to anybody who isn’t working on this data and to be honest there are bits that you probably won’t even understand, but just a quick look should show you just how much I had from the wonderful tidyverse maintainers