Data science panel questions

I was on a panel the other day talking about data science careers (I guess there will be a link to it somewhere at some point, I’ll share it when I see it) and they’ve sent some questions out after the event because they didn’t get through them all.

I may as well answer them here where everyone can see them rather than just have the people who came read them (hmm… might be a general point about working in the open there somewhere 🤔😉). There’s quite a few so I’ll take a few goes to get through them all.

Q1. “To what extent is being an advanced user of Excel and experience of briefing/ presenting data to decision makers a good foundation for upskilling to data science?”

I don’t think using Excel to an advanced level helps at all. It’s a totally different mental model where each change is laboriously carried through cell by cell (which is why you always have loads of hidden tabs on complicated Excel applications). Using DATA to an advanced level clearly helps, whether it’s Excel, SPSS, or anything else. You need to learn (the hard way, usually) to ask the same basic questions every time. “Where did this data come from? Is it reliable? Is it complete? What does it look like if I plot it? Are there spurious correlations in it? What is my hypothesis as to why the data looks like it does? How can I *concretely* test my hypothesis against the data? Etc…”. I don’t want to get into an Excel rant because everyone’s heard it all before but I don’t touch Excel ever for any purpose other than totting up a column of numbers, and for reasons that I consider to be well reasoned and valid. YMMV.

Q2. “Does a data scientist need to be a good statistician?”

Definitionally, a data scientist needs to be better than any computer scientist at statistics, and the corollary applies- they will be worse than any statistician at statistics. Data science is a *very* big field, and some data scientists hardly touch statistics. Some do them all day. I think data scientists need to have a really deep understanding of some of the basic points I mention in Q1, and a lot of domain knowledge about the kinds of data found in their field (for example, there are lots of waits of 3 hours and 55 minutes recorded in A&E departments, for an obvious reason). And they need to understand the methods they’re using. If they’re using regression, they need to understand regression. I think at times all data scientists (including myself) are guilty of using methods that they don’t understand fully, and although you can get away with this for a while eventually you will slip up and look stupid. So do your homework and Keep It Simple, Stupid, would be my advice.
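To make the A&E example concrete, here is a minimal sketch of the kind of check that domain knowledge prompts: compare the number of waits recorded just under the four-hour standard with the bins below it. The data is made up for illustration, and the function names (`waits_by_bin`, `spike_ratio`) are hypothetical, not from any real package.

```python
from collections import Counter

FOUR_HOUR_TARGET_MIN = 240  # the four-hour A&E standard, in minutes


def waits_by_bin(waits_min, bin_width=5):
    """Count waits in fixed-width bins, keyed by each bin's lower edge (minutes)."""
    return Counter((w // bin_width) * bin_width for w in waits_min)


def spike_ratio(waits_min, threshold=FOUR_HOUR_TARGET_MIN, bin_width=5):
    """Ratio of the bin just below the threshold to the mean of the two bins before it."""
    bins = waits_by_bin(waits_min, bin_width)
    just_below = bins[threshold - bin_width]
    neighbours = [bins[threshold - 2 * bin_width], bins[threshold - 3 * bin_width]]
    baseline = sum(neighbours) / len(neighbours)
    return just_below / baseline if baseline else float("inf")


# Entirely made-up waits (minutes), with a cluster at 3h55m-3h56m
waits = [226, 227, 228, 229, 231, 232, 233, 234] + [235] * 6 + [236] * 6 + [241, 250]

print(waits_by_bin(waits)[235])  # waits recorded in the 235-239 minute bin
print(spike_ratio(waits))        # well above 1 suggests a pile-up just under the target
```

A ratio well above 1 doesn't tell you why the spike is there. It just tells you to go and ask someone who knows how the data gets recorded.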

Q3. “To what extent is becoming a data scientist ‘gatekept’ by having a statistical science degree or A level in mathematics?”

I love this question. My experience of the recruitment of data scientists in my industry (the NHS) is that it’s a complete mess. Often data scientists are recruited by people who are not data scientists, because there are not enough of them around and those that are often haven’t had time to work their way up the ladder. I have quite a hardline view on this. DS is gatekept by quantitative degrees but it shouldn’t be. As I said on the panel, I personally look for abilities in problem definition, communication, stakeholder engagement, and most importantly teamwork. People who can write version controlled code are ten a penny these days (I mean that respectfully, but it’s true). If you want to distinguish yourself, be kind, in other words. You do not need a quant degree to be a data scientist, I know because I have met many great data scientists without quant degrees, but you may find that the people recruiting you do not agree with this. YMMV

RAP – a 10 year journey

I found the earliest possible reference that I ever made to RAP in a document, part of an evaluation report I submitted nearly 10 years ago. At the time I wrote it I had hoped that it would be seized on by the people who read it as an obviously important innovation, but I can see now that change is not so easily won. Anyway, just for fun, here it is, June 25th 2012, written in Sweave before all that newfangled RMarkdown (even that’s not newfangled now! Quarto!).

“\subsection{A note on reporting methodology}

This report is \emph{reproducible}. This means that the tables and graphs within it are produced not by cutting and pasting from separate software packages but instead generated using open source computer code and automatically laid out and inserted into the document.

The advantage of working in this way is it allows the report to be re-produced given a different set of data in seconds instead of days. This report can therefore very easily be submitted in subsequent years, with the new data loaded into the program and all of the new tables and figures automatically generated. The text, naturally, would need to be re-written, although it may be felt that the analysis stands on its own and the whole exercise of reporting could therefore be carried out and completed in less than one day at zero cost, if the data were made available.”
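For anyone who hasn’t seen the idea in action, here is a minimal sketch of what “reproducible” means in practice. The original was written in Sweave (R and LaTeX); this is a Python stand-in rather than the code from the report, and the data, the group names and the functions `summary_table` and `render_report` are all hypothetical.

```python
from statistics import mean


def summary_table(rows):
    """Build (group, n, mean score) rows from (group, score) records."""
    groups = {}
    for group, score in rows:
        groups.setdefault(group, []).append(score)
    return [(g, len(s), round(mean(s), 1)) for g, s in sorted(groups.items())]


def render_report(title, rows):
    """Lay out the title and summary table as plain text, with no cutting and pasting."""
    lines = [title, "", f"{'Group':<10}{'n':>4}{'Mean':>8}"]
    for group, n, avg in summary_table(rows):
        lines.append(f"{group:<10}{n:>4}{avg:>8.1f}")
    return "\n".join(lines)


# Made-up data: next year's report is the same call with a new dataset
data_2012 = [("Ward A", 4), ("Ward A", 5), ("Ward B", 3)]
data_2013 = [("Ward A", 5), ("Ward B", 4), ("Ward B", 2)]

print(render_report("Evaluation 2012", data_2012))
print()
print(render_report("Evaluation 2013", data_2013))
```

The point is that the tables come from code run against the data, so a new year’s figures mean loading a new dataset and re-running, not rebuilding the document by hand.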

Here’s to another 10 years of innovation and change 🙌

Data science for dummies (Goldacre)

Building your own tools for data science is a pretty fundamental concept, and I think it’s fair to say that it’s totally alien to most NHS bosses. I shall henceforth be showing them this excellent section from the Goldacre report. It’s not about data science as such, it’s about TREs, but it illustrates the concept beautifully:

The data science team at the music-streaming service Spotify do innovative work with data that helps drive the usability and popularity of their subscription service. For example, they extract patterns in the listening behaviour across all their users, and then use this to provide individual users with tailored recommendations for other music they might enjoy. The Spotify data science team couldn’t buy, off the shelf, a data science environment specifically built to service the needs of “a global music streaming service”. They implemented standard off-the-shelf tools for a general purpose data science environment. Then, within that raw environment, they needed to build their own tools, analytic approaches, workflows, data preparation work, and so on. A new arrival in the Spotify data science team today will find modules of code, libraries and packages – some even with nice interactive interfaces – to help them find interesting new patterns in Spotify user data. Many of these tools will feel like part of the furniture, but they were all built by their predecessors in the Spotify data science environment. Furthermore, many of these tools will not have been built to a pre-determined specification, by software developers hired to do that work to order; rather, they will emerge from a team. A single analyst might painstakingly implement a one-off analysis; if it looks like the approach will have broader use, then a more experienced developer might offer to help package it up into a function or library, with good documentation; if it becomes a commonly used approach, they might work with other analysts to create an interactive tool

Better, broader, safer: using health data for research and analysis

(reproduced under the terms of the Open Government Licence)

We need more than “black box” systems (Goldacre)

The system should be cautious around imagining that it can push away the challenge of TREs – and all work with NHS data – by procuring ‘black box’ services. Building platforms, capacity and modern working methods for data is a complex technical challenge, requiring deep knowledge across a range of domains: data science, data architecture, and software development; but also clinical informatics, NHS data needs, health data research, and more. This work must be done close up with real users of data, constantly iterating to improve platforms and approaches. There is no single contract that can pass over responsibility for this work. These new and complex technical challenges around data must be met by building teams, tools, methods, working practices, code and platforms

Better, broader, safer: using health data for research and analysis

(text reproduced verbatim- under the terms of the Open Government Licence)

The prevalence of code sharing (Goldacre)

Another greatest hit from the Goldacre report, in the section on open working, “The prevalence of code sharing” (all material reused under OGL)

  • “ONS covid reports: the team was unable to find any analytic code for the platform or covid analyses (but extensive and excellent open code training elsewhere)
  • OpenSAFELY covid reports: all code for the platform, data management and analysis all shared automatically on GitHub (declaration of interest: BG is PI on OpenSAFELY)
  • PHE covid reports: the team was unable to find any analytic code for PHE reports on topics such as ethnicity and COVID-19; but extensive code sharing for their (excellent) COVID-19 dashboards
  • DECOVID (Turing / HDRUK PIONEER platform created for a wide range of covid research teams from a large number of universities): the team was unable to find code for the platform or analyses
  • ICODA (HDRUK’s flagship COVID-19 data analysis platform initiated in June 2020): the team was unable to find code for the platform or analyses (but also no outputs to date)
  • HDRUK / NHSD / BHF TRE: the team was unable to find code for the platform; but some scripts are shared for a paper describing the data accessible through it, and one research preprint (the platform’s only output to date)”

I love that they did this, but that’s not what I’m writing a blog post about. What I love is the assumption that you should be able to sit at your desk with a web browser and find this stuff. That’s what “open” should mean. I’m so sick to death of being told stuff is “open source” and then when I ask where it is they say “I’ll email you a copy”.

I plaster my code all over the internet. I shove it in the face of everyone I can think of to shove it in the face of, because I want people to see it. This “open source but you have to go to a webinar and then email me three times” is for the birds

The Goldacre review- greatest hits

I absolutely love the Goldacre review, I really can’t praise it enough, and I will be doing a lot of work based on it in the coming weeks and months, looking at it from the perspective of my own Trust and (with others) from the perspective of my ICS and NHS-R. NHS-R has a couple of repos looking at matters related to the review (statement on tools and NHS-R vision) and we need to do some thinking about the things coming out of this review that we can look at (which I have started doing on my own, community effort will follow).

Anyway, that’s all for the future. This blog post is just going to be very simple and just be Stuff I Particularly Love. As I’m reading it I occasionally find a bit which really resonates, and I thought it would be useful for me and possibly for others to record them as a kind of “Greatest Hits” outside of all of the other stuff I’m doing digesting it.

The Goldacre review is a blueprint for change, without doubt, that’s partly why I love it so much, and I intend to be that change and to push it forwards, but in the meantime this is just a bit of fun looking at all the best bits.

Make it ‘okay to ask’ about access to publicly funded code

“The team heard from several interviewees that at present it is commonly regarded as provocative to ask for access to the code used to implement an analysis, especially in some parts of the academic community, despite general positive statements on open working”

I absolutely love this. I think it is regarded as being provocative sometimes, and people can get quite defensive. Very often people assure me that the code is forthcoming but it never is. It will surprise nobody to hear that I ask for code whether it’s provocative or not, and so should you. If it’s public money, it’s public code, and I’m sorry if that makes you uncomfortable but I’m going to put my hand up at the end and ask. Every. Single. Time. So get used to it

Pseudo-open working

“As a consequence of growing support for open working, there are now individuals and organisations who state that they support open methods, but do not do so; or create only the appearance of open working. During this review the team encountered examples of very senior and influential leaders extolling the virtues of open working, where their published papers from the pandemic in 2021 do not contain code, and require that interested parties contact them personally to negotiate access to the data dictionary codelists used to define the variables used in the analysis”

I almost whooped with delight when I read this. I’ve encountered this so many times and it makes me feel very cross. I hear so many warm words, and I see so many people extolling the virtues of openness, but very often it’s just a sham. They’ve no intention of sharing anything and they wouldn’t know an open licence if it bit them. I have more respect for the naysayers, at least they’re honest.

That will do for this post. I’m sure there will be more, and I’ll add more posts as I go.

In defence of the ordinary

I think this might be a general cultural problem, but I notice it a lot in my field of healthcare analytics. There are lots of “case studies” and “pathfinders”, that kind of thing, groups of people who are doing amazing stuff.

I don’t think that it has the desired effect though. People look at these groups doing incredibly complicated things with new tools and they think it just doesn’t apply to them. Let the pathfinders pathfind and we’ll just churn out some rubbish in Excel, same as last week.

I wonder if we’d be better off setting the bottom higher, rather than raising the bar at the top again. Instead of showing off amazing teams, showcase a solid team getting the basics right, week in, week out. And try to pose the question “If you’re not doing this stuff, why not?”

In defence of looking at jobs

Nobody should ever have to apologise for looking at jobs, however settled they are. Looking at jobs just means one of:

  1. there might be a better job out there and I want to find it if there is
  2. I want to understand what skills people are recruiting for so I can learn and do the right stuff
  3. I’m interested in the jobs and skills that are popular now because I want to understand the playing field of human resources in analytics as an actual or potential leader
  4. I want to understand my worth under Agenda for Change so I can advocate for the development of my professional role (and my salary 😉) in my current role

Any manager that doesn’t like that is a fool, a charlatan, a psychopath, or all three, and then you can add:

5. My line manager is a fool, charlatan, or a psychopath (or all three), and I am trying to leave my current job as fast as possible now I have realised this

In defence of holiday working

This might be a little bit controversial, this post. If you hate it please don’t be offended, it’s just my opinion. Totally open to being completely wrong always, especially here.

I’ve heard a lot of people talk about the importance of taking your leave, and encouraging your staff to take their leave, and I do think it’s super important to take leave, and I certainly think if you’re managing someone you should definitely encourage them to take their leave. There is more to life than work, and having your leave can even make you better at your job when you get back, so it’s win-win, you benefit and your job benefits.

However, I’ve had people tell me that it’s wrong that I sometimes work in my leave, because I’m setting a bad example to my staff. This I don’t agree with. Particularly now during COVID but really for the last decade my life has been limited by health problems (primary sclerosing cholangitis, chronic ulcerative colitis, and now hepatic arterial stenosis, but that’s another blog post) and my job has become something I can do that I love that I’m good at, replacing some of the other stuff I used to do (aikido and running, mainly). I’ve been isolating from my whole house on and off since March 2020 and my job really keeps me sane.

So although I may manage people I’m also a human being with my own emotional needs, and I choose to work sometimes in holidays and weekends because it makes me feel good to work and be productive. My whole team completely recognise my position and don’t at all feel any pressure to do likewise.

So I don’t agree with that. The other thing that I don’t agree with that I see sometimes is the idea that we should view others in our team or whom we manage working in the holidays as something “bad” that we need to “save” them from. The best boss I ever had was absolutely brilliant at taking leave. He would work 50+ hours a week and then in the holidays he would just disappear completely. No phone calls, no email, nothing. It was like he didn’t exist. And that was great, and I think it’s a really good example of doing something that works for you (ignoring the 50+ hours, not really advocating that, but he did work hard, put it that way).

But just like me I think some people blend their work in a bit more. They like to tickle their brain a bit in their holidays, maybe write some code that’s been bugging them, do a bit of reading, or write a blog post (writing blog posts absolutely should be part of your paid hours). It avoids the cliff edge stress of the last day before your holiday, because you can think “never mind, I’ll just knock that out in the middle of next week and email it over”. And if you then take a bit of time back for yourself in your core hours, that’s all good too. Because that’s one thing I am good at. I may work funny hours and at the weekend and in my leave but if I feel like having a three hour lunch break with a friend and nobody will miss me I do it without hesitation or guilt and that works really well for me.

So that’s it. I’m not saying “everyone should work in the holidays”, or even “you should try working in your leave, it’s great”. But I am saying that everyone is different, and some people like to blend their work and home lives together for lots of good reasons. As always, what’s important is understanding the needs of the people around us and respecting their autonomy and not coming up with social media friendly soundbites about what you should and shouldn’t do with yourself or your team.

In defence of eating

One weird thing in remote/ pandemic times is that people have started turning their cameras off when they eat. I’m all for the right to turn your camera off, I think you should be able to do that at any time without giving a reason, but I think it’s a shame if people think that they can’t eat on camera. I have therefore been glad to eat several large, difficult to eat things in meetings on camera recently, to perhaps give others the idea that it’s okay if they want to do it too.

So far I’ve eaten a footlong Subway which was kind of falling apart really and today it was half of a pretty massive pizza which was also a bit lacking in the structural integrity stakes. So please join me if you wish, we all used to eat in front of each other in The Before Times, it’s totally natural and normal- but also if you don’t want your camera on because you’re eating or for any other reason that’s okay too.