Shoddy data

You know, naming no names because I’ll get in trouble, but someone somewhere has paid for some data from a proprietary vendor, and that vendor is shipping absolutely unusable garbage.

They won’t fix it because “no-one else has complained”.

HAVE SOME PRIDE IN WHAT YOU’RE DOING. How about that? How about fixing it because it’s an embarrassment? I’m a random idiot in some random NHS Trust in the countryside and I couldn’t sleep if my database looked like what you’re sending.

I swear on my life the NHS could do 99% of this stuff better itself if we just dug in and had a go. Everyone’s in thrall to these people with expensive watches and glossy brochures but it’s all a confidence trick.

Managing data science teams in the trenches

We’re systematically devaluing management and leadership in the analyst space, and I encounter a lot of people who think that a day with no code written is a day wasted. I do write code, but I do a lot of other valuable stuff too. I’ve written this post, as someone who is very new to managing people, to give my personal opinion, so just like my posts on proxied SSL with Apache, there is your caveat emptor 🙂

The people I manage like to get in the zone and write code. I think of it as like they’re digging a trench. They’re not getting out of the trench, they’re not sticking their head out, they’re just digging. And it’s efficient, they’re digging fast, and it’s rewarding too. They’re working on interesting problems and solving them and it feels good. My job is just to stick my head in the trench occasionally.

“What are you digging there? Oh yes, I like it, nice. You know, you could use that other tool over there on that bit. And actually, have you thought about trying this other technique? Let me show you, pass me the chisel. Okay, great. Let me know if you get stuck”

next trench… “Hey, that looks good, what is it? I like this trench, I like this bit. You know, you’re actually going the wrong way, though, you’ve gone off at an angle. Come up here and I’ll show you. See that trench over there? We’re trying to get there. Just jump back in and I’ll help you get it pointing in the right direction”

next trench… “Hey, this trench is taking ages, isn’t it?! We need to finish it soon, really. I think we maybe need to not finish some bits of it for now. Which are the really important bits, do you think? Okay, let’s keep this bit. This other bit, I know you love it, and I love it too. It’s really clever what you’ve done. But we can’t ship on time with it. I think we need to come back to it later, or maybe just chalk it up to experience and maybe we can do it next time”

This is pretty basic stuff, really. It feels a bit silly to have to spell it out. The point I’m trying to make is that the people getting in the trench occasionally are helping the people in the trench stay in the trench and that’s good for them and good for the project. And that’s my test for all the stuff that we do. Code review. Standup. Appraisals. Hack days. Overall, is it helping them stay in the trench, digging quickly and solving problems, or is it just a silly distraction?

I’ve never in my life had a manager who has the slightest idea what autoregression is or how to test the assumptions of OLS regression. I’m really happy now that I do manage people and that I do know what those things are, and I hope the people I manage can benefit from my understanding of what they’re doing (even if they usually know way more about it than I do- I can at least keep up if they explain it to me).

(if you’ve read my previous post and you’re paying attention you’ll realise that I’ve just contradicted myself by saying firstly that analysts should do more than just write code all day, and then said in this post that a manager’s job is to help analysts write code all day. Forgive me. In reality the people in the trench have got other people in the trench, or possibly in a neighbouring trench, and it’s their job to stick their head in that trench and help that person write code instead of needing to get out of their trench, before coming back to their own trench. Indeed, really, the trench visitor is in fact in their own trench, and someone else sticks their head in there occasionally, but the analogy gets a bit convoluted and silly at this point. I think you get the idea, even if it doesn’t translate to actual trenches within trenches within trenches)

What do we want from senior analysts in the NHS?

I’ve been meaning to write this for ages and I’ve just had a rather silly conversation on Twitter about managers and leaders in the NHS which has inspired me to get on with it. I think most people are agreed that we have A Problem with managers and leaders in the analytic space. Much ink has been spilled on the subject, and I think that the two main conclusions of all this ink are broadly accepted by everyone. They are, briefly:

1: Managers often lack the analytical training and experience to critically engage with the data and analyses with which they are presented (or indeed, to do analyses well themselves) and this leads to decision by anecdote and hunch.

2: Analysts lack a clear career track that rewards them for developing their analytic skills rather than for their becoming managers, unlike, say, surgeons, who can be very successful by just being very good surgeons and aren’t expected to become managers (although there’s a bit of nuance even in there that I’m getting to so stick with me if you’re thinking I’m talking rubbish already)

I’m not really going to talk about point number one except to say that in my view part of the solution to this problem is to have analysts in more senior positions, including at board, and by that I mean actual analysts, not managers of analysts. Many people with much more knowledge and experience (and credibility) than me have said this already, so you didn’t really come here to read that. All this does relate to point number two, however, which I will talk about now.

Having advanced some way in my career as an analyst, I can say that my experience absolutely fits with the general point that people make about analyst careers. Quite honestly, there wasn’t really a career structure that I fitted into when I finished my PhD in 2009, and I’ve sort of bobbed around doing different stuff. I’ve never looked at a job and thought “That’s the job for me, I’m applying for that”. I’ve pretty much just made it up as I went along.

The thing I’m not sure about, however, is this idea that we need to reward analysts just for their analytical skills. Sometimes when I talk to people about this I get the idea that they’re promoting the idea that we’ll have these Python geniuses just sitting in a box doing linear algebra and deep learning and advancing up the career track like that. To me I think that misunderstands what we want from analysts in the NHS. I believe the likes of Google and Facebook do pay some very, very clever people to just work on really high end statistical and computational tasks, and they use the methods that these people are producing to increase revenue. I think we in the NHS look at that and we imagine that we can make that work for us. But the NHS is not Google, and most analytical teams are far too small to support an individual working that way. There may be scope to employ people in that capacity in the national bodies, which are much larger. I don’t claim to know much about that. But speaking as someone who works in a provider trust and who has worked with people all over the country in trusts and CCGs, I’m pretty confident in my belief that we actually want to reward people to do other stuff than just clever Python and machine learning.

We do want to employ people with high end technical skills. Source control. Statistics. Machine learning. But once they’re good at that stuff we want more from them. I don’t want them sitting in a box writing code all day. That is a much too narrow definition of working effectively. They need to be training. Mentoring. Doing high level design. Understanding the local, regional, and national landscape. Writing and winning funding applications. Even if they’re not line managing anybody they need to be aware of what more junior colleagues are doing, helping them prioritise and understand their work, and managing projects so they’re delivered on time and to spec.

And therein lies the heart of what I consider to be the mythology around this issue. This stuff is hard. Universities are churning out people who can write Python for data science at a rate of knots now. We’ve got it the wrong way round. Yes, those people at Google are terribly clever, and I couldn’t do what they do in a thousand years. But this stuff is hard too. Teamwork. Mentoring. Communication. Understanding what to do where, with whom, and how. These skills are incredibly valuable. Recruit to and reward technical skills, by all means. But ask yourself if you want more from your team members.

SQL versus analytic tools

From a tweet from my NHS-R buddy John MacKintosh

“Two schools of thought :
1. Do as much as possible using SQL (joins/ munging/ calculations and analysis) so that your analysis is portable between different BI / Viz tools
2. Do the basics (e.g. joins and simple aggregates) in SQL, calcs and analysis in front end tools”

This is a great question, and I don’t mean to detract from its greatness when I say it is over-simplistic. It’s an important question, and I have a lot to say on the subject, so much so in fact that I’m going to answer a tweet with a blog, which is very on brand for me 😄.

When we think of data scientists and data engineers, we tend to think of data engineers providing beautifully normal and rapid data structures and data scientists wrangling them with a mix of SQL and R/ Python/ whatever. But in my experience it doesn’t really work cleanly that way, and nor should it.

An important factor is the way that the data is organised and documented. People in the NHS often get upset about all the different datasets that we use. A common complaint is that they don’t link up easily. People who don’t work in data think that the Trust has this big database of everything, and you just go and look up the thing you’re interested in, and it’s right there indexed against every other piece of data, and off you go. This notion is hilarious to anybody who actually works with data.

It’s not because the data is bad or because it’s not being looked after properly. It seems to me that databases, far from being perfect Platonic representations of the Trust, are in fact opinionated. You tend to find that they do the thing that they were originally designed to do very well. The payroll database is really good at paying people. The EHR is really good at, well, being an EHR. Data scientists, by our very nature, always want to do something that the database was not designed to do. That almost seems to me to be an actual definition of data science: taking a database and making it do something else. If your database has already done what you’re doing, chances are you don’t need a data scientist. Just stick one of those auto ML things on it or point PowerBI at the cube. Boom. Instant insight.

So what does this all mean for the question? It means that the data engineers and data scientists are on the same team. They’re not just throwing stuff over the wall at each other. The data scientists are customers of the data engineers, and we couldn’t do a thing without them, but we can be smart customers. We might want the data in a different form, sure. We might prefer it if some of the joins were done in the backend to save us a bit of work and to help us all work together. But we can give stuff back. We might use an algorithm to predict the risk of rehospitalisation at 28 days, say. Once we’ve done that we can probably use that calculated value more quickly if it’s productionised in the DB. And if it is productionised in the DB that means everyone can access it- not just the data scientists but also the data engineers and all of their customers.
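To make that concrete, here’s a minimal sketch of the rehospitalisation example. All the table names, column names, and the scoring rule are hypothetical, and an in-memory SQLite database stands in for the Trust’s warehouse:

```r
library(DBI)

# Stand-in for the real warehouse: an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "admissions",
             data.frame(patient_id = 1:4,
                        prior_admissions = c(0, 3, 1, 5)))

# A toy 28-day rehospitalisation risk score (in real life this would come
# from a fitted model, not this arithmetic)
admissions <- dbReadTable(con, "admissions")
risk <- data.frame(patient_id  = admissions$patient_id,
                   risk_28_day = pmin(1, admissions$prior_admissions / 10))

# Productionise the score in the DB so everyone- data engineers and their
# customers included- can join against it in plain SQL
dbWriteTable(con, "readmission_risk", risk)

high_risk <- dbGetQuery(con, "
  SELECT a.patient_id, r.risk_28_day
  FROM   admissions a
  JOIN   readmission_risk r ON a.patient_id = r.patient_id
  WHERE  r.risk_28_day > 0.2")
```

The point is the last query: once the score lives in the DB, nobody needs R (or the data scientist) to use it.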

Data scientists and data engineers working together can take a database that does one thing and turn it into a database that does two things. And they can turn that database into something that does three things. And the whole time the data engineers are keeping it all fast, scalable, legible, and accurate. It’s early days for us, working this way. I’ve read about it many times, but we’re working with our friends in data engineering and I hope one day to tell our story and how we’ve all worked together to produce lots of insights and delivered them all around the Trust.

To answer the question, then, I would say that if you want to go fast, use analytics-based tools, and if you want to go together, take the time to port all of the insight to SQL-based tools. And given the state of analytics in the NHS as it stands today, I would recommend we all go as fast and as together as possible. I know we at Nottinghamshire Healthcare are.

Building generic Shiny applications with a data module

We’re rebuilding our patient experience dashboard at the moment, partly to incorporate some of the work that we’re doing on text mining and partly to connect it up to other types of information we have like staff experience and clinical outcomes. It has to be reusable because we’re using it as the basis of the text mining dashboard that we’re building for any provider trust to use with their friends and family test data. We’re trying to make everything reusable anyway partly because the different bits of the NHS should all cooperate and produce stuff together that everyone can use and partly because we’re realising that when you make code reusable the first person who can reuse it is you, 6 months later on a similar project.

So we have a patient experience dashboard that we want to hook up to our other data in our own context, and we have other trusts who want to take what we produce and hook their stuff into it. An obvious approach is to use modules in Shiny and have a data module at the heart of the operation, which can take account of the different datasets available in each context. The next level would be a “summary type” module which would bring together all of the inputs and outputs for a particular type of data- one for patient experience, one for staff experience, etc. And the bottom level would be submodules which would carry out the individual tasks- one module to produce graphical summaries, one to produce downloadable reports, etc. We would use the {golem} package for all the reasons that we normally use {golem} (help with dependencies, packaging, testing, and deployment).

The basic structure would look like this:

  • Data module. This module would be responsible for loading ALL data: patient experience data and any other datasets that the person deploying the application might want- clinical outcomes, staff experience, whatever it might be. This module would contain upload and download features as appropriate (users may wish to upload their own data or download processed data) but it would in the main merely draw from databases, process the data into a useful form, and prepare it to be imported into the other modules.
  • Function module. Each of these modules is responsible for a particular type of data (staff, patient, clinical outcome). In some trusts the dashboard wouldn’t have access to data of all of the necessary types and these modules would not show in these contexts.
  • Sub modules. Each function module would have access to submodules which do different tasks- draw graphs, make reports, etc. Because some types of data are quite similar (e.g. staff and patient experience data) there is potential to reuse these modules in some areas.

Although this is a nice approach there are lots of different ways of achieving this and I’m hoping for a bit of guidance as to which might work best.

  1. “Automatic dictator” data module. This would be a data module that works with no instruction from the person deploying the application and which instructs all the other modules. This data module would simply attempt to load all of the data, catch any errors that result from doing so, report them to the user (in an unobtrusive format- “No staff data found”) and then instruct which of the function modules should appear. Further, it would export not only dataframes but also a set of instructions to the function modules as to how to carry out their processing. For example, there may be just one or several free text questions in a patient experience function module. The data module would determine that and would export with the dataframe a description of the dataset for the use of each function module.
  2. “Automatic” data module. As above, except it would not export instructions, but just data. Function modules would be responsible for understanding the shape of the data and would instruct their submodules accordingly.
  3. “Programmable” data module. This data module would accept a list of options within the run_app() function and would therefore be set up individually within each trust. For example, the run_app() function might be passed list(patient_experience = c(TRUE, 2, 5), staff_experience = c(FALSE)), indicating to draw the patient experience function module with 2 free text questions and 5 Likert-type questions, and to omit staff experience. In practice the options list would need to be longer, but this is just to illustrate the principle. The data module would load and process the data according to the arguments in run_app(). It could then either instruct the function modules, as in the automatic dictator data module, or leave it up to them, as in the plain automatic data module.
  4. “Dumb” data module. This data module would be dumb. It would just load the data, process it, and export it to the other modules. The function modules would be expected to react intelligently to the presence or absence of data and to make sense of the structure of the data. I don’t think this example would work well because there would be nothing to turn off the function modules, which means there would be blank tabs in the application, but I’m including it for the sake of completeness in case it gives me or anyone else an idea at a later stage.

Most of the options, except 4, are pretty similar, but they have a lot of scope to change the overall way that the application is programmed. Leaving stuff up to the function modules is probably more flexible, but it may be that all the function modules need tweaks in different trusts, which is not really what we want. Having an automatic dictator module is very attractive in that you should just be able to seal the whole thing and deploy it anywhere, but the reduced flexibility inherent in that approach may make it more difficult to actually get it working in different contexts. In practice some of it is probably a bit of a grey area anyway- I don’t think you could have a completely dumb data module, and nor could you have a completely dumb function module. Even the submodules will have various degrees of “dumb”. It’s more about getting an idea of where you build the flexibility and how much control you give each module over its behaviour.
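For a flavour of what option 1 might look like in code, here is a hedged sketch of the non-reactive heart of an “automatic dictator” data module. Every name here is hypothetical, and in the real application this logic would sit inside a {golem} module server wrapped in reactives:

```r
# Given a named list of datasets (NULL where loading failed), decide which
# function modules should appear and describe each dataset's shape for them
describe_datasets <- function(datasets) {
  available <- datasets[!vapply(datasets, is.null, logical(1))]
  lapply(available, function(df) {
    list(n_rows              = nrow(df),
         free_text_questions = sum(vapply(df, is.character, logical(1))))
  })
}

datasets <- list(
  patient_experience = data.frame(
    score   = c(4, 5),
    comment = c("kind staff", "cold food"),
    stringsAsFactors = FALSE
  ),
  staff_experience = NULL  # e.g. this trust has no staff data
)

instructions <- describe_datasets(datasets)
names(instructions)  # only "patient_experience"- so only that module shows
```

The function modules would then read both the dataframes and this description, rather than having to work out the shape of the data themselves.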

I’ve written this blog post to show it to one person whom I will now ask politely on Twitter for their opinion, but if anybody else has any thoughts I would be very glad to hear them. It’s better to tweet @chrisbeeley because I don’t seem to get email notifications from blog comments for some reason.

Accessing plot characteristics in Shiny modules

It was my New Year’s resolution to blog more and that’s going really well because this is the first blog I’ve done all year and it’s nearly the end of January. Well, I suppose to be fair I did do a post on our team blog but that feels like I’m making excuses.

Anyway, never mind! This is a quick blog post because this problem I had the other day has, as far as I can tell, absolutely no Google juice at all and it stumped me for ages, even though it is very simple.

We wanted to access the height of a Shiny plot at runtime but we couldn’t get it to work in a Shiny module. I googled all sorts of stuff, “session$clientData module”, “accessing plot characteristics shiny module”, “session variable shiny module”, all kinds of things, but it just didn’t work. I figured the namespacing was probably messing it up but I couldn’t figure it out. The simple version would have been session$clientData$output_sentiment_plot_upset_width, but I wasn’t surprised that that didn’t work because of the namespacing.

When I looked at the raw HTML (which I did, in desperation) I saw that the input was actually called “output_mod_sentiment_ui_1-sentiment_plot_upset”. The mod_sentiment bit is the namespacing. So I got pretty close when I tried session$clientData$output_mod_sentiment_ui_1-sentiment_plot_upset_width. But R thought that the hyphen was a minus, so that didn’t work either. Then the light dawned:
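What finally worked was quoting the name: a name with a hyphen in it isn’t a valid R symbol, so it can’t follow $ bare, but backticks or [[ ]] with a string both work. Demonstrated here on a plain list standing in for session$clientData:

```r
# session$clientData behaves like a named list; the namespaced output name
# contains a hyphen, so it has to be quoted
clientData <- list(`output_mod_sentiment_ui_1-sentiment_plot_upset_width` = 400)

clientData$`output_mod_sentiment_ui_1-sentiment_plot_upset_width`     # 400
clientData[["output_mod_sentiment_ui_1-sentiment_plot_upset_width"]]  # 400
```

Inside the module server it’s the same pattern: session$clientData[["output_mod_sentiment_ui_1-sentiment_plot_upset_width"]].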


That’s it. Simple, but it took me a while, so hopefully if you have the same problem you will find this post.

Serving RStudio Connect content to logged in and anonymous users

We have our patient feedback dashboard in the open where anyone can see what Nottinghamshire Healthcare NHS Trust’s patients are saying about us. Now I’ve got a Connect licence I thought perhaps I might build another version for our staff where I put stuff that we can’t share- for example the comments of people who click the “I do not wish to share this comment” box.

But I don’t want to build and maintain two versions (that would be hideous), so I was going to put a little

logged_in <- TRUE

at the top of the one for staff and then just swap the files round, but knowing me I’ll do it in a hurry one Friday and accidentally publish the logged-in version to everybody. For the benefit of those who are new at this, and indeed for those who are not that new but feel the need to double check it actually works like they think it does before they design the whole system around it (like me), it’s pretty simple.

Just publish one and make people authenticate to it, and publish one in the open. Separate apps, separate links, same code. But add somewhere something like

loggedIn <- reactive({
  if (is.null(session$user)) {
    FALSE
  } else {
    TRUE
  }
})

On RStudio Connect, session$user is NULL for anonymous visitors and contains the username once someone has authenticated, so the same code serves both deployments.
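The sensitive outputs then just check loggedIn(). As a minimal sketch of the filtering itself- hypothetical data and column names, kept as a plain function so it runs outside Shiny too:

```r
# Hypothetical patient feedback with a consent-to-share flag
comments <- data.frame(
  text     = c("The staff were wonderful", "Please do not share this"),
  share_ok = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Anonymous visitors see only shareable comments; staff see everything
visible_comments <- function(comments, logged_in) {
  if (logged_in) comments else comments[comments$share_ok, , drop = FALSE]
}

nrow(visible_comments(comments, logged_in = TRUE))   # 2
nrow(visible_comments(comments, logged_in = FALSE))  # 1
```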


Analyst, analyse thyself

I work in a small (growing) team of data scientists and we’re part of a larger analytic unit which has a focus on innovation in analytics (hence us) and public health methods. We have a pretty broad remit to do interesting and useful stuff and we focus on things that we think are useful (for example, applying data science methods to problems relating to equity of access to services) and things that people ask us to help with.

I’m getting more and more interested in making sure that we offer a good service to everyone across my Trust and that we work on truly useful projects rather than just stuff that comes to us through the networks of staff that we already know (I’ve worked in my Trust for nearly 20 years so I know quite a few people).

We’re all resistant to the idea of having a bureaucratic process to decide what we do because quite often we don’t even do that much: we just give people a bit of advice, show them some of their own data, and they’re off again. So I think what I would like to aim for is transparency. We won’t audit the stuff coming in, but we’ll show what comes out, and then people can judge for themselves.

And then I thought “Why am I bothering with all that? Surely this job is about helping people and getting stuff done, not collecting data”. And then I realised a millisecond (as my eldest would say) later that that is exactly what clinicians always say about their data collection! And we make the same arguments back to them that I just made to justify it in the first place. We collect data to make sure that what we are doing is effective and to make sure that it is being distributed fairly and works for everyone.

This is a long way of saying that we’re now thinking about how to capture our activity and present it in a way that follows all the usual rules of data collection- not burdensome, timely, accurate, transparent, all that stuff. I suppose really we need a system that’s transparent to everyone- our partners in the region and the general public, so hopefully I’ll be back with data (unless I get in trouble for even suggesting such a thing, in which case forget I said anything πŸ˜†)

Shiny modules for beginners

I’ve seen some discussion on the Internet about whether people learning Shiny should start with modules or if they’re an advanced topic. I’m not going to link to any examples because this post is definitely not me saying- hey look at this rubbish over here, here’s why it’s wrong. I’ve seen a couple of posts talking about it. But I just thought I would chip in my perspective.

I really love modules and they solve a lot of problems, and I think they could be a great thing for a beginner to learn. No reason at all why beginners can’t learn them. The only caution I would give those learners is that some things can be weird or fiddly with modules. So if you’re learning at your own pace, and want to do everything really neatly and use industry standards, great, go for it. But if you want to throw up a Shiny application to impress your boss (for example) and learn by doing then you may find yourself going down some very long, dark rabbit holes trying to get the particular thing you want working. (and yes, I am writing this post now because I have been down a few myself recently πŸ˜€)

As I say, purely my perspective generated from sitting at my keyboard and interacting regularly with maybe five people learning Shiny. Your mileage may vary

Getting started in data science for healthcare

I just recruited to a data science post at Nottinghamshire Healthcare NHS Trust and I’ve been asked for feedback by several people about how to get started in data science in healthcare, and indeed how to get a job in data science in healthcare. I will also be recruiting another data scientist for our team in a few months. This post is designed to give you a feel for what we were looking for in the job that I just recruited to, as well as to the job that is coming up. They’re slightly different skill sets as I will discuss.

I should say at the top that I am pretty much a person on the Internet, there are people much better qualified than I amΒ to answer this question, and there are much better jobs out there than the ones I have advertised and will advertise. My trust is a great place to work, and we have a lot of freedom to innovate, and we are big believers in using and publishing open source code. I hope we’re a friendly and supportive team too. But I’m sure there are many out there that do all this and more, and better. With all that said, let’s discuss what I was looking for last time, what I’m looking for next time, and what you can do to get the skills for these or any other similar jobs.

The recent post I advertised was for a band 7 data scientist, one year contract, starting salary of Β£38K. It’s for a funded piece of work designing algorithms that can read patient feedback and tell you: what it’s about (parking, food, communication…), how positive or negative it is (you saved my life, this service is a disgrace, the ward was dirty, the food was cold…) and making a dashboard so people can interrogate their data. I don’t want to say too much about it here, otherwise this post will get way too long.

The first thing to say is that the field was VERY strong. I didn’t expect quite as much interest from as many talented people as we got. We were definitely spoiled for choice, but the areas that ended up being very important were:

  1. Machine learning. Because it’s a one-year project with a specific goal, we were looking for someone who could just run with a machine learning project and could build a pipeline with real data.
  2. Team working. The ability to help more junior staff, and to work with non-technical staff, is absolutely vital in such a small team. We were looking for people skills as well as “technical” team-working skills- version control, package management, documentation. Avoiding “it works on my machine”, basically.
  3. Communication. Some people are completely on board with data science, trust experts to just get on with it, and can engage them about what they want and ask the right questions. And some people will sit with their arms folded (literally or figuratively) and wait to be convinced. And a strong organisation needs both. Cheerleaders and critics. And we were looking for someone who could navigate a meeting with a healthy dose of each, as well as someone who could engage the straight-up disengaged.

If you want to get good at this stuff, there are lots of things you can do. This book is excellent if you want to go the Python route. Kaggle is a great resource for datasets and how-tos. Try to get some real data in your workplace and do some ML with it. Be a mentee. Be a mentor. Work with your team to get better practice around version control and code style. Sell it to your bosses. Even if they’re idiots and don’t listen, it’s all experience. Come to the interview and say that. “I identified several areas of data and analytic practice that could be improved by x, y, and z. I wrote them a five point plan explaining the training their staff would need, and the likely benefits and pitfalls. They are idiots and didn’t listen.”

If you already work in the public sector I highly recommend the data science accelerator. Twelve days, one a week, with a data science mentor. I was a mentee a year and a half ago. Then I was a mentor. Doing one or both is a really good way of getting out of your silo, seeing new datasets and challenges, and learning to ask and answer the right questions. Besides my PhD it’s the best thing I ever did, career wise.

The next job is going to be a bit more general purpose. The team has quite a few pieces of smaller paid work coming in and I need somebody who can help out as well as work in the Trust building on the work we’re already doing: statistics, machine learning, and publishing either as a document or a Shiny powered dashboard. We’re looking for similar stuff as last time, but not as focused on machine learning, perhaps someone with some stats knowledge- regression, experimental design, measures of association, all that stuff. I feel like R would be a big advantage, but if somebody can do all that in Python then that’s good too, and ditto other things like Julia. They would probably end up learning some R just to read the stuff we’re all writing, and we’d all end up learning Python/ Julia just to read their stuff. All to the good.

Again, we need someone who can work in a team (no hero coders) and communicate at all levels of the Trust and beyond. We need someone who can teach, and learn, and someone who can convince the sceptical and engage the disengaged. Something I would absolutely love to be able to recruit to is Linux server maintenance experience. I’m in charge of two Linux servers that do our work for us, one in the cloud, one behind the firewall (see this blog, passim) and I would love to be able to give the keys to somebody who knows what they’re doing. Even more, I’d love for them to do the next upgrade. Going to RStudio Connect with Ubuntu 20.04 was hideous (not the fault of RStudio or Ubuntu, obviously, purely my fault πŸ˜€) and having someone be able to worry about that would have been wonderful.

Some people have told me I’m wasting my time managing my own servers as a data scientist and I really need to get proper IT support, and maybe they’re right, but I’ve come this far, even if I am doing it wrong. I’ve been learning Linux server stuff on my own server for 7 years. If you want to learn this way without buying your own cloud server then just buy a Raspberry Pi and ssh into it from behind your firewall at home. Set up a LAMP stack, set up WordPress, set up the free versions of RStudio Server and Shiny Server, run plumber APIs, run PHP, write a Django application, whatever. You’ll get in the most hideous mess and tear your hair out for entire weekends and you’ll look back at those weekends fondly and be glad of all the learning you did that day πŸ˜€. And if you get in a complete mess just wipe the SD card and start again.

As I say, this is all caveat emptor. Your mileage may vary. But if you want my opinions then these are they. Keep an eye on my Twitter if you’re interested in the job, and feel free to @ or DM me, or indeed send an email, my address is in my Twitter profile.