The Great Survey Munge

As I mentioned on Twitter the other day, I have this rather ugly spreadsheet that comes from some online survey software that requires quite a lot of cleaning in order to upload it to the database. I had an old version written in base R but the survey has changed so I’ve updated it to tidyverse.

And this is where tidyverse absolutely shines. Using it for this job really made me realise how much help it gives you when you’ve got a big mess and you want to rapidly turn it into a proper dataset, renaming, recoding, and generally cleaning.

It must be half the keystrokes or even less than the base R script it replaces. There are some quite long strings in there, which come from the survey spreadsheet, but that’s all just cut and paste, I didn’t write anything for them.

I love it profoundly, and I bet if I was more experienced I could cut it down even more. Anyway, here it is for your idle curiosity. It is obviously of no use to anybody who isn’t working on this data, and to be honest there are bits that you probably won’t even understand, but a quick look should show you just how much help I had from the wonderful tidyverse maintainers.

Rapidly find the mean of survey questions

Following on from the last blog post, I’ve got quite a nice way of generating lots of means from a survey dataset. This one relies on the fact that I’m averaging questions that go 2.1, 2.2, 2.3, and 3.1, 3.2, 3.3, so I can look for all questions that start with “2.”, “3.”, etc.


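survey <- survey %>% bind_cols(
  map(paste0(as.character(2 : 9), "."), function(x) {
    # average every column whose name starts with this prefix, row by row
    return_value <- survey %>% 
      select(starts_with(x)) %>% 
      rowMeans(., na.rm = TRUE) %>%
      as.data.frame()
    names(return_value) <- paste0("Question ", x)
    return(return_value)
  })
)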

This returns the original dataset plus the average for each group of questions, labelled “Question 2.”, “Question 3.”, etc. (the trailing dot comes from the prefix used for matching).
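If you want to see what this does before pointing it at real data, a toy version might look like the sketch below. The toy_survey data, its column names, and the section_means and toy_with_means names are all made up for illustration; only two question groups are used to keep it short.

```r
library(dplyr)
library(purrr)

# Hypothetical toy survey: two questions in section 2, two in section 3
toy_survey <- tibble(
  `2.1` = c(1, 2, NA),
  `2.2` = c(3, 4, 5),
  `3.1` = c(10, NA, 6),
  `3.2` = c(0, 8, 6)
)

# One single-column tibble of row means per question prefix
section_means <- map(c("2.", "3."), function(prefix) {
  means <- toy_survey %>%
    select(starts_with(prefix)) %>%
    rowMeans(na.rm = TRUE)
  out <- tibble(value = unname(means))
  names(out) <- paste0("Question ", prefix)
  out
})

# Original columns plus one mean column per prefix
toy_with_means <- bind_cols(toy_survey, section_means)
print(toy_with_means)
```

Because of na.rm = TRUE, a row with one missing answer still gets a mean from the answers that are present, which may or may not be what you want for your own survey.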

Converting words on a survey dataset to numbers for analysis

As always, I have very little time for blogging (sorry) but I just came up with a neat way of converting “Strongly Agree”, “Always”, all that stuff that you get on survey based datasets into numbers ready for analysis. It’s automatic, so it will play havoc with any word based questions; analyse them in a separate script.

Here it is


survey <- map_df(survey, function(x) {
  # pick the scale by looking for a telltale answer anywhere in the column
  if(sum(grepl("Agree", unlist(x), ignore.case = TRUE)) > 0) {
    case_when(
      x == "Strongly agree" ~ 10,
      x == "Agree" ~ 7.5,
      x == "Neither agree nor disagree" ~ 5,
      x == "Disagree" ~ 2.5,
      x == "Strongly disagree" ~ 0,
      TRUE ~ NA_real_
    )
  } else if(sum(grepl("Always", unlist(x), ignore.case = TRUE)) > 0){
    case_when(
      x == "Always" ~ 10,
      x == "Often" ~ 7.5,
      x == "Sometimes" ~ 5,
      x == "Rarely" ~ 2.5,
      x == "Never" ~ 0,
      TRUE ~ NA_real_
    )
  } else if(sum(grepl("Dissatisfied", unlist(x), ignore.case = TRUE)) > 0){
    case_when(
      x == "Very satisfied" ~ 10,
      x == "Satisfied" ~ 7.5,
      x == "Neither satisfied nor dissatisfied" ~ 5,
      x == "Dissatisfied" ~ 2.5,
      x == "Very dissatisfied" ~ 0,
      TRUE ~ NA_real_
    )
  } else {
    # not a recognised scale: return the column unchanged
    x
  }
})

Glorious. R makes it too easy, really, I think, sometimes 🙂
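For anyone who wants to try the idea on something small first, here is a rough sketch on made-up data. Note that it swaps the case_when() calls for named lookup vectors, which is a different way of writing the same mapping and not the code above. The toy data, the agree_scores and freq_scores vectors, and the recode_column helper are all hypothetical names invented for this example.

```r
library(dplyr)

# Hypothetical toy data: an agreement question, a frequency question, free text
toy <- tibble(
  q1 = c("Strongly agree", "Disagree", NA),
  q2 = c("Always", "Never", "Sometimes"),
  comment = c("Great", "Fine", "Poor")
)

# Lookup vectors: names are the survey answers, values are the scores
agree_scores <- c("Strongly agree" = 10, "Agree" = 7.5,
                  "Neither agree nor disagree" = 5,
                  "Disagree" = 2.5, "Strongly disagree" = 0)
freq_scores <- c("Always" = 10, "Often" = 7.5, "Sometimes" = 5,
                 "Rarely" = 2.5, "Never" = 0)

# Recode a column if it looks like one of the known scales, else leave it alone
recode_column <- function(x) {
  if (any(grepl("agree", x, ignore.case = TRUE))) {
    unname(agree_scores[as.character(x)])
  } else if (any(grepl("Always", x, ignore.case = TRUE))) {
    unname(freq_scores[as.character(x)])
  } else {
    x
  }
}

toy_numeric <- toy %>% mutate(across(everything(), recode_column))
print(toy_numeric)
```

Anything that isn’t in the lookup vector (including missing answers) comes back as NA, and the free text column passes through untouched because it never matches either scale. Like the version above, a frequency column in which nobody answered “Always” would not be detected.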

Decade round up post

I’ve got into the habit of writing a yearly roundup blog post (see this blog, passim), based on a suggested framework by David Allen. Since this is the end of a decade I thought it would be fun to do one for the whole decade.


Regular readers will know that as far as physical goes I have spent most of the decade getting progressively sicker with two different diseases, beating both, and then having complications from the liver transplant that saved me from the first one. Perhaps the pinnacle was having a colectomy/ splenectomy in September 2017, bleeding a lot and being very sick for several weeks afterwards and then running a marathon in 3 hours 44 minutes just 18 months later in February 2019. I went on to win silver in the 5K in the British Transplant Games in July of this year, but I am not currently running due to complications from my liver transplant. I’ll be fine, I need to go and see the doctors next year, they’ll sort me out.


I started meditating in 2019 to help me to manage my feelings about being ill. I came to rely emotionally on running and being successful at running and I found that when that was taken away it was hard to cope. I’ve been meditating regularly and have been on a few day long meditation retreats and it’s made me a better person, particularly a better dad, since it helps me to manage my emotions when the kids are playing up. I now regularly attend a Buddhist temple and am trying to dedicate myself to practising compassion for all living things.


I’m still pretty hopeless with money, and we spent a lot on the bathroom fairly recently. I’ve managed to save up a bit of money and I’m currently trying to save three months salary because I gather that’s the minimum you’re supposed to have. I hope that by about 2022 I will be debt free and have three months salary tucked away.


This decade my family changed an awful lot, because I had two kids, one in 2011 and another in 2014. I found myself an accidental attachment parent, in fact I wrote about it here if you’re interested. I have two boys, they fight a lot and one of them is extremely defiant and can really press my buttons. They are and will always be my greatest creation, and they’re both kind, intelligent, and funny. I’m pretty confident already the world is a better place with them in it, and I’m sure they’ll continue to make me proud as they get older.


As far as work goes this decade has been absolutely transformative. In 2010 I went on an R course. I used SPSS for my PhD (and some weird thing to do mixed effects modelling, I forget what it was called) but I went on a course teaching regression methods and they introduced R almost incidentally as a tool to do regression analysis. I was working on a patient survey at the time and we were using Excel macros to draw the graphs (me! using Excel macros! Don’t tell anyone, I’ve got a reputation to protect 😉). It was very time consuming and I realised that I could program R to dump all the images to a folder, and then we put them together by hand in Microsoft Word. This took a week! A whole week! Then I realised I could use odfweave (now orphaned) to put the whole document together. This really turned heads. This job used to take a week and now I could do it in a few hours (a little tidying up of the formatting was required).

And then in 2013 everything changed again because my Trust won some money from an innovation fund and we decided to put all of our patient experience data online in an interactive dashboard. As always, when I agreed to this I had no idea how to actually do it and I hadn’t heard of Shiny. While I was working on it I heard about Shiny and started prototyping in it, reasoning that I would need to make it into something “proper” later on. In fact, Shiny was more than equal to the task and I became one of fairly few people at the time who were using Shiny in production. I used an early version of Shiny (maybe 0.6) and one of the early versions of Shiny server. Because I was one of a few people I ended up being approached by a publisher to write a book about Shiny and I have now written three editions of it. Now my work is very focused on Shiny and when the people who have heard of me think of me they tend to think Shiny.

The other big thing that has changed the way I work, which is much more recent, is the data science accelerator. The data science accelerator is a 12 day mentorship programme and it’s the best learning experience I’ve had since my PhD. I would highly recommend it, there are more details here. I’m still in touch with my mentor and he was and is incredibly helpful, supportive, and knowledgeable (thanks, Dan, if you ever see this!).

I am in charge of patient experience in my Trust and I work closely with the Involvement team who do a lot of work making sure the data is collected and read by the right people and that those people do something about it. They’re a great team and they’ve won many awards for what they do, you can find out more about them here (this site will give a scary certificate warning and then redirect you at the moment, we’re between websites, but that’s the link that will still work in 6 months’ time). Anyway they’ve had a big influence on me in that they are not very interested in the ticky box scores, they’re mainly interested in the comments. So I’ve spent a long time trying to build something that they would want to use, and it’s a machine to help them read all the comments. I call it the recommendation engine for patient feedback and you can read all about it here. It would never have occurred to me to build it if it weren’t for the Involvement team.

Anyway, the recommendation engine is incredibly difficult and ambitious and will take 5 years or even more but I got started with text mining on the data science accelerator. And now I am very close indeed to securing funding to spend a whole year building a text mining application which will:

  1. Automatically tag comments by theme
  2. Automatically tag comments by sentiment
  3. Use document similarity to find similar comments
  4. Tie it all together with a nice Shiny dashboard

This is incredibly exciting and I owe it all to the data science accelerator. I can’t talk about the project in detail yet because we don’t have the funding but once we do I’ll be back to talk about it. I’ll be recruiting a new person hopefully in April, so look out for that too if you’re interested.

The future

In the next decade it will be more of the same, really. I would like to build the biggest and best data science team in any provider Trust in the country, win a medal at the world transplant games, and watch my boys grow into the talented, unique, and kind people I know they will become. That should keep me busy. Can’t wait to do a roundup in 2030 and see how I got on 🙂

My superpower (a talk I gave about being ill)

So this post is nothing to do with R, or Linux, or statistics, or any of the usual stuff. It’s about me. It’s more than possible that you’re not very interested in that so consider yourself warned.

Long time readers will know I’ve had some pretty serious health problems over the years. I haven’t really talked about it much on here. To be honest I was frightened that people might be put off working with me if they thought I would be off sick a lot and I was also frightened at one point that I might actually be permanently unable to work if the liver disease wasn’t treatable. So I avoided talking about it with the wider world, although the people I directly work with know all about it (it’s hard to miss, really, when someone comes to work with end stage liver disease).

Anyway, I’ve since realised that I’m not really helping other people in the same situation being quiet about it, and actually I needn’t have worried and I don’t think my illnesses have put anybody off working with me at all. I’ve missed a few months here and there having surgery, and I was quite unproductive while I waited for my liver transplant, but my employer and others were very understanding and it all worked out in the finish.

So this is my story, condensed. Now seems like a good time to add that actually my new liver is not working very well. I was pretty ill a few months back and have been in hospital twice this year with it, and it looks certain that I will at some point in the medium term need another liver transplant. I’m not going to hide it away like I did last time, I’m going to be open about it and hopefully other people who might be going through something similar for the first time might feel better about their own situation if they see me doing it.

Someone I work with did an absolutely wonderful post about working and being ill, and perhaps I’ll do one at some point, but this will do for now. Actually now I think about it I probably need to stand up and be counted as one of the many people all over the world who have a stoma because I know some people find it very stigmatising when they get one and they hide it away, so perhaps I’ll talk about that another time too. Mine is very often hanging out the bottom of my jumper and stoma bags are absolutely nothing to be ashamed of. I’m actually rather proud of mine since getting it nearly killed me, but that’s a story for another time.

Sorry about the sound quality, I have made a start on subtitles but it takes absolutely ages, I have turned on community uploads if anybody feels like giving me a hand with it.

NHS-R conference

So I’ve just got back from the NHS-R community conference, which was amazing of course, and it’s got me in the mood to share, so I’m writing a few blog posts. I’ve got some more in depth stuff to say about where I think NHS-R is/ should be going, but this is the “feels” one.

As I mentioned on Twitter, I love the NHS and I love R so I’m obviously going to enjoy the NHS-R conference. It was really good and the standard of talks and workshops was really high. Much bigger and better than last year.

It was so inspiring to be there. To be honest I get a bit depressed about the state of analytics in the NHS (as you’ll see from my Twitter) but you would never guess it for those two days. The big feeling that I got while I was there is that I need to up my game. There were people with huge, fully functioning packages, multiple academic publications from their R work, all sorts of complicated machine learning that was beyond my ken, big complicated statistical models, you name it really.

It’s such a wonderful feeling, to escape the inane drudgery of idiotic performance reporting and looking at my seven millionth SPC and being asked if three points of increase is a trend, and to be in the company of people who had done truly remarkable things with R. Remarkable things, I should add, even with the dead hand of bureaucracy and idiotic IT security systems, and all the stuff that I have to do. When Google do cool things, it’s easy to just think “Well, that’s Google. I couldn’t do that where I am”. These people had done stuff inside the NHS and they had the scars to prove it!

I think probably I will do a Shiny workshop at the NHS-R conference every year until I die, but I might try to sneak a paper in there as well actually next time. More on this subject once the work flowers a bit, I hope.

Authenticating using LDAP from Active Directory using Shiny Server Pro

Right, I promised I would write this a very long time ago and I still haven’t done it. I’ve just got back from the NHS-R community conference (see an upcoming post) and I’m in a sharing mood, and besides someone just asked me about this on Twitter.

So setting up LDAP based authentication on Shiny Server Pro via Windows Active Directory. Do note that this post is about using Active Directory. If you’re not doing that a lot of this stuff will not apply.

I made the most horrible mess of this when I tried, because quite frankly I didn’t have the slightest idea what I was doing. I’m sure your setup will be different from mine, but I can offer some general advice. Do read the documentation as well, obviously, this is only going to fill in the blanks, it’s not going to replace the docs.

The first piece of advice I can give you is don’t try to get it working all at once in the configuration file. Unless you’re some sort of Linux security genius it’s not going to work. Instead use the ldapsearch command to make sure that you’ve got the right server and your search string is correct. You’re going to need the help of IT who can tell you a bit about what the search string might look like. To give you a clue, here’s what I finally got working:

ldapsearch -H ldap:// -x -D "CN=User.Name,OU=Group,OU=Group2,OU=Group3,DC=xxxx,DC=xxx,DC=nhs,DC=uk" -b "OU=Users All, DC=xxx,DC=xxxx,DC=nhs,DC=uk" -w password "(&(cn=test.user))"

You can see the location of the LDAP server at the beginning, I had to take the details out because I was told that publicising the server name poses a security risk. That first CN=User.Name is the LDAP user. You can just use yourself or IT can set you up a special LDAP login (you will probably want to switch to this once you’ve got everything working). Then in that string the next bit is where the user is in terms of their place on the tree and the DC bit is the domain and subdomain split up into bits. Again I’ve redacted it. Imagine it might say something like DC=mytree, DC=myorg, DC = if the LDAP server was on

Then the string after, the bit that says “OU=Users All…” shows where the person you’re searching for in the tree is. You can search for yourself, or someone else. There may be more than one OU group, depending on what IT tell you, and then the DC bit is the same as it was in the previous bit. Then you need -w password (this is yours or your LDAP user’s password) or if you don’t want to put your password into the terminal use -W, which will prompt for your password. And the last bit as you can probably guess is the username of the person that you’re searching for. Once it works it will bring back a big list of all the groups and stuff that the user is registered in.

So now you’ve got that working you can edit /etc/shiny-server/shiny-server.conf to add the authentication stuff. You probably already did something with it setting up application directories etc.
The thing that I got working was as follows (again with bits redacted, I’m afraid):

auth_active_dir ldap://,DC=xxxx,DC=nhs,DC=uk{

  base_bind "CN=ldap.user,OU=Location,ou=GroupXXX,OU=GroupYYY,{root}" "LDAP_user_password";
  user_bind_template "{username}@xxx";
  user_filter "sAMAccountName={username}";
  user_search_base "OU=GroupXXX";
  group_search_base "OU=GroupXXX,OU=GroupYYY";
}

Do have a look at the examples in the documentation as well, it will help you to understand what yours might look like. The base_bind line has the actual user name in the first bit, e.g. CN=chris.beeley or CN = ldap.user, and the actual password at the end. This is not very secure. Your connection is also being made over LDAP, not LDAPS. Also not very secure.

Let’s fix both those problems now. Add an “s” to the server name at the beginning (ldaps://…). Then you’ll need to point to a certificate on the trusted_ca line. IT should be able to give you one. Note that it’s a ROOT certificate you need. IT gave me loads that were not root certificates which did not work. I ended up downloading it myself from the Intranet through Chrome which messed up the domain name and I never quite resolved it to be honest. It works, that ended up being enough by the end. Note also that the certificate must be in PEM format.

auth_active_dir ldaps://,DC=xxxx,DC=nhs,DC=uk{

  trusted_ca  /…pathtodir/certificate3.pem;
  base_bind "CN=ldap.user,OU=Location,ou=GroupXXX,OU=GroupYYY,{root}" "/pathtopassword/password.txt";
  user_bind_template "{username}@xxx";
  user_filter "sAMAccountName={username}";
  user_search_base "OU=GroupXXX";
  group_search_base "OU=GroupXXX,OU=GroupYYY";
}

You can see also that we’ve written the file path to a text file containing the password on the second line. It’s a good idea to modify the permissions of this so only root can read it, to keep it safe.
That’s it! You’ve done it! Well done! That took me weeks.

Wait! One more thing. You now have your staff putting their actual usernames and passwords into a web form over HTTP. This is VERY insecure. We need to secure over SSL. This is pretty easy. You just need a key and a certificate for the server from your IT department. Tell Shiny to listen on 443 (instead of, say, 3838) as below and put the key and certificate file path in. So the top of your config will look like this.

server {

  # Instruct this server to listen on port 443, the default port for HTTPS traffic
  listen 443;

  ssl /etc/shiny-server/KEY.key  /etc/shiny-server/certificate.cer;

… etc.

Your users can now connect at https://serveraddress

That’s really it now. Well done. Have a cup of tea to relax and celebrate a job well done. I apologise that this post is a little light on details, it will be very different I think depending on where you are, but if I’d read this post before I’d started it would have helped me, and I hope it helps you too.

Add label to shinydashboard

I feel like I’m sticking my neck out a bit here, and there’s a simple way of doing this that I haven’t found, but I’ve looked pretty hard and “add label to sidebar shiny dashboard” has basically no Google juice at all, and I should know because I’ve been staring at it for half an hour.

Sometimes you want to add a simple, static label to a shinydashboard sidebar. If you just add it with p() it isn’t aligned nicely with the rest of the sidebar controls.

You can add something dynamic with sidebarMenuOutput, and you could do that as a long way round, but I got to thinking that there must be a simple way of doing it. I ended up looking at my shinydashboard in developer view in Chrome, and just stealing the markup from there. Once you’ve done that it’s very simple.

div(class = "shiny-input-container", 
      p(paste0("Data updated: ", date_update))
)

This code just shows a pre-calculated value of the date when the data was updated. I stole this idea from a colleague because sometimes the cron job that updates the data chokes and nobody notices for a while.
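To show where that snippet sits, here is a minimal sketch of a dashboard with the label in the sidebar. Everything around the div() is assumed for illustration: the title, the single menu item, and the way date_update is set here (today’s date as a stand-in for whatever your data update job actually records).

```r
library(shiny)
library(shinydashboard)

# Placeholder: in practice this would be read from wherever the update is logged
date_update <- format(Sys.Date(), "%d %B %Y")

ui <- dashboardPage(
  dashboardHeader(title = "Label demo"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Home", tabName = "home")
    ),
    # Static label, borrowing the class that Shiny gives its input containers
    div(class = "shiny-input-container",
        p(paste0("Data updated: ", date_update))
    )
  ),
  dashboardBody()
)

server <- function(input, output, session) {}

# Uncomment to launch the app interactively
# shinyApp(ui, server)
```

The only trick is the class on the div, which picks up the same padding as the other sidebar controls so the text lines up with them.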

The analysts’ manifesto

I was at an event a little while ago and there was talk of change coming for healthcare analysts. With the advent of population health management we were going to finally get the recognition we as a profession deserve and would get the training and tools necessary to deliver the improvements we all know that data can give.

I put my hand up and said, in essence, we’ve heard it all before and I’ll believe it when I see it. I’m tired, frankly, of reading policy documents that lay out all these amazing new ways of working that just never happen because we’re not given the time and space by senior management and have to spend our time running around sending all this pointless data to all these different bodies that a lot of the time seem to just throw the data on the pile and then forget about it.

To the speaker’s great credit, they accepted my challenge and I spoke to them later on. We agreed that we would work on a manifesto for healthcare analysts, with me gathering submissions from the analytical community and coordinating it into a document.

This is the first draft. I invite anyone and everyone to criticise, challenge, build on, edit, rewrite, whatever, what is here. This is nowhere near the final version but I thought I would get it out there now and see if anybody has any reactions. I’ve been working with several other analysts so it’s not all me but the writing is and you can blame me for the strident tone.

The analysts manifesto

There’ll be more of this in the months to come, we’re going to take it out on the road, so to speak, and in particular I want leaders to see it. Too often this stuff is handed round healthcare analysts at the coalface who already know this stuff. I want our leaders to know what we need and what we can do, and this is the first step.

Python, working together, and productionising code at Nottinghamshire Healthcare NHS Trust

Someone just asked me a question on Twitter and I was most of the way through what would have been several tweets before I thought perhaps it would work better as a blog post. I’m going to answer the question first, and then talk some more about the context and what I’m hoping to achieve where I am and (with the advent of more collaborative working through the ICS) in the wider system. So I’ve been using scikit-learn to train a classifier to predict non attendance at healthcare appointments and the question to me was “What made you choose Python rather than R for this one? Which classifier are you using?”.

I’m using a couple of classifiers, to be honest I’m really just messing around but I always like to tweet and blog at the beginning because you end up having useful conversations that can save you time later on. I’m sort of aping this excellent paper and have got some pretty good preliminary results using gradient boosting, but as I say I am just kind of messing around at the moment.

The reason I’m using Python instead of R is twofold. The first reason is that I have been using Python more and more doing text mining (for a very brief outline that I will come back to, see this blog post). Python is a little bit ahead in terms of the text mining algorithms, because that’s what people are using, and for a while I would do all the data cleaning and outputs (think Shiny) in R, just nipping into Python for four lines to run the model. The problem with doing that is that my Python is so bad that I often can’t even make the model run. Or even if it runs I can’t necessarily get the outputs or plots that I need. So I’ve realised I just need to rip off the plaster and learn Python. Properly, end to end. So when this predicting DNA with machine learning thing came along it seemed the perfect project in which to build a Python pipeline. I could do it in R in a fifth the time, I’m sure, but I wouldn’t be learning to use Python to help with the text mining stuff, so I’m being disciplined and torturing myself learning this new language (it’s hard work but it’s not torture, obviously, I’m having a great time).

There is a bigger reason, however, which is what motivated this blog post. People in the system are using Python. The people who wrote that paper cited above used Python and they’ve indicated their willingness to look at sharing the code once they’ve tidied it up a bit. Some other people doing something else I haven’t talked about yet (but will) are also using Python. I had a bit of A Moment the other day when I realised that it’s no good just building reusable, sharable technology. That’s only the first step. In a way, that’s the easy bit. Easier bit. We need people out in the system who can actually use it. Can you just go to any provider Trust in the country and say “Hey, I fitted a machine learning model to predict DNA with scikit-learn, here’s the code/ a Docker image” and they’ll just say “Thanks very much, we’ll productionise it in the data warehouse next month”? Nope! And they definitely won’t say “That’s very interesting, we’ve compared it with some of our internal testing and we’ve improved the model, we’ve already submitted a pull request”.

We have a saying where I come from, the Involvement team, I’m sure they got it from somewhere but I heard it from them first. “If you want to go fast, go alone. If you want to go far, go together”. Learning Python for me is just a tiny step towards me being part of the solution to the problem of “We have all these amazing machine learning models ready to rock out in the wilds of the NHS, but who on Earth is going to get them set up and improving lives/ saving money?”

As with quite a lot of my blog posts at the moment, I am aware that I’m talking about tackling huge, systemic problems, and I’m in no doubt that I’m the absolute tiniest cog in the whole system, so it does feel a bit weird talking about all these huge, system level changes. But be the change you want to see, that’s what I always say, so I’m doing my best.