Productionising R at Nottinghamshire Healthcare

I’m hopeful we’re moving into a bit of a new phase with using R in my Trust so I thought I’d outline the direction of travel, to see if it chimes with anyone else and just to keep people up to date about what we’re doing.

We’ve used Shiny for some years now, maybe 7, and we have applications behind the firewall and in the cloud which are well used by the staff who need them. We’ve also been building our skills (R and data ops, which is a whole other post) and I’m hopeful that we’re ready for the next phase, which I would call the “productionise” phase. There are two main tasks to achieve before we’ve really got the work that we’re doing nicely embedded in everyday practice at Nottinghamshire Healthcare.

The first thing we’re looking at at the moment is using the data warehouse better. We have a very large and well featured data warehouse and it does loads of really great stuff but we (and I’m the worst offender for this) are still relying too much on pulling stuff out of it to, say, Excel, and then reading that into R and off we go with an analysis. It’s a lot of work getting it out, and you can’t easily communicate with the data warehouse people about what you did because you’ve introduced this whole other layer of messing around with Excel sheets on top. So the first bit for me (and this is not revolutionary in any way, we’re just building up to something) is to use the {odbc} and {pool} packages, with Shiny, to directly interface with the data warehouse, do the analysis, deploy it as a Shiny application, and just have it living live on the data warehouse. It’s striking straight away that when you do that suddenly you’re talking the same language as your BI team and you can communicate to them what you’ve done and give them the tools to reimplement it themselves if they want to.
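To make that a bit more concrete, here is a minimal sketch of how {odbc}, {pool} and Shiny might fit together. It is not code from any of our applications: the DSN `data_warehouse`, the table `ae_attendances`, its columns, and the three function names are all made up for illustration, and the `?` placeholder assumes the ODBC driver accepts parameterised queries.

```r
# Sketch only: the DSN, table and column names are hypothetical.

# Open a pool of connections to the warehouse via an ODBC data source.
connect_warehouse <- function(dsn = "data_warehouse") {
  pool::dbPool(drv = odbc::odbc(), dsn = dsn)
}

# Run a parameterised query; `con` can be a pool or a plain DBI connection.
attendances_for_site <- function(con, site) {
  DBI::dbGetQuery(
    con,
    "SELECT attendance_date, attendances FROM ae_attendances WHERE site_name = ?",
    params = list(site)
  )
}

# Build the app around an existing pool; the SQL stays in one readable place.
warehouse_app <- function(con) {
  ui <- shiny::fluidPage(
    shiny::textInput("site", "Site name"),
    shiny::tableOutput("attendances")
  )
  server <- function(input, output, session) {
    output$attendances <- shiny::renderTable({
      shiny::req(input$site)  # wait until a site has been typed in
      attendances_for_site(con, input$site)
    })
  }
  shiny::shinyApp(ui, server)
}

# Usage (needs a real DSN, so not run here):
# pool <- connect_warehouse()
# shiny::onStop(function() pool::poolClose(pool))
# warehouse_app(pool)
```

Keeping the SQL in its own small function is a deliberate choice in this sketch: it is the bit you can hand straight to the BI team.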

And the second bit we need to get right is productionising the Shiny applications themselves. I’ve been looking at the {golem} package to do this. Golem really appeals to me for two reasons. Firstly, because it provides a robust framework for modularising code. I’m starting to realise that I do write the same Shiny code over and over again. Something in particular that I have done a lot is write code that makes sense of a spreadsheet. Ultimately I would like to write a module that can take an incredibly messy spreadsheet and with a few clicks from the user tidies it up ready for further processing. And with the right level of modularity I could use that over and over again. I did start writing one a little while ago but I got distracted.
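As a rough idea of what such a module could look like, here is a sketch, not the one I started writing. It assumes Shiny 1.5 or later for `moduleServer()` and a CSV export rather than a real Excel file, and the names `tidy_sheet`, `tidy_sheet_ui` and `tidy_sheet_server` are hypothetical. In a golem project the skeleton would more likely be generated with `golem::add_module()`.

```r
# Pure helper: promote one row to the header, then drop blank rows and columns.
tidy_sheet <- function(raw, header_row = 1) {
  stopifnot(is.data.frame(raw), header_row >= 1, header_row < nrow(raw))
  header <- as.character(unlist(raw[header_row, ], use.names = FALSE))
  body <- raw[seq(header_row + 1, nrow(raw)), , drop = FALSE]
  names(body) <- make.names(header, unique = TRUE)
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  keep_cols <- !vapply(body, function(col) all(is_blank(col)), logical(1))
  body <- body[, keep_cols, drop = FALSE]
  keep_rows <- !apply(body, 1, function(row) all(is_blank(row)))
  body <- body[keep_rows, , drop = FALSE]
  rownames(body) <- NULL
  body
}

# Module UI: the "few clicks" the user makes.
tidy_sheet_ui <- function(id) {
  ns <- shiny::NS(id)
  shiny::tagList(
    shiny::fileInput(ns("file"), "Upload a CSV export"),
    shiny::numericInput(ns("header_row"), "Row containing the column names",
                        value = 1, min = 1)
  )
}

# Module server: returns a reactive holding the tidied data frame.
tidy_sheet_server <- function(id) {
  shiny::moduleServer(id, function(input, output, session) {
    shiny::reactive({
      shiny::req(input$file, input$header_row)
      raw <- utils::read.csv(input$file$datapath, header = FALSE,
                             stringsAsFactors = FALSE)
      tidy_sheet(raw, header_row = input$header_row)
    })
  })
}
```

Keeping the tidying logic in a plain function, separate from the module wrapper, means it can be tested without running an app, and the module can be dropped into any application with `tidy_sheet_ui("sheet")` and `tidy_sheet_server("sheet")`.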

I’ve got a few projects coming up that exemplify the process and an abstract submitted at the R in medicine conference so hopefully I’ll be back to say more about this all in due course with some real examples. In the meantime there’s some bare bones golem code here and here but it’s early days so please don’t judge me 🙂

app.R and global.R

I’m doing some Shiny training this year and I want to teach whatever the new thinking is so I’ve been reading Hadley Wickham’s online book Mastering Shiny. There are a couple of things that I’ve noticed where Shiny is moving on, so if you want to keep up to date I suggest you have a look. I’m going to pick out a few here. Firstly, note that in Shiny 1.5 (which is not released at the time of writing) all code in the R/ directory will be sourced automatically. This is a very good idea, I’ve got loads of source("useful_code.R", local = TRUE) lines in some of my applications, so it gets rid of all that.

Something else that’s different is that modules are working slightly differently, which I haven’t bothered to absorb yet, I’ve enough on my plate with {golem}, but if you’re using modules I suggest you keep abreast and have a look at the section on modules.

The thing that has just pricked up various ears on Twitter, though, is that there is no option in the RStudio wizard to create separate server.R and ui.R files when creating a new Shiny application. I was very surprised when I noticed this. And indeed if you look at Hadley’s book there is no mention of server.R and ui.R, it’s all just app.R. He suggests that if you are building a large application you hive off the logic (in Shiny 1.5, in the R/ folder, or for now by calling source("file.R", local = TRUE)).
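Until the automatic sourcing arrives, something like the following sketch would do the same job by hand. The helper name `source_helpers` is made up, and the assumption is that every .R file in the R/ folder is safe to source in alphabetical order.

```r
# Source every .R file in a folder into the calling environment,
# roughly what Shiny 1.5 is expected to do for the R/ directory.
source_helpers <- function(dir = "R", envir = parent.frame()) {
  files <- sort(list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE))
  for (f in files) {
    source(f, local = envir)
  }
  invisible(files)
}

# In app.R, assuming the files in R/ define `ui` and `server`:
# library(shiny)
# source_helpers("R")
# shinyApp(ui, server)
```

That replaces a stack of individual source() calls with one line, and it can simply be deleted once Shiny does this for you.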

But then straight away you’re wondering about where that leaves global.R. For the uninitiated, global.R is a separate file that’s really useful when you’re deploying to a server because it is run only once, when the application is first started, and its contents are then available to every session that connects. Its contents are also available to ui.R, which can be helpful for setting up the UI based on what’s in the data.

As I mentioned I want to teach where the world is going so I’m trying to do things the new way (I have never taught app.R because, honestly, I hate it, but I guess with the new source R/ folder I can see the reasoning, and I’m not going to argue with the folks down at RStudio anyway 🙂 )

I wasn’t sure about where app.R fitted in with global.R and running on a server so I have written a test application. You can see the code here, I’m sorry it won’t run because you don’t have the data and being honest I can’t be bothered to deploy it properly somewhere where you can get it but you get the idea.

The first time the code runs the Sys.sleep(10) runs and you get a big pause. But, sure enough, when you go back it doesn’t run and you get straight in. You can see also that the contents of the datafile are available to the UI (choices = unique(ae_attendances$Name)). Lastly, take my word for it that if you add a file called “restart.txt” to the folder then it will rerun (and generate the 10 second pause) the next time you go to the application, just as I used to do with global.R.
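Since the real test application needs data you don’t have, here is a stand-in sketch of the same shape. The data frame is invented and the pause is shortened to two seconds (the test app used ten); only the layout of the file matters.

```r
# app.R (sketch with made-up data)
library(shiny)

# Everything outside ui and server runs once, when the R process that
# serves the app starts, which is the job global.R used to do.
Sys.sleep(2)  # stand-in for a slow data load
ae_attendances <- data.frame(
  Name = c("Trust A", "Trust B", "Trust A"),
  attendances = c(120, 95, 130),
  stringsAsFactors = FALSE
)

# The data loaded above is available when the UI is built...
ui <- fluidPage(
  selectInput("trust", "Trust", choices = unique(ae_attendances$Name)),
  tableOutput("rows")
)

# ...and inside every session's server function.
server <- function(input, output, session) {
  output$rows <- renderTable({
    ae_attendances[ae_attendances$Name == input$trust, ]
  })
}

# In a real app.R the last line would normally be a bare shinyApp(ui, server);
# it is assigned here so the sketch can be sourced without launching the app.
app <- shinyApp(ui, server)
# runApp(app)
```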

That’s all I know at the moment. I hope this is clear and useful (and correct!), it’s all based on stuff I have cobbled together today looking at Hadley’s book and messing around.

Data science for human beings

Someone just emailed me to ask me about getting into data science. They knew all the usual stuff, linear algebra, Python, all that stuff, so I thought I’d talk about the other side of data science. It’s all stuff I say whenever I talk about data science, but I’ve never written it down so I thought I may as well blog it.

There are three things that are probably harder to learn that will make you stand out at interview and be a better data scientist.

First. Understand your data. I work in healthcare. Some of the data I work with is inaccurate, or out of date, or missing, or misspelled, or just plain wrong. It’s my job to understand these processes. There’s a saying in healthcare that goes something like “60-80% of apparent differences in healthcare providers are related to different practices with data”.

Second. You work in a team. With other data scientists, and with the wider organisation. Don’t be a hero coder, go off to your bunker and write this piece of genius that nobody understands and nobody wants. Work agile. Get buy in as you go. Mentor people. Be humble. Listen to what people want. Don’t do analysis because it’s flashy and cool. Build what your users want. Get to know them. Understand UX and UI. There’s a saying I like that goes “We’re all smart. Distinguish yourself by being kind”. Be a team player. Share the glory, and the blame.

Third. Have an opinion. Don’t just learn every method going and apply them according to whatever medium.com posts are saying this month. Scan the horizon. Find new stuff. Dig out old stuff. Think critically about the work that you and others are doing. And sometimes, when you can, go in large. So far in my career I’ve bet large on Shiny and text mining, and they’ve both paid off for the organisation I work for and for me. My latest pick is {golem}. I think it’s going to be massive and I want to be near the front of the pack if and when it is. Trust yourself. It’s your job to support your organisation with their priorities, but it’s also your job to know stuff they don’t know and to push your organisation along a bit. I’ve never done anything really significant in my career that somebody asked me to do. I’ve pitched all my really significant projects, although obviously I spend most of my time building stuff people want and ask for (see point two).

NHS data science and software licensing

I’m writing something about software licensing and IP in NHS data science projects at the moment. I don’t think I ever dreamed about doing this, but I’ve noticed that a lot of people working in data science and related fields are confused about some of the issues and I would like to produce a set of facts (and opinions) which are based on a thorough reading of the subject and share them with interested parties. It’s a big job but I thought I’d trail a bit of it here and there as I go. Here’s the summary at the end of the licences section.

The best software licence for a data science project will vary case by case, but there are some broad things to consider when choosing one. The most important decision to make is between permissive and copyleft licences. Permissive licences are useful when you want to maximise the impact of something and are not worried about what proprietary vendors might do with your code. Releasing code under, for example, an MIT licence allows everybody, including individuals using proprietary code, a chance to use the code under that licence.

Using a copyleft licence is useful when there is concern about what proprietary vendors might do with a piece of code. For example, vendors could use some functionality from an open source project to make their own product more appealing, and then use that functionality to sell their product to more customers, and then use that market leverage to help them acquire vendor lock in. Vendor lock in is the state in which using a company’s products makes it very difficult to move to another company’s products. An example might be using a proprietary statistical software package and saving data in its proprietary format, making it difficult to transfer that data to another piece of software. If proprietary software companies can use code to make the world worse in this way, then choosing a copyleft licence is an excellent way of sharing your code without allowing anybody to incorporate it into a proprietary codebase. Proprietary software companies are free to use copyleft licensed code, but are highly unlikely to do so since it means releasing all of the code that incorporates it.

The Great Survey Munge

As I mentioned on Twitter the other day, I have this rather ugly spreadsheet that comes from some online survey software that requires quite a lot of cleaning in order to upload it to the database. I had an old version written in base R but the survey has changed so I’ve updated it to tidyverse.

And this is where tidyverse absolutely shines. Using it for this job really made me realise how much help it gives you when you’ve got a big mess and you want to rapidly turn it into a proper dataset, renaming, recoding, and generally cleaning.

It must be half the keystrokes or even less than the base R script it replaces. There are some quite long strings in there, which come from the survey spreadsheet, but that’s all just cut and paste, I didn’t write anything for them.

I love it profoundly, and I bet if I was more experienced I could cut it down even more. Anyway, here it is for your idle curiosity. It is obviously of no use to anybody who isn’t working on this data and to be honest there are bits that you probably won’t even understand, but just a quick look should show you just how much help I had from the wonderful tidyverse maintainers.

https://gist.github.com/ChrisBeeley/488c3a8fa35b57d8b40232d70e1dfdc9

Rapidly find the mean of survey questions

Following on from the last blog post, I’ve got quite a nice way of generating lots of means from a survey dataset. This one relies on the fact that I’m averaging questions that go 2.1, 2.2, 2.3, and 3.1, 3.2, 3.3, so I can look for all questions that start with “2.”, “3.”, etc.

library(tidyverse)

survey <- survey %>% bind_cols(
  
  map(paste0(as.character(2 : 9), "."), function(x) {
    
    return_value <- survey %>% 
      select(starts_with(x)) %>% 
      rowMeans(., na.rm = TRUE) %>% 
      as.data.frame()
    
    names(return_value) <- paste0("Question ", x)
    
    return(return_value)
  }) 
)

This returns the original dataset plus the averages of each question, labelled “Question 2.”, “Question 3.”, etc. (the trailing full stop comes from the prefix used to match the columns).
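To see it working on something small, here is the same pattern run against a toy survey. The four columns and their values are invented, and only questions 2 and 3 are averaged. It loads {dplyr} and {purrr} rather than the whole tidyverse.

```r
library(dplyr)
library(purrr)

# Toy survey: two parts to question 2 and two parts to question 3
survey <- tibble(
  `2.1` = c(1, 2),
  `2.2` = c(3, NA),
  `3.1` = c(5, 6),
  `3.2` = c(7, 8)
)

with_means <- survey %>% bind_cols(
  map(paste0(2:3, "."), function(x) {

    # Row-wise mean of every column whose name starts with e.g. "2."
    return_value <- survey %>%
      select(starts_with(x)) %>%
      rowMeans(na.rm = TRUE) %>%
      as.data.frame()

    names(return_value) <- paste0("Question ", x)

    return(return_value)
  })
)

# The new columns keep the full stop from the prefix
names(with_means)
```

Note that the missing value in “2.2” is simply left out of that row’s mean because of na.rm = TRUE.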

Converting words on a survey dataset to numbers for analysis

As always, I have very little time for blogging (sorry) but I just came up with a neat way of converting “Strongly Agree”, “Always”, all that stuff that you get on survey based datasets into numbers ready for analysis. It’s automatic, so it will play havoc with any word based questions; analyse them in a separate script.

Here it is

library(tidyverse)

survey <- map_df(survey, function(x) {
  
  if(sum(grepl("Agree", unlist(x), ignore.case = TRUE)) > 0) {
    
    return(case_when(
      x == "Strongly agree" ~ 10,
      x == "Agree" ~ 7.5,
      x == "Neither agree nor disagree" ~ 5,
      x == "Disagree" ~ 2.5,
      x == "Strongly disagree" ~ 0,
      TRUE ~ NA_real_
    ))
  } else if(sum(grepl("Always", unlist(x), ignore.case = TRUE)) > 0){
    
    return(case_when(
      x == "Always" ~ 10,
      x == "Often" ~ 7.5,
      x == "Sometimes" ~ 5,
      x == "Rarely" ~ 2.5,
      x == "Never" ~ 0,
      TRUE ~ NA_real_
    ))
  
  } else if(sum(grepl("Dissatisfied", unlist(x), ignore.case = TRUE)) > 0){
    
    return(case_when(
      x == "Very satisfied" ~ 10,
      x == "Satisfied" ~ 7.5,
      x == "Neither satisfied nor dissatisfied" ~ 5,
      x == "Dissatisfied" ~ 2.5,
      x == "Very dissatisfied" ~ 0,
      TRUE ~ NA_real_
    ))

  } else {
    
    return(x)
  }
})

Glorious. R makes it too easy, really, I think, sometimes 🙂
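For comparison, the same scoring can be written as lookup tables in base R. This is an alternative sketch rather than the code above, and it differs in one respect: a column is recoded when any of its values exactly matches a label in one of the scales, instead of being detected with grepl(). The toy data frame and the names `scales` and `recode_column` are made up.

```r
# One named vector per response scale: label -> score
scales <- list(
  agreement = c("Strongly agree" = 10, "Agree" = 7.5,
                "Neither agree nor disagree" = 5,
                "Disagree" = 2.5, "Strongly disagree" = 0),
  frequency = c("Always" = 10, "Often" = 7.5, "Sometimes" = 5,
                "Rarely" = 2.5, "Never" = 0),
  satisfaction = c("Very satisfied" = 10, "Satisfied" = 7.5,
                   "Neither satisfied nor dissatisfied" = 5,
                   "Dissatisfied" = 2.5, "Very dissatisfied" = 0)
)

# Recode a column with the first scale it matches; otherwise leave it alone.
recode_column <- function(x) {
  for (scale in scales) {
    if (any(x %in% names(scale))) {
      # Labels not in the scale (and NAs) come back as NA
      return(unname(scale[as.character(x)]))
    }
  }
  x
}

toy <- data.frame(
  q1 = c("Strongly agree", "Disagree", NA),
  q2 = c("Often", "Never", "Sometimes"),
  comment = c("Great", "Fine", "Too slow"),
  stringsAsFactors = FALSE
)

toy[] <- lapply(toy, recode_column)
toy
```

The same warning applies: a free text column that happens to contain exactly “Agree” or “Never” would be recoded too, so keep word based questions out of it.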

Decade round up post

I’ve got into the habit of writing a yearly roundup blog post (see this blog, passim), based on a suggested framework by David Allen. Since this is the end of a decade I thought it would be fun to do one for the whole decade.

Physical

Regular readers will know that as far as physical goes I have spent most of the decade getting progressively sicker with two different diseases, beating both, and then having complications from the liver transplant that saved me from the first one. Perhaps the pinnacle was having a colectomy/ splenectomy in September 2017, bleeding a lot and being very sick for several weeks afterwards and then running a marathon in 3 hours 44 minutes just 18 months later in February 2019. I went on to win silver in the 5K in the British Transplant Games in July of this year, but am not currently running due to complications from my liver transplant. I’ll be fine, I need to go and see the doctors next year, they’ll sort me out.

Spiritual

I started meditating in 2019 to help me to manage my feelings about being ill. I came to rely emotionally on running and being successful at running and I found that when that was taken away it was hard to cope. I’ve been meditating regularly and have been on a few day long meditation retreats and it’s made me a better person, particularly a better dad, since it helps me to manage my emotions when the kids are playing up. I now regularly attend a Buddhist temple and am trying to dedicate myself to practising compassion for all living things.

Financial

I’m still pretty hopeless with money, and we spent a lot on the bathroom fairly recently. I’ve managed to save up a bit of money and I’m currently trying to save three months’ salary because I gather that’s the minimum you’re supposed to have. I hope that by about 2022 I will be debt free and have three months’ salary tucked away.

Family

This decade my family changed an awful lot, because I had two kids, one in 2011 and another in 2014. I found myself an accidental attachment parent, in fact I wrote about it here https://attachmentparenting.co.uk/manifesto-its-different-for-dads/ if you’re interested. I have two boys, they fight a lot and one of them is extremely defiant and can really press my buttons. They are and will always be my greatest creation, and they’re both kind, intelligent, and funny. I’m pretty confident already the world is a better place with them in it, and I’m sure they’ll continue to make me proud as they get older.

Work

As far as work goes this decade has been absolutely transformative. In 2010 I went on an R course. I used SPSS for my PhD (and some weird thing to do mixed effects modelling, I forget what it was called) but I went on a course teaching regression methods and they introduced R almost incidentally as a tool to do regression analysis. I was working on a patient survey at the time and we were using Excel macros to draw the graphs (me! using Excel macros! Don’t tell anyone, I’ve got a reputation to protect 😉). It was very time consuming and I realised that I could program R to dump all the images to a folder, and then we put them together by hand in Microsoft Word. This took a week! A whole week! Then I realised I could use odfWeave (now orphaned https://cran.r-project.org/web/packages/odfWeave/index.html) to put the whole document together. This really turned heads. This job used to take a week and now I could do it in a few hours (a little tidying up of the formatting was required).

And then in 2013 everything changed again because my Trust won some money from an innovation fund and we decided to put all of our patient experience data online in an interactive dashboard. As always, when I agreed to this I had no idea how to actually do it and I hadn’t heard of Shiny. While I was working on it I heard about Shiny and started prototyping in it, reasoning that I would need to make it into something “proper” later on. In fact, Shiny was more than equal to the task and I became one of fairly few people at the time who were using Shiny in production. I used an early version of Shiny (maybe 0.6) and one of the early versions of Shiny server. Because I was one of a few people I ended up being approached by a publisher to write a book about Shiny and I have now written three editions of it. Now my work is very focused on Shiny and when the people who have heard of me think of me they tend to think Shiny.

The other big thing that has changed the way I work, which is much more recent, is the data science accelerator. The data science accelerator is a 12 day mentorship programme and it’s the best learning experience I’ve had since my PhD. I would highly recommend it, there are more details here https://www.gov.uk/government/publications/data-science-accelerator-programme/introduction-to-the-data-science-accelerator. I’m still in touch with my mentor and he was and is incredibly helpful, supportive, and knowledgeable (thanks, Dan, if you ever see this!).

I am in charge of patient experience in my Trust and I work closely with the Involvement team who do a lot of work making sure the data is collected and read by the right people and that those people do something about it. They’re a great team and they’ve won many awards for what they do, you can find out more about them here http://involve.nottshc.nhs.uk/ (this site will give a scary certificate warning and then redirect you at the moment, we’re between websites, but that’s the link that will still work in 6 months’ time). Anyway they’ve had a big influence on me in that they are not very interested in the ticky box scores, they’re mainly interested in the comments. So I’ve spent a long time trying to build something that they would want to use, and it’s a machine to help them read all the comments. I call it the recommendation engine for patient feedback and you can read all about it here https://chrisbeeley.net/?p=1214. It would never have occurred to me to build it if it weren’t for the Involvement team.

Anyway, the recommendation engine is incredibly difficult and ambitious and will take 5 years or even more but I got started with text mining on the data science accelerator. And now I am very close indeed to securing funding to spend a whole year building a text mining application which will:

  1. Automatically tag comments by theme
  2. Automatically tag comments by sentiment
  3. Use document similarity to find similar comments
  4. Tie it all together with a nice Shiny dashboard
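
To give a flavour of the third item, here is a toy sketch of document similarity in base R: bag-of-words counts compared with cosine similarity. This is just one simple way of doing it, not the method the project will necessarily use, and the three comments are invented.

```r
comments <- c(
  "The staff were kind and helpful",
  "Staff were very kind",
  "Parking was expensive and hard to find"
)

# Lower-case a comment and split it into words
tokenise <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

tokens <- lapply(comments, tokenise)
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per comment, one column per word
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

# Cosine similarity between every pair of rows
cosine_similarity <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

sim <- cosine_similarity(dtm)

# Index of the comment most similar to comment i (excluding itself)
most_similar <- function(i, sim) {
  scores <- sim[i, ]
  scores[i] <- -Inf
  which.max(scores)
}

comments[most_similar(1, sim)]
```

Real comments would need rather more care (stop words, spelling, empty comments, which would divide by zero here), but the principle of “find me comments like this one” is the same.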

This is incredibly exciting and I owe it all to the data science accelerator. I can’t talk about the project in detail yet because we don’t have the funding but once we do I’ll be back to talk about it. I’ll be recruiting a new person hopefully in April, so look out for that too if you’re interested.

The future

In the next decade it will be more of the same, really. I would like to build the biggest and best data science team in any provider Trust in the country, win a medal at the world transplant games, and watch my boys grow into the talented, unique, and kind people I know they will become. That should keep me busy. Can’t wait to do a roundup in 2030 and see how I got on 🙂

My superpower (a talk I gave about being ill)

So this post is nothing to do with R, or Linux, or statistics, or any of the usual stuff. It’s about me. It’s more than possible that you’re not very interested in that so consider yourself warned.

Long time readers will know I’ve had some pretty serious health problems over the years. I haven’t really talked about it much on here. To be honest I was frightened that people might be put off working with me if they thought I would be off sick a lot and I was also frightened at one point that I might actually be permanently unable to work if the liver disease wasn’t treatable. So I avoided talking about it with the wider world, although the people I directly work with know all about it (it’s hard to miss, really, when someone comes to work with end stage liver disease).

Anyway, I’ve since realised that I’m not really helping other people in the same situation being quiet about it, and actually I needn’t have worried and I don’t think my illnesses have put anybody off working with me at all. I’ve missed a few months here and there having surgery, and I was quite unproductive while I waited for my liver transplant, but my employer and others were very understanding and it all worked out in the finish.

So this is my story, condensed. Now seems like a good time to add that actually my new liver is not working very well. I was pretty ill a few months back and have been in hospital twice this year with it, and it looks certain that I will at some point in the medium term need another liver transplant. I’m not going to hide it away like I did last time, I’m going to be open about it and hopefully other people who might be going through something similar for the first time might feel better about their own situation if they see me doing it.

Someone I work with did an absolutely wonderful post about working and being ill, and perhaps I’ll do one at some point, but this will do for now. Actually now I think about it I probably need to stand up and be counted as one of the many people all over the world who have a stoma because I know some people find it very stigmatising when they get one and they hide it away, so perhaps I’ll talk about that another time too. Mine is very often hanging out the bottom of my jumper and stoma bags are absolutely nothing to be ashamed of. I’m actually rather proud of mine since getting it nearly killed me, but that’s a story for another time.

Sorry about the sound quality, I have made a start on subtitles but it takes absolutely ages, I have turned on community uploads if anybody feels like giving me a hand with it.

NHS-R conference

I recently got back from the NHS-R community conference, which was amazing of course, and it’s got me in the mood to share, so I’m writing a few blog posts. I’ve got some more in depth stuff to say about where I think NHS-R is/ should be going, but this is the “feels” one.

As I mentioned on Twitter, I love the NHS and I love R so I’m obviously going to enjoy the NHS-R conference. It was really good and the standard of talks and workshops was really high. Much bigger and better than last year.

It was so inspiring to be there. To be honest I get a bit depressed about the state of analytics in the NHS (as you’ll see from my Twitter) but you would never guess it for those two days. The big feeling that I got while I was there is that I need to up my game. There were people with huge, fully functioning packages, multiple academic publications from their R work, all sorts of complicated machine learning that was beyond my ken, big complicated statistical models, you name it really.

It’s such a wonderful feeling, to escape the inane drudgery of idiotic performance reporting and looking at my seven millionth SPC and being asked if three points of increase is a trend, and to be in the company of people who had done truly remarkable things with R. Remarkable things, I should add, even with the dead hand of bureaucracy and idiotic IT security systems, and all the stuff that I have to do. When Google do cool things, it’s easy to just think “Well, that’s Google. I couldn’t do that where I am”. These people had done stuff inside the NHS and they had the scars to prove it!

I think probably I will do a Shiny workshop at the NHS-R conference every year until I die, but I might try to sneak a paper in there as well actually next time. More on this subject once the work flowers a bit, I hope.