Data science accelerator lesson one- build a pipeline and ship the code!

My exciting news is that I was accepted onto the data science accelerator and have been doing it since late December. My project, basically, is all about using natural language processing to better understand the patient experience data that we collect (and, if I have time, the staff experience data too). Here are the goals:

1) Using an unsupervised technique, generate a novel way of categorising the text data to give us a different perspective. We already have tagged data but I would like to interrogate the usefulness of the tags that we have used
2) a. Generate a system that, given a comment or set of comments, can find other comments within the data that are semantically similar. Note that this system will need to run live on the server, since it would be impossible to store the semantic similarity of every comment to every other comment
2) b. Generate a system that, instead of searching by word, searches by similarity to that word
3) Produce a supervised learning algorithm which can be trained on a sample of tagged comments and then produce tags for comments that it has not previously seen
4) a. Produce a sentiment analysis function that can tag every comment in a database with how positive or negative it is
4) b. Produce reporting functions that can compute overall sentiment for a group of documents (e.g. of a particular service area) and optionally describe the change in sentiment over time

I’m not really sure if I’m going to get through all of it but I’ve made a decent start. I’ve made a Trello board and there’s a GitHub too.

One of the things about the project I haven’t mentioned above is that I want to make something that can easily be picked up and used by other Trusts. There are loads of companies who want to charge money for NHS Trusts to use their black box but I’m trying to make something others can use and build on. So a lot of the work at the end will be on that.

Anyway, I’ll share the work at the end but I’ve learned loads already so I thought I’d talk about that. It’s the best thing I’ve done since my PhD in terms of learning (so I recommend your doing it!) so there are lots of things I want to talk about.

The first one isn’t super complicated or a revelation to anyone but it’s affected the way I work already. It’s this. Ship code! Ship it! Get it out the door!

Up to now, to be honest, my agile working has pretty much been that I spend six months on something, I release it, it’s terrible, and then I make improvements. One of the product managers at GDS told me that they ship code every week. Every week! I couldn’t believe it. So I’m trying to work like that. Doesn’t matter if it isn’t perfect, doesn’t matter if some of the functionality is reduced, just get something in the hands of your users. Then if they hate it you can avoid spending a month building something they hate.

And, something related to that, start with a pipeline. Right at the start of an analysis, start building the outputs. This helps you to know what all this analysis is actually going to do. It helps you to make the analysis better. And it gives you code that you can ship. Build something that works and does something and give it to your users. They will start giving you feedback before you’ve even finished the analysis. Too often we start with the analysis and only think about the endpoint when we’ve finished. Maybe it’s the wrong analysis. Maybe what you’re doing is clever but no-one cares. Build a pipeline and get it out the door. Let your users tell you what they want.

More on this as the project proceeds

New Year’s post

So this is my annual New Year’s post, it’s an idea from David Allen that I’ve done before.

2018 was the year that I was finally (pretty much) better. I had my bowel and spleen removed in August 2017 and bled in quite a scary way, but by the time 2018 rolled around I was running 9 miles, building mileage ready for a marathon in May. It’s been a great year. I did have some pretty scary health problems (that I won’t go into) but it all worked out in the finish and I’m pretty much back to working and being happy with my family just as I was all the way back in 2014 before everything started to go wrong.

So the first big news of 2018 was I got a new job (March). I’m now half time in my old job, and half time in the new one. They both link together, in that I do Shiny code and server admin to facilitate several dashboards that we use for staff and patient experience and some clinical stuff too. I’m absolutely loving both jobs. We’re doing a lot of stuff that is pretty new in the NHS, we’re the first Trust that I’m aware of with a Shiny Pro licence, and I’ve talked to people all over the country about what we’re doing. My Trust is very supportive and it’s all going really well.

I suppose my next big thing was running a marathon in May. It wasn’t as fast as I would have liked (four hours and thirty three minutes), but I did have quite a few pretty serious problems with being ill so I did pretty well considering. I’ve got another one in 2019, more on which later. Next up was the British Transplant Games (July). It was my first time competing and I won bronze at 5K and 1500m, which was very nice. My big goal now is to qualify for the world games, I’m guessing silver would be enough to do that, partly depends on the time I guess, too.

For the first time in my life I have a savings account, which is absolutely great, and I’m trying (and failing) to save three times my salary by February 2019. My wife and kids are a lot happier now I’m better, we all really went through hell so it’s been a great year just doing normal stuff you can’t do when you’re ill, like go abroad.

And the most recent thing that’s exciting is starting the data science accelerator. I keep meaning to blog more about it, I’m overbusy just doing it, but there’s a GitHub with some of the first steps on here. Text analysis is really easy to do, and really hard to do well, so I’m really glad that I’ve got a mentor to guide me through it. I’m working really hard trying to build some robust analysis and hopefully deploy some sort of reusable pipeline that other NHS Trusts can use. I’ve been dabbling in Python, too, which I really want to do more of. I feel like having other languages could help me build more and better stuff more easily. I’ve really bought into the whole agile development thing, as well, that’s a part of the accelerator, and one of the talks by a product manager was fascinating.

So that’s it for 2018. I’m starting 2019 strong. Very strong. My health was a teensy bit wobbly this time last year but I’m as strong as an ox at the moment. It really feels good after being frightened and weak for such a long time. At work I want to tie up all the threads and be a part of an agile data science team, producing robust statistical analyses using interactive web based technologies (like Shiny!).

Oh yes! That reminds me. Of course last year I also wrote a book and I started teaching statistics to Trust staff- advanced stuff in 12 one hour tutorials and a quick review in a 2 and a half hour lecture. The teaching has been going really well, I’ve been getting a lot of good feedback and I’m really hoping it helps my Trust do better with the data it collects.

I live a blessed life. All I want to do in 2019 is keep doing the same stuff I already do and love. Run a faster marathon (sub 4 hour). Get a silver medal. Be part of a bigger, better data science team. Keep doing all the stuff we’re doing with Shiny and text analysis and help other Trusts to do it, too. And more than anything else I want my family to just live a normal life and forget about all that stuff between 2015 and 2017.

Oh yeah. And I want to solve the Rubik’s cube, too. My eldest got one for Christmas and it’s got me hooked.

That’s me for 2019. Good luck with whatever you’ve got on your plate for the year 🙂

A tidy text talk and some stuff from my Shiny book

This is just a quick post to show some of the stuff that I’ve published/ presented/ put on GitHub recently.

So my Shiny book, Web Application Development with R using Shiny, 3rd edition, is out, and I’ve forked the code from the publisher’s repository onto my own GitHub to make it easier to find.

And I’ve got another book about Shiny out which is all about UI development with Shiny, and I’ve forked the code from that onto my GitHub, too

I’ve also put the talk I did about text processing at the Nottingham useR onto my GitHub as well.

Find the Shiny book code here.
The UI development book is here.
And the tidy text talk here.

Drawing stacked barcharts the tidyverse way

Don’t judge me, but I do spend quite a lot of time making stacked barcharts of survey data. I’m trying to go full tidyverse (see this blog, passim) and there is a very neat way of doing it in the tidyverse which is very easy to expand out into functions, or purrr, or whatever. Of course, me being me I can never remember what it is so I end up reading the same SO answers over and over again.

So here, for all time, and mainly for my benefit, is the code to take a dataframe with the same Likert items over and over (in this case, Always/ Usually/ Sometimes/ Never) and find the proportions, change to percentages, make the x axis labels at an angle so they fit, put the stacking in the right order, and remove missing values. It’s pretty much completely generic, just give it a dataframe and the factor levels and it will work with any dataset of lots of repeated Likert items.

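library(tidyverse)

theData %>% 
  gather() %>%                 # wide to long: one row per answer, in columns key and value
  group_by(key) %>% 
  count(value) %>%             # count each response within each item
  filter(!is.na(value)) %>%    # remove missing values
  mutate(prop = prop.table(n) * 100) %>%   # proportions within each item, as percentages
  mutate(value = factor(value, levels = c("Always", "Usually", "Sometimes", "Never"), 
                        labels = c("Always", "Usually", "Sometimes", "Never"))) %>% 
  ggplot(aes(x = key, y = prop, fill = value, order = -as.numeric(prop))) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) # angled x axis labels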

Et voila!
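Since I said it expands out into functions easily, here is a rough sketch of what that might look like. The function names (likert_props, likert_plot) and the toy data are made up for illustration, and I've split the counting from the plotting so the percentages can be checked on their own. Treat it as one way of doing it under those assumptions, not the definitive version.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# hypothetical helper: percentage of each response within each item
likert_props <- function(df, item_levels) {
  df %>%
    gather() %>%                      # wide to long: columns key and value
    filter(!is.na(value)) %>%         # remove missing values
    count(key, value) %>%             # count each response within each item
    group_by(key) %>%
    mutate(prop = n / sum(n) * 100,   # percentages within each item
           value = factor(value, levels = item_levels)) %>%
    ungroup()
}

# hypothetical helper: stacked barchart drawn from those percentages
likert_plot <- function(df, item_levels) {
  likert_props(df, item_levels) %>%
    ggplot(aes(x = key, y = prop, fill = value)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
}

# made-up toy data, two items with one missing answer
toyData <- data.frame(
  Listened = c("Always", "Usually", "Always", NA),
  Involved = c("Never", "Sometimes", "Always", "Always"),
  stringsAsFactors = FALSE
)

props <- likert_props(toyData, c("Always", "Usually", "Sometimes", "Never"))
```

Calling likert_plot(toyData, c("Always", "Usually", "Sometimes", "Never")) then draws the chart, and the same two functions should work with any other set of levels you hand them.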


Produce tables in loop

Another one for me before I forget, I seem to have terrible problems writing headings and tables in a loop in RMarkdown. Everything gets crushed together because the line returns don’t seem to work properly and either the table or the headings get mangled. Well here it is, the thing I just did that worked:

for(i in 1:2){

  cat("Heading ", i, "\n")

  # print the table for this iteration here
}

That’s it! Don’t put any other weird “\n”s or [HTML line breaks which I now realise WordPress will interpret literally] in. You don’t need them!

The shinyverse

I’m pretty sure I’ve made up the word shinyverse, to refer to all the packages that either use or complement Shiny. This is a non-exhaustive list of shinyverse packages, mainly for my benefit when I inevitably forget, but if it’s useful for you too then that’s good too.

shiny.semantic adds support for the Semantic UI library to Shiny. Useful if you want your Shiny apps to look a bit different or there’s something in the library that you need.

shinyaframe doesn’t particularly sound like the kind of thing that everyone reading this will drop everything to do on a rainy weekend, but it’s awesome so it’s worth mentioning here. It allows you to use Mozilla A-Frame to create virtual reality visualisations with Shiny!

shinycssloaders does very nice simple loading animations for while your Shiny application is thinking about how to draw an output.

shinydashboard is so commonly used that I tend to think of it as what you might call “base-Shiny” (another term I made up). If you’ve never made a Shiny dashboard I suggest you have a look.

shinyWidgets is a lovely package that extends the number of widgets you can use. I’ve used the date picker in the past but there are plenty to choose from.

shinythemes is another one that feels like base-Shiny. Change the appearance of your Shiny apps very easily by selecting one of several themes in the UI preamble.

shinytest is for testing Shiny applications.

shinymaterial allows you to use a UI design based on the Google Material look.

shinyLP allows you to make a landing page for Shiny apps, either to give more information or to allow a user to select one of several applications.

shinyjs. Another base-Shiny one. Easily add canned or custom JavaScript.

shinyFiles allows the user to navigate the file system on the server.

bsplus allows you to add lots of things which are in Bootstrap but not in Shiny to your apps (and to RMarkdown, to boot).

rintrojs allows you to make JavaScript powered introductions to your app. Pretty neat.

crosstalk is seriously neat and allows you to use Shiny widgets that have been programmed to interact with each other, or write this interaction yourself if you’re a package author. Once you or someone else has done this, Shiny (or RMarkdown) outputs can automatically react to each other’s state (filtering, brushing etc.)
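To give a flavour of how little code a couple of these need, here is a minimal sketch using two of the packages above, shinythemes and shinycssloaders. The theme choice, the input and output names, and the artificial one second delay are all my own assumptions for the sake of the example.

```r
library(shiny)
library(shinythemes)
library(shinycssloaders)

ui <- fluidPage(
  theme = shinytheme("flatly"),        # shinythemes: pick a theme in the UI preamble
  titlePanel("Shinyverse sketch"),
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  withSpinner(plotOutput("scatter"))   # shinycssloaders: spinner while the plot is drawn
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    Sys.sleep(1)                       # artificial delay so the spinner is visible
    plot(rnorm(input$n), rnorm(input$n))
  })
}

# shinyApp(ui, server) # uncomment to run the app interactively
```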

Filling in non missing elements by row (Survey Monkey masterclass)

There must be other people out there doing this, so I’ll share this neat thing I found today. Quite often, you’ll have a dataset that has a lot of columns, only one of which will have something in it for each row. Survey Monkey, in particular, produces these kinds of sheets if you use branching logic. So if you work in one bit of the Trust, your team name will be found in one column, but if you branched elsewhere in the survey because you work somewhere else, it is in another column. Lots of columns, but each individual has only one non-missing element across all of them (because they only work in one place and only see that question once).

In the past I’ve done hacky things with finding non missing elements in each column line by line and then gluing it all together. Well today I’ve got 24 columns so I wasn’t going to do it like that. Sure enough, dplyr has a nice way of doing it, using coalesce.

coalesce(!!!(staffData %>%
               select(TeamC1 : TeamC24)))

That’s it! It accepts vectors of the same length or, as in this case, a dataframe with !!! around it. Bang bang bang (as it’s pronounced) is like bang bang (!!) except it substitutes lots of things in one place instead of one thing in one place. So it’s like feeding coalesce a lot of vectors, which is what it wants.
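Here is a tiny worked example so you can see what comes out. The data frame staffToy and its three columns are made up (the real thing has 24), but the call is the same shape as the one above.

```r
library(dplyr)

# made-up data: each person has a team name in exactly one column
staffToy <- data.frame(
  TeamC1 = c("Ward A", NA, NA),
  TeamC2 = c(NA, "Ward B", NA),
  TeamC3 = c(NA, NA, "Ward C"),
  stringsAsFactors = FALSE
)

# splice the selected columns into coalesce as separate vectors
team <- coalesce(!!!(staffToy %>%
                       select(TeamC1 : TeamC3)))

team
```

For each row, coalesce takes the first non-missing value it finds working left to right across the columns, so team ends up as a single vector with one team name per person.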

I really can’t believe it’s that simple.

In praise of awkward questions

I went to a conference last week, more of a meet up really, and they presented the results of the work that we’ve all been doing, indicating that there were several statistically significant improvements in the expected direction.

I’m sure the analysis is well intentioned and basically correct, so I didn’t really have any problem with it, but my arm shot up anyway, because I wanted to see more details of the analysis. The results were so good, I was just curious to see how they were so good really- what were the 95% confidence intervals, the sample sizes, alpha levels, just the nitty gritty of the analysis so I could really get all the detail of it.

But they didn’t have it. They didn’t have much on the slides, and they hadn’t brought any supplementary materials.

I don’t have a problem with that either. I think probably the reason why they didn’t is they don’t usually have stats wonks sitting at the back asking awkward questions. So they didn’t feel the need to be prepared.

I cut my teeth (while doing my PhD) at academic conferences. Asking awkward questions about statistical analyses is a spectator sport there. And I love it. Everyone’s a black hat, just waiting to crawl inside your work and blow it apart. It’s like a duel, like a competition. And of course that’s how science works. Everyone’s desperate to prove you wrong, and if you can stay at the top, then fair play to you. Probably something in it.

There isn’t enough of that where I work, in the NHS. It really reinforced to me the need to keep going with what I’m doing. I want to train, face to face, everyone in my whole organisation who works with data. Two hours or fifteen, I want to equip them to ask awkward questions.

And one day, I want to sit at the back of the room with my arms folded and watch a whole roomful of arms shoot up.

THEN I can relax.

Jupyter, Python, interactive web frameworks, and more

A couple of days ago on Twitter I said the following:

“Increasingly, RStudio’s products are so good that I feel a lot of my advice to my organisation is “buy a lot of RStudio products”. I love RStudio (I have a tattoo of their logo on my arm, even!) and they clearly give a lot of stuff away (we used their products for nothing for *years*). But I wish I could at least acknowledge some competition in this arena. As far as I can see, if you want to develop a cutting edge data science team with R, it’s RStudio all the way.

I just feel like a brand ambassador rather than someone giving solid, independent advice. I think it’s good advice, don’t get me wrong, but what are the alternatives if you want to interact with R over an authenticated connection other than Shiny with Server Pro?”

I’ve been thinking about it more and more over the last few days and I think my perspective has shifted because my role in my organisation has changed a little. For quite a long time I’ve been churning out Shiny code (and PHP/ MySQL) to run our patient experience data portal. But now I’ve zoomed out a bit and although I’m still doing that I’m working on a couple of other projects and have started to think about how we build our team- using Git, coding collaboratively, enforcing a code style. I’m starting to think about tools.

Now I love RStudio. As I say in the tweet I have a tattoo of their logo on my arm. This is not a metaphor. An actual tattoo on my actual arm. But I’m starting to get concerned that I’m so focused on R and Shiny that I’m starting to miss the big picture. It’s my job to know about other approaches and to understand the benefits and drawbacks of each. Even if I end up saying “There are two other ways of doing this that don’t involve Shiny, but they’re both too difficult/ expensive/ unreliable/ whatever” then fine. But I just don’t feel comfortable advising my organisation without a wider view.

So I’m going to veer off a bit, test the water. I feel sure this will involve Python, Jupyter, and whatever reactive type programming thing Python people use (see? I’m clueless). I know a bit of Python so it shouldn’t be too arduous. And I’ll poke around the rest of the space too. Julia. Other ways of interacting with R that don’t involve Shiny.

As always, I’ll report back once I’ve made head or tail of it, which could be a while.

Producing several plots at once with RMarkdown and purrr

Purrr example

This is a very simple example of using purrr and RMarkdown to produce several plots all at once.

library(dplyr)
library(purrr)

invisible( # suppress console output from hist()
  map(c(5, 6, 7), function(x) { # values to feed to filter function
    iris %>%
      filter(Sepal.Length < x) %>% # just Sepal.Length < 5, 6, and 7
      pull(Petal.Width) %>% # extract Petal.Width vector
      hist(breaks = 20, main = paste0("Histogram of < ", x)) # graph
  })
)
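
If the only reason for the invisible() is to stop map() printing its list of results, purrr also has walk(), which calls a function purely for its side effects and returns its input invisibly. This is a sketch of the same plots done that way, not a claim that it's better. The name cutoffs is just mine.

```r
library(dplyr)
library(purrr)

cutoffs <- c(5, 6, 7) # same values as above, named for illustration

walk(cutoffs, function(x) { # walk() is for side effects, so nothing is printed
  iris %>%
    filter(Sepal.Length < x) %>% # just Sepal.Length < 5, 6, and 7
    pull(Petal.Width) %>% # extract Petal.Width vector
    hist(breaks = 20, main = paste0("Histogram of < ", x)) # graph
})
```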