Better Git commit messages

Something else I’m trying to be better at is using Git. I did use it, briefly, a few years back but I never quite got the hang of it and I’ve reverted to the bad habit of having MainCode/ and TestingCode/ and TryNewFunction/ folders filled with near identical code.

So I’m back on the Git wagon again. Atom (see my previous blog post) has beautiful Git integration, as you’d expect since it was built by the GitHub people. It also enforces a couple of conventions when writing Git commit messages, which prompted a Google search that led me to this guide to writing better commit messages.

I never even thought about the art of it, but, of course, like code comments, good commit messages are essential for collaborating with anyone, even your future self.
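For my own future reference, the commonly recommended format (a sketch of the usual advice, not a quote from any one guide) looks something like this:

```
Summarise the change in about 50 characters or less

After a blank line, explain what changed and why, wrapping the
body at roughly 72 characters. Use the imperative mood in the
subject line, e.g. "Fix crash on empty input", as if completing
the sentence "If applied, this commit will...".
```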

Ellen Townsend: Small talk saves lives — IMH Blog (Nottingham)

It sounds much too simple doesn’t it? Making small talk could save a life. But the truth is, it really could. Today SHRG is supporting the campaign launched by the Samaritans. They are asking us all to be courageous and strike up a conversation with someone if we are worried about them at a railway […]

via Ellen Townsend: Small talk saves lives — IMH Blog (Nottingham)

Filtering data straight into a plot with tidyverse

I’m still trying to go full tidyverse, as I believe I mentioned a while back. It’s clearly a highly useful approach, but on top of this I see a load of code in blogs and tutorials that uses a tidy approach. So unless I learn it I’m not going to have a lot of luck reading it. I saw somebody do the following a little while back and I really like it so I thought I’d share it.

In days gone by I would draw lots of graphs in an RMarkdown document like this:

firstFilteredDataset = subset(wholeData, 
  Date > as.Date("2017-04-01"))

ggplot(firstFilteredDataset, 
  aes(x = X1, y = y)) + geom_... etc.

secondFilteredDataset = subset(wholeData, 
  Date > as.Date("2015-01-01"))

ggplot(secondFilteredDataset, 
  aes(x = X1, y = y)) + geom_... etc.

thirdFilteredDataset = ... etc.

It’s fine, there’s nothing wrong with doing that, really. The two drawbacks are firstly that the code looks a bit ungainly, creating lots of objects that are used once and then forgotten about, and secondly that it fills your RAM with data. That’s not really a problem on my main box, which has 16GB of RAM, but it’s a bad habit, and you may come unstuck somewhere else where RAM is more limited, like when you’re running code on a server.

So I saw some code on the internet the other day and they just piped data straight from a dplyr filter statement to a ggplot instruction. No muss, no fuss, the data is defined in the same function in which it’s used, and you’re not making loads of objects and leaving them lying around. So here’s an example:


library(dplyr)   # or library(tidyverse)
library(ggplot2) # mpg comes from ggplot2

mpg %>% 
  filter(cyl == 4) %>%
  group_by(drv) %>%
  summarise(highwayMilesPG = mean(hwy)) %>%
  ggplot(aes(x = drv, y = highwayMilesPG)) +
  geom_bar(stat = "identity")

There’s only one word for it: it’s tidy! I like it!
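The same trick applies to the date-filtering example from the top of the post. Here’s a sketch, with a toy stand-in for wholeData since the original data isn’t shown:

```r
library(dplyr)
library(ggplot2)

# toy stand-in for wholeData, just so the pipe below runs
wholeData = data.frame(
  Date = seq(as.Date("2017-01-01"), by = "month", length.out = 12),
  X1 = 1:12,
  y = rnorm(12)
)

# filter straight into the plot, no intermediate object left lying around
wholeData %>%
  filter(Date > as.Date("2017-04-01")) %>%
  ggplot(aes(x = X1, y = y)) +
  geom_point()
```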

One editor to rule them all: Atom

I’m very happy using RStudio for all my R code. It goes without saying that the support for R coding built into RStudio is phenomenal. If you don’t know loads of cool stuff RStudio does, you’re missing out, but that’s a blog post on its own.

I’ve never quite been happy with my choice for other general editing, though. Sometimes I write PHP, HTML, markdown, Python, or something else, and I’ve never really found an editor that I love. Geany is pretty good and that’s what I have been using when I write PHP or HTML. I tended to write markdown in RStudio, which is kind of stupid, since RStudio is an awfully big hammer to crack that nut, but it does support markdown and I’m familiar with RStudio, so I was happy enough doing that. I never really found a Python IDE that I loved. As far as I can tell there isn’t really an RStudio equivalent in the Python world, something so well featured and brilliant that it’s really the only choice unless you have a very particular reason to use something else.

So about a year ago I gave Atom a try. It had been out of beta for about a year by that point. I don’t really remember it clearly now but it seemed a bit clunky and I just rapidly gave up (to be fair, this may have just been me being thick, I’ve no idea how much it has really improved since). It just didn’t grab me. I keep seeing it mentioned everywhere and I thought I would give it another go.

This time I was hooked straight away. It’s described as a “hackable editor for the 21st Century” and that’s the real strength of it. The actual interface is very clean and simple, no bells and whistles, but it comes bundled with some plugins and there is a thriving ecosystem of user-contributed packages that can make Atom, it seems so far, anything you want it to be.

I think I love Atom for the same reason I love R. It has a big ecosystem of packages around it, and whatever problem you want to solve, as Apple almost said of the iPhone, “there’s a package for that”.

Your needs will be different from mine, of course, but I recommend you give Atom a try if you haven’t already. It supports Markdown preview out of the box. So far I have installed two packages: platformio-ide-terminal and script. Platformio-ide-terminal allows you to spawn a terminal underneath your code window, which I have been mainly using to run pandoc on my markdown files. Script will run your code for you (sections, the whole thing, etc.), all with a shortcut key. So far I’ve been using that for testing Python scripts. Oh yes, and the markdown editor supports word completion out of the box too; not code, just normal words, which is more useful than it sounds.

While I’ve been Googling Atom to find the links to put in this post, I have found two really cool things that I didn’t know about. Firstly, there is the Hydrogen package, which allows Jupyter-like functionality in Atom. If you don’t know what Jupyter is, you should find out, but essentially it allows you to weave together your code and output, just like you can with RMarkdown.

And secondly, Atom themselves have just released teletype, a tool that allows collaboration on code files right inside the Atom editor. I don’t really need to do that, not that I can think of anyway, but you have to admit it’s pretty awesome. They’ve solved a lot of the problems with code collaboration elsewhere as well; have a look at the blog post for more details.

So go give Atom a try. I’ll try to post any more Atom-related awesomeness that I see on my travels.

Lazy tables with R and pander

One of the many things I love about R is how gloriously lazy it can help you to be. I’m writing a report at the moment and I need to make lots of tables in R Markdown. I need them to be proportions, expressed as a percentage, rounded to 0 decimal places, and I need to add (%) to each label on the table. That’s a lot of code when you’ve got 8 or 10 tables to draw, so I just made a function that does it. It takes two arguments: the variable you want tabulated, and the order in which you want the table. I need to specify the order manually because the default alphabetical ordering doesn’t work with all of the data that I want, as in the example here. Without ordering manually, the “Less than one year” category appears at the end.

Here’s a minimal example:


library(pander)

niceTable = function(x, y) {
  tempTable = round(
    prop.table(
      table(factor(x, levels = y))
    ) * 100, 0)
  names(tempTable) = paste0(names(tempTable), " (%)")
  pander(tempTable)  # render nicely in R Markdown
}

a = c(rep("Less than one year", 3), rep("1 - 5 years", 4), 
    rep("5 - 10 years", 2))

niceTable(a, c("Less than one year", 
    "1 - 5 years", "5 - 10 years"))

Boom! Instant laziness. Unless my boss is reading, in which case it’s efficiency 🙂

Quarters and modulo arithmetic

This is another post that’s mainly for my benefit when I inevitably forget. I’m working with dates in PHP which, unlike MySQL, does not have a built-in quarter function for extracting the quarter from a date.

Even if it did, one would have to be very careful with it because quarters are actually defined differently in different countries. In the UK, where I am, April to June is the first quarter (and January to March the last). In other countries, including (I think) the US, the first quarter is January to March, and October to December the last.

With no quarter function, and the hypothetical threat of an unpredictable one, my next recourse is to divide the month by 3 in order to give me a quarter number. This is done very simply as:

echo ceil(date("m") / 3);

This will return 1 when the month is 1 to 3 (January to March), 2 when it’s 4 to 6 (April to June), etc. But as I mentioned before, UK quarters don’t work like this. They don’t run in the sequence 1, 2, 3, 4; they go 4, 1, 2, 3. Now I could write code that deals with each of those cases individually, converting 1 to 4, 2 to 1, 3 to 2, etc., but I thought it was better to do it properly (and store up generalisability for the future) by converting the 1, 2, 3, 4 sequence with modulo arithmetic. I confess, I still can’t work out how to do this other than by trial and error, but it definitely involves modulo 4 because the sequence has period 4.

To convert 1, 2, 3, 4 to 4, 1, 2, 3 one need only do the following, where x is the input value 1, 2, 3, 4:

4 - (5 - x) %% 4

As far as I can figure it, the first constant (4) sets how big the values in the output sequence are (4, 1, 2, 3 or 10, 7, 8, 9) and the second constant (5) tells the series where it should flip back to the start (4, 1, 2, 3 or 3, 4, 1, 2).
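To sanity-check the arithmetic outside PHP, here is the whole month-to-UK-quarter pipeline as a small Python sketch (uk_quarter is a made-up name; the formula is the one above, with x being the calendar quarter):

```python
import math

def uk_quarter(month):
    """Map a calendar month (1-12) to a UK-style quarter,
    where April-June is quarter 1 and January-March is quarter 4."""
    calendar_quarter = math.ceil(month / 3)  # the ceil(m / 3) step from the PHP
    return 4 - (5 - calendar_quarter) % 4    # the modulo conversion

print([uk_quarter(m) for m in range(1, 13)])
# [4, 4, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```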

That’s all I know; I’m really busy and don’t have time to think about it any more. Hopefully this will help someone (or me) in two years’ time.

Consuming REST APIs with PHP and CURL

I wasted such a lot of time on this that I must commit it to the internet on the off chance that it helps someone else in the same situation.

If you are using PHP to consume a RESTful API via CURL and you want to manipulate the data you get back it’s very important that you set CURLOPT_RETURNTRANSFER to true. This allows you to collect the response from the server in a variable. If you don’t set this option it will just echo the return to the screen, which is obviously of no use whatsoever.

While I’m here, I may as well mention that if you want json_decode to return an array you need to use json_decode($result, true); otherwise you get an object back. The final code I wrote looks like this:

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // collect the response instead of echoing it
curl_setopt($ch, CURLOPT_URL, "http://YOUR_URL_HERE");
$result = curl_exec($ch);
curl_close($ch);

$result = json_decode($result, true); // giving true to json_decode returns an array

Converting a grouped plyr::ddply() to dplyr

I’m going full tidyverse at the moment and so I’m converting my old plyr code to dplyr. It’s been pretty steady going so far, although I had a bit of difficulty converting an instruction using ddply which carried out a function based on a subgrouping within the data. I wrote a toy example to get it right, so I may as well share it with the internet in case it helps someone else. If you can’t figure out from my description what the function does, just run the code and you’ll see; it’s easier than explaining in words. The following two instructions carry out the same task, the latter being a translation of the former.


plyr::ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))

mtcars %>% 
  dplyr::group_by(cyl) %>% 
  dplyr::mutate(mean.mpg = mean(mpg))
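One small difference worth knowing (my observation, not something from the original conversion): plyr returns a plain data frame, whereas the dplyr version returns a tibble that is still grouped by cyl, which can surprise you in later steps. Dropping the grouping explicitly avoids that:

```r
library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  mutate(mean.mpg = mean(mpg)) %>% 
  ungroup()  # drop the grouping so later verbs see a plain, ungrouped tibble
```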

Munging Patient Opinion data with R

I’ve written a post about using Patient Opinion’s new API, focusing on how to download your data using R but also perhaps useful to those using other languages as a gentle introduction, which can be found here.

This post focuses on the specific data operations which I perform in R in order to get the data nicely formatted and stored on our server. This may be of interest to those using R who want to do something very similar; much of this code will probably work in a similar context.

The code shown checks for the most recent Patient Opinion data every night and, if there is new data that is not yet on the database, downloads and processes it and uploads it to our own database. In the past I had downloaded the whole set every night and simply wiped the previous version, but this is not possible now we are starting to add our own data codes to the Patient Opinion data (enriching the metadata, for example by improving the match to our service tree all the way to team level where possible).

First things first, we’ll be using the lappend function, which I stole from StackExchange a very long time ago and use all the time. I’ve often thought of starting a campaign to get it included in base R, but then I thought lots of people must have that idea about other functions and it probably really annoys the base R people, so I just define it whenever I need to use it. One day I will set up a proper R startup environment and include it in there; I use R on a lot of machines and so have always been frightened to open that particular can of worms.


### lappend function

lappend <- function(lst, obj) {

  lst[[length(lst) + 1]] <- obj

  return(lst)
}

Now we store our API key somewhere we can get at it again and query our database to find the date of the most recent Patient Opinion story. We will issue dbDisconnect(mydb) right at the end, after we have finished the upload. Note that we add one to the date because we only want stories that are more recent than the database, and the API searches on dates after and including the given date.

We need to convert this date into a string like the following “01%2F08%2F2015” because this is the format the API accepts dates in. This date is the 1st August 2015.

library(httr)  # for GET() and add_headers() later on

apiKey = "SUBSCRIPTION_KEY qwertyuiop"

# connect to database

library(RMySQL)  # also loads DBI for the db* functions

theDriver <- dbDriver("MySQL")

mydb = dbConnect(theDriver, user = "myusername", password = "mypassword",
  dbname = "dbName", host = "localhost")

dateQuery = as.Date(dbGetQuery(mydb, "SELECT MAX(Date) FROM POFinal")[, 1]) + 1

dateFinal = paste0(substr(dateQuery, 6, 7), "%2F", substr(dateQuery, 9, 10), "%2F", substr(dateQuery, 1, 4))
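As an aside (my suggestion, not part of the original code), format() plus URLencode() builds exactly the same string as the substr() calls above, with less bookkeeping:

```r
dateQuery = as.Date("2015-08-01")

# format the date, then percent-encode the slashes (reserved = TRUE)
dateFinal = URLencode(format(dateQuery, "%m/%d/%Y"), reserved = TRUE)

dateFinal  # "08%2F01%2F2015"
```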

Now it’s time to fetch the data from Patient Opinion and process it. We will produce an empty list and then add stories to it 100 at a time (the maximum number the API will return from a single request), using a loop to take the next hundred, and the next hundred, until there are no more stories (it’s highly unlikely that there would be over 100 stories in one day, obviously, but there’s no harm in making the code cope with such an eventuality).

The API accepts a value of skip which allows you to select which story you want, in order of appearance, so this is set at 0, and then 100, and then 200, and so on within a loop to fetch all of the data.

# produce empty list to lappend

opinionList = list()

# set skip at 0 and make continue TRUE until no stories are returned

skip = 0

continue = TRUE


while(continue){

  opinions = GET(paste0("",
    skip, "&submittedonafter=", dateFinal),
    add_headers(Authorization = apiKey))

  if(length(content(opinions)) == 0){ # if there are no stories then stop

    continue = FALSE

  } else {

    opinionList = c(opinionList, content(opinions)) # add the stories to the list

    # increase skip, and repeat
    skip = skip + 100
  }
}


Having done this we test whether there are any stories since yesterday. If there are, the length of opinionList will be greater than 0 (that is, length(opinionList) > 0), and we extract the relevant bits we want from each list, ready to build them into a dataframe.

# if there are no new stories just miss this entire bit out

if(length(opinionList) > 0){

  keyID = lapply(opinionList, "[[", "id")
  title = lapply(opinionList, "[[", "title")
  story = lapply(opinionList, "[[", "body")
  date = lapply(opinionList, "[[", "dateOfPublication")
  criticality = lapply(opinionList, "[[", "criticality")

The service from which each story originates is slightly buried inside opinionList, and you’ll need to step through with lapply, extracting the $links element, and then take the object that results and step through it, taking the 5th list element and extracting the element named “id”. This actually makes more sense to explain in code:

# location is inside $links

linkList = lapply(opinionList, "[[", "links")
location = lapply(linkList, function(x) x[[5]][['id']])

# Some general tidying up and we're ready to put in a dataframe

# there are null values in some of these, need converting to NA
# before they're put in the dataframe

date[sapply(date, is.null)] = NA
criticality[sapply(criticality, is.null)] = NA

finalData = data.frame("keyID" = unlist(keyID),
  "Title" = unlist(title),
  "PO" = unlist(story),
  "Date" = as.Date(substr(unlist(date), 1, 10)),
  "criticality" = unlist(criticality),
  "location" = unlist(location),
  stringsAsFactors = FALSE
)

Now we have the data in a dataframe it’s time to match up the location codes that Patient Opinion use with the codes that we use internally. There’s nothing earth shattering in the code so I won’t laboriously explain it all, but there may be a few lines in it that could help you with your own data.

### match up NACS codes

POLookup = read.csv("poLookup.csv", stringsAsFactors = FALSE)

### some of the cases are inconsistent, make them all lower case

finalData$location = tolower(finalData$location)

POLookup$NACS = tolower(POLookup$NACS)

PO1 = merge(finalData, POLookup, by.x = "location", by.y = "NACS", all.x = TRUE)

### clean the data, removing HTML tags

PO1$PO = gsub("<(.|\n)*?>", "", PO1$PO)

PO1$Date = as.Date(PO1$Date)

poQuarters = as.numeric(substr(quarters(PO1$Date), 2, 2))

poYears = as.numeric(format(PO1$Date, "%Y"))

PO1$Time = poQuarters + (poYears - 2009) * 4 - 1

### write PO to database

# all team codes must be present

PO1 = PO1[!$TeamC), ]

# strip out line returns

PO1$PO = gsub(pattern = "\r", replacement = "", x = PO1$PO)

The following line is perhaps the most interesting. There are some smileys in our data, encoded in UTF-8, and I don’t know exactly why, but RMySQL is not playing nicely with them. I’m sure there is a clever way to sort this problem, but I have a lot of other work to do, so I confess I just destroyed them as follows:

# strip out the UTF-8 smileys

PO1$PO = iconv(PO1$PO, "ASCII", "UTF-8", sub = "")

And with all that done, we can use dbWriteTable to upload the data to the database:

# write to database

dbWriteTable(mydb, "POFinal", PO1[, c("keyID", "Title", "PO", "Date", "Time",
             "TeamC")], append = TRUE, row.names = FALSE)

That’s it! All done.

Querying the Patient Opinion API from R

Patient Opinion have a new API out. They are retiring the old API, so you’ll need to transfer over to the new one if you are currently using the old version, but in any case there are a number of advantages to doing so. The biggest one for me is the option to receive results in JSON (or JSONP) rather than XML (although you can still have XML if you want). The old XML output was fiddly and annoying to parse so personally I’m glad to see the back of it (you can see the horrible hacky code I wrote to parse the old version here). The JSON comes out beautifully using the built in functions from the R package httr which I shall be using to download and process the data.

There is a lot more information in the new API as well, including tags, the URL of the story, responses, status of the story (has response, change planned, etc.). I haven’t even remotely begun to think of all the new exciting things we can do with all this information, but I confess to being pretty excited about it.

With all that said, let’s have a look at how we can get this data into R with the minimum of effort. If you’re not using R, hopefully this guide may be of some use to you to show you the general direction of travel, whichever language you’re using.

This guide should be read in conjunction with the excellent API documentation at Patient Opinion which can be found here.

First things first, let’s get access to the API. You’ll need a key, which can be obtained here. Get the HTTP one, not the Uri version, which only lasts for 24 hours.

We’ll be accessing the API over the HTTP protocol; there’s a nice introduction to HTTP linked from the httr package here. We will be using the aforementioned httr package to make the requests.

Let’s get started

# put the subscription key header somewhere (this is not a real API key, of course, insert your own here)

library(httr)

apiKey = "SUBSCRIPTION_KEY qwertyuiop"

# run the GET command using add_headers to add the authentication header

stories = GET("",
              add_headers(Authorization = apiKey))

This will return the stories in the default JSON format. If you prefer XML, which I strongly advise you don’t, fetch the stories like this:

stories = GET("",
              add_headers(Authorization = apiKey, Accept = "text/xml"))

For those of you who are using something other than R to access the API, note that the above add_headers() command is equivalent to adding the following to your HTTP header:

Authorization: SUBSCRIPTION_KEY qwertyuiop
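For non-R readers, the equivalent request from the command line would look something like this (the endpoint URL is a placeholder, not the real one, and the key is the dummy key from above):

```shell
# -H adds the authentication header to the request
curl -H "Authorization: SUBSCRIPTION_KEY qwertyuiop" \
  "https://YOUR_API_ENDPOINT_HERE/opinions"
```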

You can extract the data from the JSON or XML object which is returned very simply by using:

storiesList = content(stories)

This will return the data in a list. There is more on extracting the particular data that you want in another blog post, parallel to this one, here; for brevity we can say here that you can return all the titles by running:

title = lapply(storiesList, "[[", "title")

That’s it. You’re done. You’ve done it. There’s loads more about what’s available and how to search it in the help pages; hopefully this was a useful intro for R users.