Because it’s Friday: animated graphs

I’ve been analysing some data and we’re a bit unsure, as we often are, about its quality. We have a variety of clinicians rating C.tot and R.tot over a four-year period for a set of patients. To summarise, there was some concern that people were rating C.tot consistently and well, but not so with R.tot (I’m not going to explain the variables or where they come from, because it’s not relevant and it’s an internal dataset we’re not ready to share yet).

I’m sure there is a statistical approach (essentially we need to check to see if the R.tot scores are jumping around) but I thought a graphical approach might suffice for now.

So I took all the individuals who had been in the dataset for long enough, put each of their score trajectories on a graph, and then animated it so that you see the first patient, then the first and second, then the first, second, and third, and so on.

Looking at the animation, the R.tot scores actually seem pretty sensible. Have a look and see what you think (click to animate).

It’s so easy to achieve with the mighty ggplot2 and animation packages for R; here’s all the code:


library(animation)
library(ggplot2)
library(reshape2) # for melt()

# reshape to long format: one row per ID/Week/score combination
mydf=melt(mytemp, measure.vars=c("R.tot", "C.tot"), id.vars=c("ID", "Week"))

# draw one png per patient, each showing all trajectories up to that patient
for (i in unique(mydf$ID)) {
  
  png(filename=paste(i, ".png", sep=""))
  
  print(
    ggplot(subset(mydf, ID <= i), aes(x=Week, y=value)) + 
      geom_line(aes(group=ID)) +
      xlab("Week") + 
      ggtitle("Clinical and risk score over time") + # opts() is now deprecated
      facet_wrap(~variable, ncol=1)
    )
  dev.off()
  
}

## change outdir to the current working directory
ani.options(outdir = getwd(), interval=0.05)

# stitch the individual pngs into an animated gif (requires ImageMagick)
im.convert(paste(unique(mydf$ID), ".png", sep=""), output = "animation.gif")
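Incidentally, the animation package can also wrap this save-then-convert pattern in a single call with saveGIF(), which runs your plotting code and manages the temporary files itself. A minimal sketch, assuming the same mydf as above:

saveGIF({
  for (i in unique(mydf$ID)) {
    print(
      ggplot(subset(mydf, ID <= i), aes(x=Week, y=value)) +
        geom_line(aes(group=ID)) +
        facet_wrap(~variable, ncol=1)
      )
  }
}, movie.name = "animation.gif", interval = 0.05)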

Amazing video: Inventing on Principle

I’ve just seen an amazing video: Bret Victor talking about Inventing on Principle.

It should be of great interest to programmers and creatives, and indeed to anyone who needs or wants to come up with new and exciting ideas (everyone, in other words). If you’re not a programmer, stick with the video, because he broadens out the discussion a lot about a third of the way in. And if you are, just get stuck in and watch. You won’t regret it.

Stupidest pie chart of the year 2011

I’ve just been looking at the Stupidest bar chart of the year 2011 and I’ve been inspired to submit my Stupidest pie chart of the year 2011. I won’t say where I obtained it, and I’ve redrawn the data myself to avoid annoying the original authors. Here it is (click to enlarge):

It represents the amount of maternity and paternity leave available in different countries. Pie charts are often not a very good idea, but in this case they are the worst possible idea. Pie charts should be used when the whole pie has some meaning: a whole customer base, all the individuals living in Nottinghamshire, something like that. What does the whole of these pies represent? The total amount of leave across these countries, which has no real-world meaning at all, so the whole point of the pie chart is lost.

Even worse, underneath the pie chart they are forced to write “Spare a thought for parents in the USA and Sierra Leone… paid maternity leave 0 weeks, paid paternity leave 0 weeks.” because you cannot represent 0 on a pie chart! This should have set alarm bells ringing. One better way to plot these data:

These are absolutely factory-fresh, out-of-the-box settings; there are many ways to improve this plot, and other types of plot could be used instead. This plot improves on the previous one in four ways:

1. It is easier to compare levels of leave between countries
2. It is easier to compare levels of each type of leave
3. There is no need for data labels spelling out the number of weeks in each country, which contribute to a very low data:ink ratio
4. Zero values can be displayed, which puts the marginal note about the USA and Sierra Leone on the plot itself!

Code for both:


# two charts side by side
par(mfrow=c(1, 2))

maternity=c(18, 16, 39, 17, 16, 0, 0)
paternity=c(3, 3, 14, 3, 14, 0, 0)
country=c("China", "Holland", "UK", "Greece", "France", "USA", "Sierra Leone")

# the original pie charts: note that the zero values simply vanish
pie(maternity, labels=paste(country, maternity, "weeks"), main="Maternity leave")

pie(paternity, labels=paste(country, paternity, "weeks"), main="Paternity leave")

# 2 minute bar chart

# widen the bottom margin so the vertical country labels fit
par(mar=c(10, 4, 4, 2) + 0.1)

barplot(maternity, names.arg=country, ylab="Maternity leave in weeks", las=3)

barplot(paternity, names.arg=country, ylab="Paternity leave in weeks", las=3)

New Year’s resolutions

It’s that time of year again, so what do I need to keep doing or do differently this year?

1. Keep being organised with Evernote and Getting Things Done. I found recently that I was processing so many bits of paper and information that I just gave up: only the most important stuff got stored in my brain, and the rest was discarded. Having started using Evernote, guided by Getting Things Done, I now have a massive brain back-up to store everything I can’t use at the moment.

2. Produce reproducible reports. I think it’s a very important area of growth in statistical analysis and, moreover, I think it will make my work better (for a start, I’ll spend much longer commenting code and making it readable if it’s being published).

3. Learn something new. I need to learn various bits and pieces relating to interactive graphics this year (Tcl/Tk? Java? HTML5?), but I need to expand my statistical knowledge too. Likely candidates include bootstrapping and Bayesian analysis.

4. Publish, publish, publish. I’m sitting on too many datasets and pieces of work that should be published. Blogging is a good way of sharing small pieces of work and progress but it doesn’t require the care and attention which academic publication entails. I must publish more.

Complex surveys in R

I carried out what I thought was going to be a very simple piece of work the other day, slicing up some of the questions from a publicly available dataset in order to compare Nottinghamshire Healthcare NHS Trust with some other sites based on particular features of interest. I’ll post that work some other time; for now, for those who are interested, the source code is here.

In the meantime, I thought I would share my experience with the book that helped me through this work, which is here.

Complex surveys: A guide to analysis using R, which accompanies the survey package (or vice versa?), was a real joy for me to use. As the author explains in the book, complex survey designs are everywhere, and I confess I have been rather ignoring them in my work. There is just the right mix of code, equations, explanation, and some truly wonderful graphics in this book to equip you to deal with them.
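Just to give a flavour, here is a minimal sketch (not taken from the book) using the survey package’s built-in api dataset: you declare the design once, and the estimation functions then account for the weights and clustering automatically.

library(survey)

# California Academic Performance Index data, bundled with the package
data(api)

# one-stage cluster sample: school districts are the clusters
dclus1 = svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# design-based mean and total, with correct standard errors
svymean(~api00, dclus1)
svytotal(~enroll, dclus1)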

So don’t be frightened of complex survey designs! Beg, borrow, or buy this book and get analysing!

Quick presentations

Just a quick one from me today, with lots of deadlines colliding, which is handy because saving time is what this post is about. I’ve been experimenting with the beamer class for LaTeX, with Sweave, which is another way of saying that I’ve been trying to save time by producing slides for presentations directly from code as analysed by R.

The syntax is very simple, open a document with:

\documentclass{beamer}
\usetheme{Berlin}
\title{Title goes here}
\author{Chris Beeley}
\date{\today}
\begin{document}

Slides are included with:

\begin{frame}
\section{Main findings}
… slide content goes here
\end{frame}
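The .Rnw file then goes through Sweave to produce a .tex file, which LaTeX compiles as usual. One way of driving both steps from within R, assuming the file is called slides.Rnw:

# run the R chunks and write slides.tex
Sweave("slides.Rnw")

# compile the resulting .tex file to pdf
tools::texi2pdf("slides.tex")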

The main strength, of course, is the ability to include graphics straight from R, which is particularly useful in the analysis I’m doing today, because I’m pulling a lot of variables from an SPSS spreadsheet and making comparisons between the national figures and those for my areas. A lot of the code is the same for each graph.

<<fig=TRUE, echo=FALSE, height=4, width=7>>=

# widen the bottom margin so the vertical labels fit
par(mar=c(10, 4, 4, 2) + 0.1)

# proportion endorsing each of the dhelp1.01 to dhelp1.10 items, nationally
temp1=apply(maindata[which(names(maindata)=="dhelp1.01"):
                     which(names(maindata)=="dhelp1.10")], 2,
            function(m) prop.table(table(m))[2])

# the same, for the East Midlands only
temp2=apply(subset(maindata, region=="East Midlands")[
              which(names(maindata)=="dhelp1.01"):
              which(names(maindata)=="dhelp1.10")], 2,
            function(m) prop.table(table(m))[2])

# convert to percentages, one row per group
graphmat=rbind(temp1, temp2)*100

colnames(graphmat)=c("Personal care", "Physical help",
                     "Services/ benefits", "Paperwork/ financial",
                     "Other practical", "Company", "Taking out",
                     "Giving medicines", "Keeping an eye", "Other")

# grouped bar chart with percentage labels over each bar
barx=barplot(graphmat, beside=TRUE, ylim=c(0,100), col=c("red", "green"), las=3)

text(barx[1,], graphmat[1,]+5, paste(round(graphmat[1,]), "%"), cex=.5)
text(barx[2,], graphmat[2,]+5, paste(round(graphmat[2,]), "%"), cex=.5)

legend("topright", c("National", "East Midlands"), fill=c("red", "green"))

@

All I need to do is update the names of the variables in the which(names(maindata)=="x") argument and then put in some labels to be drawn on the x-axis. I have about 10 of these graphs to make, and just thinking about doing it all by hand makes me feel tired. If I do decide to change some of the axis labels, or add to or take away from the variables in the graph, it’s just a question of quickly editing the code and then recompiling to a pdf. Magic.
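In fact, since only the variable range and the labels change from graph to graph, the whole chunk could be wrapped in a function. A minimal sketch (the function name and arguments are mine, purely illustrative):

# hypothetical helper: plot national vs. regional endorsement rates
# for a run of adjacent columns in the data frame
plotcomparison=function(df, first, last, labels, regionname="East Midlands") {
  cols=which(names(df)==first):which(names(df)==last)
  national=apply(df[cols], 2, function(m) prop.table(table(m))[2])
  regional=apply(df[df$region==regionname, cols], 2,
                 function(m) prop.table(table(m))[2])
  graphmat=rbind(national, regional)*100
  colnames(graphmat)=labels
  barx=barplot(graphmat, beside=TRUE, ylim=c(0,100),
               col=c("red", "green"), las=3)
  text(barx[1,], graphmat[1,]+5, paste(round(graphmat[1,]), "%"), cex=.5)
  text(barx[2,], graphmat[2,]+5, paste(round(graphmat[2,]), "%"), cex=.5)
  legend("topright", c("National", regionname), fill=c("red", "green"))
}

# then each graph is a single call, e.g.
plotcomparison(maindata, "dhelp1.01", "dhelp1.10",
               c("Personal care", "Physical help", "Services/ benefits",
                 "Paperwork/ financial", "Other practical", "Company",
                 "Taking out", "Giving medicines", "Keeping an eye", "Other"))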

Simulating data

I try to publish as much of my work as I can on the internet, where the people who own the data agree (my little bit of the website is here) but very often I can’t publish the data because of issues with ongoing projects, other publications, various levels of confidentiality of data, and so on.

So I’ve decided to try to simulate the data whenever there is an issue with publication. It’s very easy, and it’s an excellent way for me to show people my work without any of the issues that come from showing real data. The code is below; it comes in two parts for the first set of variables, because we fiddled with the questionnaire a bit after Time 8 in the dataset.


# the columns between "Service" and "Therapist", inclusive
cols=which(names(mydatafirst)=="Service"):which(names(mydatafirst)=="Therapist")

# resample each column's non-missing values with replacement,
# separately for each version of the questionnaire (after and before Time 8)

mydatafirst[mydatafirst$Time>8, cols]=
  apply(mydatafirst[mydatafirst$Time>8, cols], 2,
        function(x) sample(x[!is.na(x)], sum(mydatafirst$Time>8), replace=TRUE))

mydatafirst[mydatafirst$Time<9, cols]=
  apply(mydatafirst[mydatafirst$Time<9, cols], 2,
        function(x) sample(x[!is.na(x)], sum(mydatafirst$Time<9), replace=TRUE))

# and the same for the Imp1 and Best1 columns, over the whole dataset
impbest=c(which(names(mydatafirst)=="Imp1"), which(names(mydatafirst)=="Best1"))

mydatafirst[impbest]=
  apply(mydatafirst[impbest], 2,
        function(x) sample(x[!is.na(x)], nrow(mydatafirst), replace=TRUE))

Thanks to the magic of R, and in particular Sweave and the brew package, all I need to do is insert these few lines into the code and re-run the report, and I have a nicely simulated dataset. I must confess I didn’t use R to convert the comments to gibberish; it was easier to download them from here, but if this website didn’t exist then I certainly could have used R to do it very easily.
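For example, simply shuffling the characters within each comment would be enough. A minimal sketch (the function is mine, purely illustrative):

# reduce each free-text comment to gibberish by shuffling its characters
scramble=function(comments) {
  sapply(comments, function(s)
    paste(sample(strsplit(s, "")[[1]]), collapse=""))
}

scramble(c("the service was excellent", "more parking please"))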

Something else that R and Sweave are really helping me with at the moment is making it possible to start analysing data and compiling reports before all the data have come in. Because Sweave automatically puts together the statistics and graphs as I go along, it frees me up to just work on the data, share progress with people as it arrives, and then put together the final analysis when all the data are collected, without having to manually rewrite all the statistics and copy and paste all the graphs at every stage. I’ll post about the usefulness of Sweave and how it helps with workflow another time.