Stupidest pie chart of the year 2011

I’ve just been looking at the Stupidest bar chart of the year 2011 and I’ve been inspired to submit my Stupidest pie chart of the year 2011. I won’t say where I found it, and I’ve redrawn the data myself to avoid annoying the original authors. Here it is (click to enlarge):

It represents the amount of maternity and paternity leave available in different countries. Pie charts are often not a very good idea, but in this case they are the worst possible idea. Pie charts should only be used when the whole pie has some meaning: a whole customer base, all the individuals living in Nottinghamshire, something like that. What does the whole of each of these pies represent? The total amount of leave summed across these particular countries, which has no real-world meaning at all, so the whole point of the pie chart is lost.

Even worse, underneath the pie chart they are forced to write “Spare a thought for parents in the USA and Sierra Leone… paid maternity leave 0 weeks, paid paternity leave 0 weeks.” because you cannot represent 0 on a pie chart! This should have set alarm bells ringing. One better way to plot these data:

This is absolutely factory-fresh, out-of-the-box settings; there are many ways to improve this plot, and other types of plot could be used (one alternative is sketched after the code below). This plot improves on the previous one by:

1. Better able to compare levels of leave in each country
2. Better able to compare levels of each type of leave
3. No need for data labels which spell out the number of weeks for each country and contribute to a very low data:ink ratio
4. Able to display zero points which puts the marginal notes about USA and Sierra Leone on the plot!

Code for both:


# two plotting regions side by side
par(mfrow=c(1, 2))

maternity=c(18, 16, 39, 17, 16, 0, 0)
paternity=c(3, 3, 14, 3, 14, 0, 0)
country=c("China", "Holland", "UK", "Greece", "France", "USA", "Sierra Leone")

pie(maternity, labels=paste(country, maternity, "weeks"), main="Maternity leave")

pie(paternity, labels=paste(country, paternity, "weeks"), main="Paternity leave")

# 2 minute bar chart

par(mar=c(10, 4, 4, 2) + 0.1)

barplot(maternity, names.arg=country, ylab="Maternity leave in weeks", las=3)

barplot(paternity, names.arg=country, ylab="Paternity leave in weeks", las=3)
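For the record, one of those other plot types could be a Cleveland dot chart. Here is a rough sketch using nothing but base R defaults and the vectors defined above (the layout choices are mine, not from the original article):

# dot chart sketch: countries on the axis, one panel of points per type of leave
leave=rbind(Maternity=maternity, Paternity=paternity)
colnames(leave)=country
dotchart(t(leave), xlab="Weeks of paid leave", pch=19)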

New Year’s resolutions

It’s that time of year again, so what do I need to keep doing or do differently this year?

1. Keep being organised with Evernote and Getting Things Done. I found recently that I was processing so many bits of paper and information that I just gave up: only the most important stuff got stored in my brain, and the rest was discarded. Having started using Evernote, guided by Getting Things Done, I now have a massive external brain in which to store everything I can’t use at the moment.

2. Produce reproducible reports. I think it’s a very important area of growth in statistical analysis and, moreover, I think it will make my work better (for a start, I’ll spend much longer commenting code and making it readable if it’s being published).

3. Learn something new. I need to learn various bits and pieces relating to interactive graphics this year (Tcl/ TK? Java? HTML5?) but I need to expand my statistical knowledge too. Likely candidates include bootstrapping and Bayesian analysis.

4. Publish, publish, publish. I’m sitting on too many datasets and pieces of work that should be published. Blogging is a good way of sharing small pieces of work and progress but it doesn’t require the care and attention which academic publication entails. I must publish more.

Complex surveys in R

I carried out what I thought was going to be a very simple piece of work the other day, slicing up some of the questions from a publicly available dataset in order to compare Nottinghamshire Healthcare NHS Trust with some other sites based on particular features of interest. I will post that work some other time; for now, for those who are interested, the source code is here.

In the meantime, I thought I would share my experience with the book which helped me through this work, which is here.

Complex surveys: A guide to analysis using R, which accompanies the survey package (or vice versa?), was a real joy for me to use. As the author explains in the book, complex survey designs are everywhere, and I confess I have been rather ignoring them in my work. There is just the right mix of code, equations, explanation, and some truly wonderful graphics in this book to equip you to deal with them.
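For anyone who hasn’t used the survey package, the basic pattern is to declare the design once and then use the svy* functions instead of their naive equivalents. A minimal sketch, with made-up variable and column names purely for illustration:

library(survey)

# declare the complex design: clusters, strata and weights (hypothetical column names)
mydesign <- svydesign(ids = ~psu, strata = ~stratum, weights = ~samplewt,
                      data = mysurvey, nest = TRUE)

# design-based estimates, with standard errors that respect the design
svymean(~satisfaction, mydesign, na.rm = TRUE)
svytable(~region, mydesign)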

So don’t be frightened of complex survey designs! Beg, borrow, or buy this book and get analysing!

Quick presentations

Just a quick one from me today, lots of deadlines colliding, which is handy because that’s what this post is about. I’ve been experimenting with the beamer class for LaTeX, with Sweave, which is another way of saying that I’ve been trying to save time producing slides for presentations directly from code as analysed by R.

The syntax is very simple. Open a document with:

\documentclass{beamer}
\usetheme{Berlin}
\title{Title goes here}
\author{Chris Beeley}
\date{\today}
\begin{document}

Slides are included with:

\begin{frame}
\section{Main findings}
… slide content goes here
\end{frame}

The main strength, of course, is the ability to include graphics straight from R, which is particularly useful in the analysis I’m doing today because I’m pulling a lot of variables from an SPSS spreadsheet and making comparisons between the national figures and those for my areas. A lot of the code is the same for each graph.

<<fig=TRUE, echo=FALSE, height=4, width=7>>=

par(mar=c(10, 4, 4, 2) + 0.1)

# proportion in the second level of each variable (e.g. "yes"),
# nationally (temp1) and for the East Midlands only (temp2)
temp1=apply(maindata[which(names(maindata)=="dhelp1.01"):
which(names(maindata)=="dhelp1.10")], 2, 
function(m) prop.table(table(m))[2])

temp2=apply(subset(maindata, region=="East Midlands")[which
(names(maindata)=="dhelp1.01"): which(names(maindata)
=="dhelp1.10")], 2, function(m) prop.table(table(m))[2])

graphmat=rbind(temp1, temp2)*100

colnames(graphmat)=c("Personal care", "Physical help",
 "Services/ benefits", "Paperwork/ financial",
 "Other practical", "Company", "Taking out", 
"Giving medicines", "Keeping an eye", "Other")

barx=barplot(graphmat, beside=TRUE, ylim=c(0,100), col=c("red", "green"), las=3)

text(barx[1,], graphmat[1,]+5, paste(round(graphmat[1,]), "%"), cex=.5)
text(barx[2,], graphmat[2,]+5, paste(round(graphmat[2,]), "%"), cex=.5)

legend("topright", c("National", "East Midlands"), fill=c("red", "green"))

@

All I need to do is update the names of the variables in the which(names(maindata)=="x") arguments and then put in some labels to be drawn on the x-axis. I have about 10 of these graphs to make; just thinking about doing it all by hand makes me feel tired. And if I do decide to change some of the axis labels, or add to or take away from the variables in the graph, it’s just a question of quickly editing the code and then recompiling to a pdf. Magic.
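For repeated graphs like this the chunk contents could also be pushed into a function, so each slide’s figure chunk becomes a one-liner. A rough sketch of what I mean (the helper name and arguments are mine, not anything from beamer or Sweave):

# hypothetical helper: national vs East Midlands proportions for a block of variables
propbars <- function(data, firstvar, lastvar, varlabels){
  cols <- which(names(data)==firstvar):which(names(data)==lastvar)
  national <- apply(data[cols], 2, function(m) prop.table(table(m))[2])
  regional <- apply(subset(data, region=="East Midlands")[cols], 2,
                    function(m) prop.table(table(m))[2])
  graphmat <- rbind(national, regional)*100
  colnames(graphmat) <- varlabels
  par(mar=c(10, 4, 4, 2) + 0.1)
  barx <- barplot(graphmat, beside=TRUE, ylim=c(0,100), col=c("red", "green"), las=3)
  text(barx[1,], graphmat[1,]+5, paste(round(graphmat[1,]), "%"), cex=.5)
  text(barx[2,], graphmat[2,]+5, paste(round(graphmat[2,]), "%"), cex=.5)
  legend("topright", c("National", "East Midlands"), fill=c("red", "green"))
}

# each figure chunk then reduces to something like:
# propbars(maindata, "dhelp1.01", "dhelp1.10", c("Personal care", "Physical help", ...))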

Simulating data

I try to publish as much of my work as I can on the internet, where the people who own the data agree (my little bit of the website is here) but very often I can’t publish the data because of issues with ongoing projects, other publications, various levels of confidentiality of data, and so on.

So I’ve decided to try to simulate the data whenever there is an issue with publication. It’s very easy and it’s an excellent way for me to be able to show people my work without any of the issues that come from showing real data. The code is below; it comes in two parts for the first set of variables because we fiddled with the questionnaire a bit after Time 8 in the dataset.


# post-Time-8 responses: replace each column with a resample of its own observed (non-missing) values
mydatafirst[mydatafirst$Time>8,
  which(names(mydatafirst)=="Service") : which(names(mydatafirst)=="Therapist")] = 
    apply(mydatafirst[mydatafirst$Time>8, which(names(mydatafirst)=="Service") : which(names(mydatafirst)=="Therapist")],
      2, function (x) sample(x[!is.na(x)], length(mydatafirst$Service[mydatafirst$Time>8]), replace=TRUE))

# the same for responses up to and including Time 8
mydatafirst[mydatafirst$Time<9,
  which(names(mydatafirst)=="Service"):which(names(mydatafirst)=="Therapist")] = 
    apply(mydatafirst[mydatafirst$Time<9,which(names(mydatafirst)=="Service") : 
      which(names(mydatafirst)=="Therapist")],
      2, function (x) sample(x[!is.na(x)], length(mydatafirst$Service[mydatafirst$Time<9]), replace=TRUE))

# and for the Imp1 and Best1 variables across all time points
mydatafirst[c(which(names(mydatafirst)=="Imp1"), which(names(mydatafirst)=="Best1"))] =
  apply(mydatafirst[c(which(names(mydatafirst)=="Imp1"), which(names(mydatafirst)=="Best1"))], 2, function (x)
    sample(x[!is.na(x)], length(mydatafirst$Imp1), replace=TRUE))

Thanks to the magic of R, and in particular the Sweave and brew packages, all I need to do is insert these lines into the code, re-run the report, and I have a nicely simulated dataset. I must confess I didn’t use R to convert the comments to gibberish; it was easier to download them from here, but if this website didn’t exist then I certainly could have used R to do this very easily.
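The trick is the same in each block, by the way: every column gets replaced by a resample, with replacement, of its own observed non-missing values. So the repeated apply() calls could be wrapped in a small helper. A sketch only, and the function name and arguments are mine:

# sketch of a reusable column scrambler: resample observed values with replacement
scramble <- function(df, rows, firstvar, lastvar){
  rows <- which(rows)   # accept a logical condition, convert to row numbers
  cols <- which(names(df)==firstvar):which(names(df)==lastvar)
  df[rows, cols] <- apply(df[rows, cols], 2,
                          function(x) sample(x[!is.na(x)], length(rows), replace=TRUE))
  df
}

# e.g. mydatafirst <- scramble(mydatafirst, mydatafirst$Time>8, "Service", "Therapist")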

Something else that R and Sweave are really helping me with at the moment is making it possible to start to analyse data and compile reports before the data comes in. Because Sweave will automatically put together the statistics and graphs for me as I go along, it frees me up to just work on the data, share the progress with people as it comes along, and then put together the final analysis when all the data is collected, without having to manually re-write all the statistics and copy and paste all the graphs all the way through. I’ll post about the usefulness of Sweave and how it helps with workflow another time.

Robin Hood marathon results

I ran the Robin Hood marathon yesterday in a decent-ish 4 hours and 13 minutes, which is my best yet. Naturally, I was curious to see how my fellow runners fared, and so I have scraped the times from a pdf and summarised them using R and ggplot2.

I ran to support the Disaster Emergency Committee, because of the East Africa Appeal, so if you would like to support this very worthy cause then please go here.

Data scraping, for those that do not know, is the process of taking human-readable files like pdfs and webpages and turning them into computer-readable files like spreadsheets (more here). The scraping itself was very simple since the pdf copy-pasted very nicely into a spreadsheet, which was then read into R as a one-variable data frame like so:

1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25

The trick was merely to split up these big long strings and separate them into the correct variables, which, reading across, are:

gun position (i.e. official position), race number, name, gender, athletics club, gun time (i.e. official time), chip position, and chip time.

Chip position and chip time are the “real” position and time for slowcoaches such as myself, since it can take up to 10 minutes for all 15,000 runners to cross the line after the gun has gone; the chip instead records the time at which you cross the start and finish lines.

Code is at the bottom for those who are interested (and I would like to acknowledge the author of this post from whom I stole the “seconds” function to convert the times into a numeric format).

All times for finishers are shown below. My own time is represented by a vertical red line, with the median time shown as a dashed black vertical line.

Next up is the difference between male and female finishers, with medians for each group given as vertical lines.

And lastly, a faceted plot showing the differences between different ages and genders. I recoded some of the age categories because they vary across genders which makes a mess of the faceting.

The code:

library(ggplot2)
library(car)

mydata=read.csv("D:\\Dropbox\\R-files\\Marathon\\Marathon_times.csv", stringsAsFactors=FALSE)

# split each row's single long string on spaces
mylist1=strsplit(mydata$Var, " ")

# find position, name and gender for all rows

mydata$Gunpos=lapply(mylist1, function(x) x[1])
mydata$Name=lapply(mylist1, function(x) x[3:4])
mydata$Gender=lapply(mylist1, function(x) x[5])

mydata$Chiptime=lapply(mylist1, function(x) x[length(x)])
mydata$Chippos=lapply(mylist1, function(x) x[length(x)-1])
mydata$Guntime=lapply(mylist1, function(x) x[length(x)-2])

# find the rows where the age category is included, i.e. 6th column is numeric

mydata$Age=NA

myvec=unlist(lapply(mylist1, function(x) as.numeric(x[6])>0))
myvec[is.na(myvec)]=FALSE

mydata$Age[myvec]=unlist(lapply(mylist1, function(x) x[6])[myvec])

mydata$Age[is.na(mydata$Age)]=18

str(mydata$Age)
mydata$Age=factor(mydata$Age)

mydata$Age2=recode(mydata$Age, "'18'='18+'; c('35', '40')='35+'; c('45', '50')='45+'; c('55', '60')='55+'; c('65', '70', '75')='65+'")

# fix the people with 3 names whose columns are misaligned

mydata$Gender[mydata$Gender!="M" & mydata$Gender!="F"]=lapply(mylist1, function(x) x[6])[mydata$Gender!="M" & mydata$Gender!="F"]

# fix the three stragglers with four names

mydata$Gender[c(331, 422, 1043)]="M"

# make gender a nicely formatted factor

mydata$Gender=factor(unlist(mydata$Gender))

# the title snuck in at row 75, delete this

mydata=data.frame(mydata[-75,])

### format the time values

# function from http://mcfromnz.wordpress.com/2011/09/07/2011-perth-city-to-surf-stats/

seconds <- function(x){
  as.numeric(substr(x,1,2)) * 60 * 60 +
  as.numeric(substr(x,4,5)) * 60 +
  as.numeric(substr(x,7,8)) 
}

mydata$Finaltime=seconds(unlist(mydata$Chiptime))/3600

### summarise

# overall
ggplot(mydata, aes(Finaltime)) + 
   labs(colour = "Gender") + 
   geom_density(size=1)+
   xlim(2, 7.5)+ 
   geom_vline(xintercept = 4+13/60, col="red", lty=1) +
   geom_vline(xintercept = median(mydata$Finaltime), col="black", lty=2)
   

# by gender
ggplot(mydata, aes(Finaltime, colour = droplevels(Gender))) + 
   labs(colour = "Gender") + 
   geom_density(size=1) +
   xlim(2, 7.5)+
   geom_vline(xintercept = median(subset(mydata, Gender=="F")$Finaltime), col="red", lty=1) +
   geom_vline(xintercept = median(subset(mydata, Gender=="M")$Finaltime), col="blue", lty=1)

# by age category and gender
ggplot(mydata, aes(Finaltime, colour = droplevels(Gender))) + 
   labs(colour = "Gender") + 
   geom_density(size=1) + facet_wrap(~Age2)+
   xlim(2, 7.5)

Misleading means and medians

Over at this excellent blog there is an interesting discussion about times when means and medians can be deceptive, particularly where two variables with equal means have very different distributions. I chimed in myself and mentioned some of the examples which I come across in my work. Here is a particularly egregious example: measurement of self-esteem in patients on psychiatric wards in England and Belgium.

England: mean 3.4, median 3.3, sd 1.2
Belgium: mean 3.2, median 3.2, sd 0.9

Looks pretty similar on the face of it. Let’s have a look at the actual distributions (click to enlarge).

Pretty different. It’s quite interesting to consider why the two are so different. It would appear that the measure works better in Belgium, producing a nice normal distribution, and not so well in England, where many individuals are selecting the maximum response across all the items in the scale.

Too often, I think, we talk about non-normal distributions in terms of their median, when as you can see here, many sins can be hidden in this way. I don’t know why the self-esteem measure is behaving like this in England, but we haven’t finished with these data so look out for more on the blog as we have a more thorough look.

R code:

mygraph=function(x){
  
par(mfrow=c(1,2))  

a=subset(mydata, country==1)
b=subset(mydata, country==2)

print("England mean")
print(mean(a[,x], na.rm=TRUE))
print("Belgium mean")
print(mean(b[,x], na.rm=TRUE))

print("England median")
print(median(a[,x], na.rm=TRUE))
print("Belgium median")
print(median(b[,x], na.rm=TRUE))

print("England sd")
print(sd(a[,x], na.rm=TRUE))
print("Belgium sd")
print(sd(b[,x], na.rm=TRUE))

hist(a[,x], main="England", xlab="Self esteem score", breaks=seq(0,5,by=.2), freq=FALSE)
lines(density(a[,x], na.rm=TRUE))
hist(b[,x], main="Belgium", xlab="Self esteem score", breaks=seq(0,5,by=.2), freq=FALSE)
lines(density(b[,x], na.rm=TRUE))
}

mygraph("Selfesteem")

useR day four and beginner’s odfWeave

I’m a bit late with the report from the last day of the useR conference; I was very busy Thursday getting home and catching up with the housework on Friday. I once again favoured sessions about graphics, and best in show must go to Easy interactive ggplots, particularly for a bit of a coding novice like myself. I’m going to take some of the ideas from this talk and use them at work; it will blow everyone’s mind (well, doing it for free will blow everyone’s mind, anyway), so a big thank you to Richie for sharing his ideas and his code.

Since I got home I’ve been manfully struggling with odfWeave again, so I’m going to give a quick guide for new starters like myself, in the hope that someone’s Google search for “odfWeave problems” or suchlike will point them here. It’s for beginners, so if you’re not a beginner then you may as well skip this bit and go for a walk or something.

1) DON’T use Windows. Even if you can get the zip/unzip bit working, which I did eventually, you’ll hit a problem when you want to use an older version of the XML library (which you have to, see below), unless you’re going to compile it yourself. If you’re going to compile it yourself, you clearly have a different definition of the word “beginner” than I do, and congratulations on being so modest.

2) DO downgrade your XML version (or, if you haven’t installed it yet, never install 3.4 in the first place). It’s dead easy to do: just go here and download the 3.2 version, which works beautifully. (Use the install.packages() command with a pointer to the downloaded file, or if you use the fabulous RStudio, which I deeply love, there is an option to install from a downloaded file behind the install packages button.)

I think you should be pretty okay after that. I still got confused by some of the code snippets, so let’s just have a quick round-up of stuff you might want to do and how to do it.

Inline code is just like Sweave, so it’s

\Sexpr{version$version.string}

Blocks of code, also, are just like Sweave. Just in case you get in a muddle, which to be honest I did:

1) Code you want to print (for presentations etc.):

<<echo=TRUE>>=
rnorm(100)
@

Code chunk finished with @

2) Code you want a graphic out of:

<<fig=TRUE, echo=FALSE>>=
hist(rnorm(100))
@

DON’T forget to encapsulate ggplot2 commands like ggplot and qplot in a print(): grid-based graphics only actually draw when the object is printed, and the automatic printing you get at the console doesn’t happen inside a code chunk (a minimal example is sketched below). Code chunk finished with @
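For example, a minimal figure chunk might look like this (just a sketch; substitute your own plotting code):

<<fig=TRUE, echo=FALSE>>=
library(ggplot2)
print(qplot(rnorm(100)))   # without print() the graphic would never be drawn
@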

3) Code you want a table out of:

<<Table1, echo=FALSE, results=xml>>=
tempmat=matrix(with(mydata, table(Ethnicity, Control)), ncol=2)
row.names(tempmat)=row.names(with(mydata, table(Ethnicity, Control)))

odfTable(tempmat, useRowNames=TRUE, colnames=c("", "Control", "Treatment"))
@

You’ll use the odfTable command, which does NOT work on table() objects; just turn them into a matrix or a data frame and they work fine. Notice how I’ve put the row and column names in as well; it does mention this in the help file, but that’s how you do it anyway. Code chunk finished with @

I think that should be enough to get you started. I wish I’d read this post about 3 months ago, because I’ve been fiddling with odfWeave on and off since then (I started on Windows, and I think that was my main mistake really; I couldn’t get the zip and unzip commands to work for ages).

Here’s the first bit of the code from a report I’m writing to give you an idea.

Age
This was examined by comparing the distribution and median of the ages of individuals in each group.

<<fig=FALSE, echo=FALSE>>=

rm(list=ls())
library(ggplot2)
library(car)

mydata=read.csv("/home/chris/Dropbox/R-files/GCAMT/TCComparison.csv")
mydata$Control=factor(mydata$Control, labels=c("Control", "Treatment"))

@

<<fig=TRUE, echo=FALSE>>=

# DOB

mydata$DOB2=as.Date(mydata$DOB, format="%Y-%m-%d", origin="1899-12-30")

print(

qplot(DOB2, data=mydata, geom="density", colour= Control, main="Median DOB marked blue for treatment, red for control", xlab="Date of birth", ylab="Density") + 
	geom_vline(xintercept = median(subset(mydata, Control=="Treatment")$DOB-25569), col="blue", lty=2) + 
	geom_vline(xintercept = median(subset(mydata, Control=="Control")$DOB-25569), col="red", lty=2) 

	)

@
Ethnicity
Ethnicity within treatment and control is shown below.

<<Table1, echo=FALSE, results=xml>>=
# "Ethnicity"

tempmat=matrix(with(mydata, table(Ethnicity, Control)), ncol=2)
row.names(tempmat)=row.names(with(mydata, table(Ethnicity, Control)))

odfTable(tempmat, useRowNames=TRUE, colnames=c("", "Control", "Treatment"))

@

useR day three

The last post was perhaps a bit over-long on extraneous detail and bits of R commands that most readers will either know backwards or have no interest in, and today’s learning sheet is over double the length of yesterday’s, so let’s really just do one thing I learned today.

One thing I learned today was about the RTextTools package, which sounds like a great way of classifying textual data. I’ve already had a go with some text stuff in a previous post, but I hadn’t really thought about the whole data mining/machine learning thing all that seriously until today. I do wonder whether this would have applications to the patient survey (900 responses a quarter; and actually, now I think of it, we have about 1500 from early in its life that were never classified). There is a person who codes the responses as they come in, and I don’t need or want to take that job off him, because clearly a person is always going to be better than a machine at classifying text, but if I could prototype it then one intriguing possibility would be raised: it would enable us to change the coding system. We set the whole thing up quite quickly and I did rather think we were stuck with the coding system forever (since we don’t want to recode 6000+ rows), but if the computer could do it reasonably well then we could change it as we saw fit. We could even have different coding systems for different parts of the Trust, all sorts of things. All interesting things to muse on, and a rough sketch of how a prototype might look is below.
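As I understand it from the package documentation (I haven’t run this on the survey data, and the column names are made up), the RTextTools workflow is roughly: build a document-term matrix, put it in a container with the human-assigned codes, train one or more algorithms, and classify:

library(RTextTools)

# comments$text: free-text responses; comments$code: the human-assigned category (hypothetical columns)
dtm <- create_matrix(comments$text, language="english",
                     removeStopwords=TRUE, stemWords=TRUE)

# assuming roughly 6000 coded responses split into training and test sets
container <- create_container(dtm, comments$code,
                              trainSize=1:4000, testSize=4001:6000, virgin=FALSE)

models <- train_models(container, algorithms=c("SVM", "MAXENT"))
results <- classify_models(container, models)

head(results)   # predicted codes and probabilities for the test set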

It just goes to show what is possible when you use R and keep up to date with the amazing community. One of the presentations suggested that people do not always appreciate the work of the R-core team (who devote many thousands of hours completely gratis to R). For my part, I would say that I certainly do appreciate their efforts, moreover I find the whole story of R quite inspiring. R really was my introduction to the world of open-source, and there are some amazingly generous figures in the whole community, the R-core team not least among them, as well as other heroes from Linux, OpenOffice etc. Not only that but a countless army of minor heroes who share their work and source code every day. I’m rather too much of a novice to make a serious contribution but I hope one day to follow the example of these minor heroes and give something back, however small.

Day 2 of useR

Well, I said I was going to blog at least one thing I learned at the conference each day; today I’ve learned about 50 million things (yesterday was tutorials, so I learned two things well; I’ve seen 20 presentations today, with posters still to come).

To be honest I think this post is going to be more use to me in remembering everything than it will be to anyone else, but I’ll try to do it properly.

I learned about the stepAIC function, which is evidently some sort of stepwise regression that uses the AIC criterion. Someone mentioned it in passing but it sounds pretty useful. I also learned about Box-Cox plots (it’s probably really embarrassing that I don’t know these; my PhD was in psychology, and I’m afraid I’ve learned a lot of the stats after the fact, since just getting the research and the thesis done was enough really in four years). They also sound pretty cool and useful.
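For my own future reference, both live in the MASS package; a toy sketch on a built-in dataset, just to fix the names in my head (the model here is made up, not anything from the talk):

library(MASS)

# stepwise selection by AIC on a toy linear model
fullmodel <- lm(mpg ~ wt + hp + disp + qsec, data=mtcars)
stepAIC(fullmodel, direction="both")

# Box-Cox plot: profile likelihood for a power transformation of the response
boxcox(fullmodel)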

I learned about mosaic plots, the structable command, and relative multiple barcharts (which, from memory, are in the extracat package). If you don’t know them, these are useful for looking at three-way and higher cross-tab type tables. They look pretty cool.
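Again as a note to myself, structable and mosaic are in the vcd package (relative multiple barcharts are separate, in extracat, which I haven’t tried). A quick sketch on a built-in table:

library(vcd)

# flat multi-way cross-tab, then the corresponding mosaic plot
structable(Class + Sex ~ Age + Survived, data = Titanic)
mosaic(Titanic, shade = TRUE)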

I learned about running R in the cloud in all sorts of ways; the most interesting sounded like using RGoogleDocs and Apache, and I wonder if I could get this working at work. Will investigate further.

And last but not least, I learned yet another way of automating the patient survey, this time using the Org mode of Emacs. So far I have encountered quite a bewildering array of different ways of achieving the same task:

1) Sweave
2) odfWeave
3) Sweave and brew (this is what I have used so far)
4) Org mode
5) R2wd

Each has various advantages, disadvantages, OS issues, etc. etc. etc. The code I’ve written is hugely long and complex and just generally awful. I’ll put a link to it here so I can confess my sins. I wrote it in one quarterly reporting cycle in a “make it work quickly” type way and it will definitely be more reliable and easier to develop if I re-write it properly this quarter.

The commenting of the code is also awful; I’m not going to go through and comment loads of dreadful code just for the blog, but I will sort that out in V 2.0 as well. Note also that they wanted an editable document, so I wrote the whole thing in Sweave, then converted the pdf to Word, which looked dreadful, and then rapidly re-wrote the whole thing to play nicely with LaTeX2rtf, which was incredibly useful.

The whole thing was brew-ed, then compiled to LaTeX. The graphics were done separately, as you can see. I think I’ll bring them in next time because that seems simpler somehow.

It’s very ugly but it did work. In future posts I’ll show how I’ve improved it and perhaps redeem myself a little bit.

For my sins