Robin Hood marathon results

I ran the Robin Hood marathon yesterday in a decent-ish 4 hours and 13 minutes, which is my best yet. Naturally, I was curious to see how my fellow runners fared, and so I have scraped the times from a pdf and summarised them using R and ggplot2.

I ran in support of the Disasters Emergency Committee's East Africa Appeal, so if you would like to support this very worthy cause then please go here.

Data scraping, for those who do not know, is the process of taking human-readable files like pdfs and webpages and turning them into computer-readable files like spreadsheets (more here). The scraping itself was very simple, since the pdf copy-pasted very nicely into a spreadsheet, which was then read into R as a data frame with a single character variable, like so:

1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25

The trick was merely to split up these long strings and separate them into the correct variables, which, reading across, are:

gun position (i.e. official position), race number, name, gender, athletics club, gun time (i.e. official time), chip position and chip time.

Chip position and chip time are the “real” results for slowcoaches such as myself, since it can take up to 10 minutes for all 15,000 runners to cross the line after the gun has gone; the chip therefore records your time from the moment you cross the start line to the moment you cross the finish line.
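To illustrate the splitting, here is a minimal sketch using the first row shown above:

# one row as pasted from the pdf (illustrative only)
row1 = "1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40"
fields = strsplit(row1, " ")[[1]]

fields[1]                          # gun position: "1"
paste(fields[3:4], collapse=" ")   # name: "Carl Allwood"
fields[5]                          # gender: "M"
fields[length(fields)]             # chip time: "02:38:40"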

Code is at the bottom for those who are interested (and I would like to acknowledge the author of this post, from whom I stole the “seconds” function to convert the times into a numeric format).

All finishers' times are shown below. My own time is marked by a solid red vertical line, and the median time by a dashed black vertical line.

Next up is the difference between male and female finishers, with medians for each group given as vertical lines.

And lastly, a faceted plot showing the differences across age categories and genders. I recoded some of the age categories because they vary between genders, which makes a mess of the faceting.

The code:

library(ggplot2)
library(car)

# each pasted row sits in a single character column called Var
mydata=read.csv("D:\\Dropbox\\R-files\\Marathon\\Marathon_times.csv", stringsAsFactors=FALSE)

# split each row into its space-separated fields
mylist1=strsplit(mydata$Var, " ")

# find position, name and gender for all rows

mydata$Gunpos=lapply(mylist1, function(x) x[1])
mydata$Name=lapply(mylist1, function(x) x[3:4])
mydata$Gender=lapply(mylist1, function(x) x[5])

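# times and chip position are indexed from the end of the row, since club names vary in length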
mydata$Chiptime=lapply(mylist1, function(x) x[length(x)])
mydata$Chippos=lapply(mylist1, function(x) x[length(x)-1])
mydata$Guntime=lapply(mylist1, function(x) x[length(x)-2])

# find the rows where the age category is included, i.e. 6th column is numeric

mydata$Age=NA

myvec=unlist(lapply(mylist1, function(x) as.numeric(x[6])>0))
myvec[is.na(myvec)]=FALSE

mydata$Age[myvec]=unlist(lapply(mylist1, function(x) x[6])[myvec])

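# runners with no listed age category default to the 18+ group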
mydata$Age[is.na(mydata$Age)]=18

str(mydata$Age)
mydata$Age=factor(mydata$Age)

mydata$Age2=recode(mydata$Age, "'18'='18+'; c('35','40')='35+'; c('45','50')='45+'; c('55','60')='55+'; c('65','70','75')='65+'")

# fix the people with 3 names whose columns are misaligned

mydata$Gender[mydata$Gender!="M" & mydata$Gender!="F"]=lapply(mylist1, function(x) x[6])[mydata$Gender!="M" & mydata$Gender!="F"]

# fix the three stragglers with four names

mydata$Gender[c(331, 422, 1043)]="M"

# make gender a nicely formatted factor

mydata$Gender=factor(unlist(mydata$Gender))

# the title snuck in at row 75, delete this

mydata=data.frame(mydata[-75,])

### format the time values

# function from http://mcfromnz.wordpress.com/2011/09/07/2011-perth-city-to-surf-stats/

seconds <- function(x){
  as.numeric(substr(x,1,2)) * 60 * 60 +
  as.numeric(substr(x,4,5)) * 60 +
  as.numeric(substr(x,7,8)) 
}
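# e.g. seconds("02:38:40") gives 2*3600 + 38*60 + 40 = 9520 seconds;
# dividing by 3600 below converts chip times to hours for plotting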

mydata$Finaltime=seconds(unlist(mydata$Chiptime))/3600

### summarise

# overall
ggplot(mydata, aes(Finaltime)) + 
   labs(colour = "Gender") + 
   geom_density(size=1)+
   xlim(2, 7.5)+ 
   geom_vline(xintercept = 4+13/60, col="red", lty=1) +
   geom_vline(xintercept = median(mydata$Finaltime), col="black", lty=2)
   

# by gender
ggplot(mydata, aes(Finaltime, colour = droplevels(Gender))) + 
   labs(colour = "Gender") + 
   geom_density(size=1) +
   xlim(2, 7.5)+
   geom_vline(xintercept = median(subset(mydata, Gender=="F")$Finaltime), col="red", lty=1) +
   geom_vline(xintercept = median(subset(mydata, Gender=="M")$Finaltime), col="blue", lty=1)

# by age category and gender
ggplot(mydata, aes(Finaltime, colour = droplevels(Gender))) + 
   labs(colour = "Gender") + 
   geom_density(size=1) + facet_wrap(~Age2)+
   xlim(2, 7.5)

Misleading means and medians

Over at this excellent blog there is an interesting discussion about times when means and medians can be deceptive, particularly where two variables with equal means have very different distributions. I chimed in myself and mentioned some of the examples I come across in my work. Here is a particularly egregious one: measurements of self-esteem in patients on psychiatric wards in England and Belgium.

          England   Belgium
Mean      3.4       3.2
Median    3.3       3.2
SD        1.2       0.9

Looks pretty similar on the face of it. Let's have a look at the actual distributions.

Pretty different. Quite interesting to consider why the two are so different. It would appear on the face of it that the measure works better in Belgium, producing a nice normal distribution, and not so well in England, where many individuals are selecting the maximum response across all the items in the scale.

Too often, I think, we talk about non-normal distributions in terms of their medians, when, as you can see here, many sins can be hidden that way. I don't know why the self-esteem measure is behaving like this in England, but we haven't finished with these data, so look out for more on the blog as we take a more thorough look.
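To see how this can happen, here is a small simulated sketch (made-up numbers, not the real ward data): a roughly normal sample and a sample with a heavy ceiling effect can come out with much the same mean and median.

# simulated illustration only, not the real ward data
set.seed(1)
normalish = rnorm(1000, mean=3.2, sd=0.9)               # roughly normal scores
ceilinged = c(rnorm(700, mean=2.4, sd=1), rep(5, 300))  # a chunk of scores stuck at the scale maximum

round(c(mean(normalish), mean(ceilinged)), 1)      # means come out much the same
round(c(median(normalish), median(ceilinged)), 1)  # medians are similar too

par(mfrow=c(1,2))
hist(normalish, freq=FALSE, main="Roughly normal", xlab="Score")
hist(ceilinged, freq=FALSE, main="Ceiling effect", xlab="Score")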

R code:

# print the summary statistics and draw the histograms for England (country 1)
# and Belgium (country 2) for a given variable
mygraph=function(x){

par(mfrow=c(1,2))

a=subset(mydata, country==1)
b=subset(mydata, country==2)

print("England mean")
print(mean(a[,x], na.rm=TRUE))
print("Belgium mean")
print(mean(b[,x], na.rm=TRUE))

print("England median")
print(median(a[,x], na.rm=TRUE))
print("Belgium median")
print(median(b[,x], na.rm=TRUE))

print("England sd")
print(sd(a[,x], na.rm=TRUE))
print("Belgium sd")
print(sd(b[,x], na.rm=TRUE))

hist(a[,x], main="England", xlab="Self esteem score", breaks=seq(0,5,by=.2), freq=FALSE)
lines(density(a[,x], na.rm=TRUE))
hist(b[,x], main="Belgium", xlab="Self esteem score", breaks=seq(0,5,by=.2), freq=FALSE)
lines(density(b[,x], na.rm=TRUE))
}

mygraph("Selfesteem")