Simulating data

I try to publish as much of my work as I can on the internet, where the people who own the data agree (my little bit of the website is here) but very often I can’t publish the data because of issues with ongoing projects, other publications, various levels of confidentiality of data, and so on.

So I’ve decided to try to simulate the data whenever there is an issue with publication. It’s very easy and it’s an excellent way for me to be able to show people my work without any of the issues that come from showing real data. The code is below. It comes in two parts for the first set of variables because we fiddled with the questionnaire a bit after Time 8 in the dataset.

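# Time 9 onwards: replace each column from Service to Therapist with a
# resample (with replacement) of its own non-missing values
mydatafirst[mydatafirst$Time>8,
  which(names(mydatafirst)=="Service") : which(names(mydatafirst)=="Therapist")] = 
    apply(mydatafirst[mydatafirst$Time>8, which(names(mydatafirst)=="Service") : which(names(mydatafirst)=="Therapist")],
      2, function (x) sample(x[!is.na(x)], length(mydatafirst$Service[mydatafirst$Time>8]), replace=TRUE))

# Time 1 to 8: the same again for the earlier version of the questionnaire
mydatafirst[mydatafirst$Time<9,
  which(names(mydatafirst)=="Service") : which(names(mydatafirst)=="Therapist")] = 
    apply(mydatafirst[mydatafirst$Time<9, which(names(mydatafirst)=="Service") : which(names(mydatafirst)=="Therapist")],
      2, function (x) sample(x[!is.na(x)], length(mydatafirst$Service[mydatafirst$Time<9]), replace=TRUE))

# the two free-text comment columns, resampled across all time points
mydatafirst[c(which(names(mydatafirst)=="Imp1"), which(names(mydatafirst)=="Best1"))] =
  apply(mydatafirst[c(which(names(mydatafirst)=="Imp1"), which(names(mydatafirst)=="Best1"))], 2, function (x)
    sample(x[!is.na(x)], length(mydatafirst$Imp1), replace=TRUE))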

Thanks to the magic of R, and in particular the Sweave and Brew packages, all I need to do is insert these lines into the code, re-run the report, and I have a nicely simulated dataset. I must confess I didn’t use R to convert the comments to gibberish; it was easier to download them from here. If this website didn’t exist, though, I certainly could have used R to do this very easily.
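A minimal sketch of the same resampling idea on a toy data frame may make it easier to see what is going on. The data, the column values and the helper name resample_column are all made up for illustration; only the technique (draw each column, with replacement, from its own non-missing values) comes from the code above.

```r
# Toy data standing in for the real survey (all names and values are made up)
set.seed(1)
toy <- data.frame(
  Time    = rep(1:4, each = 5),
  Service = c(1, 2, NA, 4, 5, 3, 3, NA, 2, 1, 5, 4, 4, 2, NA, 1, 2, 3, 4, 5),
  Imp1    = sample(c("a", "b", NA), 20, replace = TRUE),
  stringsAsFactors = FALSE
)

# Replace every value in a column with a draw (with replacement)
# from that column's observed, non-missing values
resample_column <- function(x) sample(x[!is.na(x)], length(x), replace = TRUE)

simulated <- toy
cols <- c("Service", "Imp1")
simulated[cols] <- lapply(toy[cols], resample_column)
```

Two things follow from this approach: the simulated columns no longer contain missing values, and any relationship between columns is broken, because each column is shuffled independently. Using lapply() rather than apply() here keeps numeric columns numeric instead of passing everything through a character matrix.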

Something else that R and Sweave are really helping me with at the moment is making it possible to start to analyse data and compile reports before the data comes in. Because Sweave will automatically put together the statistics and graphs for me as I go along, it frees me up to just work on the data, share the progress with people as it comes along, and then put together the final analysis when all the data is collected, without having to manually re-write all the statistics and copy and paste all the graphs all the way through. I’ll post about the usefulness of Sweave and how it helps with workflow another time.

Robin Hood marathon results

I ran the Robin Hood marathon yesterday in a decent-ish 4 hours and 13 minutes, which is my best yet. Naturally, I was curious to see how my fellow runners fared, and so I have scraped the times from a pdf and summarised them using R and ggplot2.

I ran to support the Disaster Emergency Committee, because of the East Africa Appeal, so if you would like to support this very worthy cause then please go here.

Data scraping, for those that do not know, is the process of taking human-readable files like pdfs and webpages and turning them into computer-readable files like spreadsheets (more here). The scraping itself was very simple since the pdf copy-pasted very nicely into a spreadsheet, which then read into R as a one variable list like so:

1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25

The trick was merely to split up these big long strings and separate them into the correct variables, which, reading across, are:

Gun position (i.e. official position), race number, Name, Gender, Athletics club, Gun time (i.e. official time), Chip position and Chip time.

Chip position and chip time are the “real” time for slowcoaches such as myself, since it can take up to 10 minutes for all 15,000 runners to cross the line after the gun has gone- a chip therefore reads the time as you cross the start and finish line.

Code is at the bottom for those who are interested (and I would like to acknowledge the author of this post from whom I stole the “seconds” function to convert the times into a numeric format).
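As a rough sketch of the splitting step, here is one way to parse the five sample rows shown above by working inwards from both ends of each line. The helper names to_seconds and parse_row are made up for this example, and it assumes the first standalone M or F token is the gender; it does not handle the age-category column that some rows in the full file carry.

```r
# The five sample rows from the post
results <- c(
  "1 10038 Carl Allwood M Sutton & Ashfield Harriers 02:38:40 1 02:38:40",
  "2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25",
  "3 13007 Pumlani Bangani M 02:43:23 3 02:43:23",
  "4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39",
  "5 10187 Peter Stockdale M 02:45:26 5 02:45:25"
)

# Convert "HH:MM:SS" to seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  sapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

parse_row <- function(line) {
  tok <- strsplit(line, " ", fixed = TRUE)[[1]]
  n <- length(tok)
  g <- which(tok %in% c("M", "F"))[1]   # first standalone M/F token = gender
  data.frame(
    Gunpos   = as.integer(tok[1]),
    Number   = as.integer(tok[2]),
    Name     = paste(tok[3:(g - 1)], collapse = " "),
    Gender   = tok[g],
    # anything between gender and the last three tokens is the club, if present
    Club     = if (g + 1 <= n - 3) paste(tok[(g + 1):(n - 3)], collapse = " ") else NA_character_,
    Guntime  = to_seconds(tok[n - 2]),
    Chippos  = as.integer(tok[n - 1]),
    Chiptime = to_seconds(tok[n]),
    stringsAsFactors = FALSE
  )
}

parsed <- do.call(rbind, lapply(results, parse_row))
```

Because the last three tokens are always gun time, chip position and chip time, counting from the end of the line sidesteps the problem of names and clubs having a variable number of words.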

All times for finishers are shown below. My own time is represented by a vertical red line, with the median time being a black and dashed vertical line.

Next up is the difference between male and female finishers, with medians for each group given as vertical lines.

And lastly, a faceted plot showing the differences between different ages and genders. I recoded some of the age categories because they vary across genders which makes a mess of the faceting.

The code:


library(car)     # recode()
library(ggplot2) # plots

mydata=read.csv("D:/Dropbox/R-files/Marathon/Marathon_times.csv", stringsAsFactors=FALSE)

# split each row into its space-separated pieces
mylist1=strsplit(mydata$Var, " ")

# find position, name and gender for all rows

mydata$Gunpos=lapply(mylist1, function(x) x[1])
mydata$Name=lapply(mylist1, function(x) x[3:4])
mydata$Gender=lapply(mylist1, function(x) x[5])

mydata$Chiptime=lapply(mylist1, function(x) x[length(x)])
mydata$Chippos=lapply(mylist1, function(x) x[length(x)-1])
mydata$Guntime=lapply(mylist1, function(x) x[length(x)-2])

# find the rows where the age category is included, i.e. 6th column is numeric


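# TRUE where the 6th piece is a number; non-numeric pieces give FALSE rather than NA
myvec=unlist(lapply(mylist1, function(x) !is.na(suppressWarnings(as.numeric(x[6])))))

mydata$Age=NA
mydata$Age[myvec]=unlist(lapply(mylist1, function(x) x[6])[myvec])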



mydata$Age2=recode(mydata$Age, "'18'='18+'; c('35', '40')='35+'; c('45', '50')='45+'; c('55', '60')='55+'; c('65', '70', '75')='65+'")

# fix the people with 3 names whose columns are misaligned

mydata$Gender[mydata$Gender!="M" & mydata$Gender!="F"]=lapply(mylist1, function(x) x[6])[mydata$Gender!="M" & mydata$Gender!="F"]

# fix the three stragglers with four names

mydata$Gender[c(331, 422, 1043)]="M"

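# make gender a nicely formatted factor
# (the original line did not survive in the post; this is a reconstruction)

mydata$Gender=factor(unlist(mydata$Gender))

# the title snuck in at row 75, delete this
# (again a reconstruction of a missing line)

mydata=mydata[-75,]
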

### format the time values

# function from

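seconds <- function(x){
  as.numeric(substr(x,1,2)) * 60 * 60 +
  as.numeric(substr(x,4,5)) * 60 +
  as.numeric(substr(x,7,8))
}

# finishing time in hours, as used in the plots below
# (assumption: based on chip time; the original line is missing from the post)
mydata$Finaltime=seconds(unlist(mydata$Chiptime))/3600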


### summarise

# overall
ggplot(mydata, aes(Finaltime)) + 
   labs(colour = "Gender") + 
   geom_density(size=1) +
   xlim(2, 7.5)+ 
   geom_vline(xintercept = 4+13/60, col="red", lty=1) +
   geom_vline(xintercept = median(mydata$Finaltime), col="black", lty=2)

# by gender
ggplot(mydata, aes(Finaltime, colour = droplevels(Gender))) + 
   labs(colour = "Gender") + 
   geom_density(size=1) +
   xlim(2, 7.5)+
   geom_vline(xintercept = median(subset(mydata, Gender=="F")$Finaltime), col="red", lty=1) +
   geom_vline(xintercept = median(subset(mydata, Gender=="M")$Finaltime), col="blue", lty=1)

# by age category and gender
ggplot(mydata, aes(Finaltime, colour = droplevels(Gender))) + 
   labs(colour = "Gender") + 
   geom_density(size=1) + facet_wrap(~Age2)+
   xlim(2, 7.5)
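
The age-band recoding in the code above relies on recode() from the car package. If you would rather not load car, a named lookup vector in base R does the same job. This is only a sketch: the age vector below is illustrative rather than taken from the results file, and the names age, bands and age2 are made up for the example.

```r
# Age categories as they might appear in the results (illustrative values)
age <- c("18", "35", "40", "45", "50", "55", "60", "65", "70", "75", NA)

# Named lookup vector: names are the original codes, values the merged bands
bands <- c("18" = "18+",
           "35" = "35+", "40" = "35+",
           "45" = "45+", "50" = "45+",
           "55" = "55+", "60" = "55+",
           "65" = "65+", "70" = "65+", "75" = "65+")

age2 <- unname(bands[age])   # codes not in the lookup, and NA, come back as NA
```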

Misleading means and medians

Over at this excellent blog there is an interesting discussion about times when means and medians can be deceptive, particularly in the case where two variables with equal means have very different distributions. I chimed in myself and mentioned some of the examples which I come across in my work. Here is a particularly egregious example, measurement of self-esteem in patients on psychiatric wards in England and Belgium-

England mean
Belgium mean
England median
Belgium median
England sd
Belgium sd

Looks pretty similar on the face of it. Let’s have a look at the actual distribution (click to enlarge).

Pretty different. Quite interesting to consider why the two are so different. It would appear on the face of it that the measure works better in Belgium, producing a nice normal distribution, and not so well in England, where many individuals are selecting the maximum response across all the items in the scale.

Too often, I think, we talk about non-normal distributions in terms of their median, when as you can see here, many sins can be hidden in this way. I don’t know why the self-esteem measure is behaving like this in England, but we haven’t finished with these data so look out for more on the blog as we have a more thorough look.
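
To see how a mean and a median can agree while the shapes differ, here is a tiny sketch with two made-up sets of scores on a 1 to 5 scale. These are not the survey data, and the names bell, piled and summarise are invented for the example. Only the mean and median are matched here; the spread is deliberately left different.

```r
# Two made-up sets of scores on a 1-5 scale (not the survey data)
bell  <- c(2, 2.5, 3, 3, 3, 3, 3, 3.5, 4)   # clustered around the middle
piled <- c(1, 1, 1, 1, 3, 5, 5, 5, 5)       # piled up at the two extremes

# Summary statistics plus the share of responses at the scale maximum
summarise <- function(x) c(mean = mean(x), median = median(x), sd = sd(x),
                           at_max = mean(x == 5))

rbind(bell = summarise(bell), piled = summarise(piled))
```

The at_max column is the kind of check that exposes a ceiling effect which the mean and median alone would hide.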

R code:


a=subset(mydata, country==1)
b=subset(mydata, country==2)

print("England mean")
print(mean(a[,x], na.rm=TRUE))
print("Belgium mean")
print(mean(b[,x], na.rm=TRUE))

print("England median")
print(median(a[,x], na.rm=TRUE))
print("Belgium median")
print(median(b[,x], na.rm=TRUE))

print("England sd")
print(sd(a[,x], na.rm=TRUE))
print("Belgium sd")
print(sd(b[,x], na.rm=TRUE))

hist(a[,x], main="England", xlab="Self esteem score", breaks=seq(0,5,by=.2), freq=FALSE)
lines(density(a[,x], na.rm=TRUE))
hist(b[,x], main="Belgium", xlab="Self esteem score", breaks=seq(0,5,by=.2), freq=FALSE)
lines(density(b[,x], na.rm=TRUE))


useR day four and beginner’s odfWeave

I’m a bit late with the report from the last day of the useR conference; I was very busy on Thursday getting home and catching up with the housework on Friday. I once again favoured sessions about graphics, and best in show must go to Easy interactive ggplots, particularly for a bit of a coding novice like myself. I’m going to take some of the ideas from this talk and use them at work. It will blow everyone’s mind (well, doing it for free will blow everyone’s mind, anyway), so a big thank you to Richie for sharing his ideas and his code.

Since I got home I’ve been manfully struggling with odfWeave again, so I’m going to give a quick guide for new starters like myself in the hope that someone’s Google search for “odfWeave problems” or the suchlike will point them here. It’s for beginners, so if you’re not a beginner then you may as well skip this bit and go for a walk or something.

1) DON’T use Windows. Even if you can get the zip/unzip bit working, which I did eventually, you’ll hit a problem when you want to use an older version of the XML library (which you have to, see below), unless you’re going to compile it yourself. If you’re going to compile it yourself, you clearly have a different definition of the word “beginner” than I do, and congratulations on being so modest.

2) DO downgrade your XML version (or if you haven’t installed it yet, never install 3.4 in the first place). It’s dead easy to do, just go here and download the 3.2 version, which works beautifully. (use the install.packages() command with a pointer to the file, or if you use the fabulous RStudio, which I deeply love, then they have an option to use a downloaded file after their install packages button).

I think you should be pretty okay after that. I still got confused by some of the code snippets, so let’s just have a quick round-up of stuff you might want to do and how to do it.

Inline code is just like Sweave, so it’s \Sexpr{...} with an R expression between the braces.


Blocks of code, also, are just like Sweave. Just in case you get in a muddle, which to be honest I did:

1) Code you want to print (for presentations etc.):


Code chunk finished with @

2) Code you want a graphic out of:

<<fig=TRUE, echo=FALSE>>=

DON’T forget to encapsulate ggplot2 commands like ggplot and qplot in a print() because of some technical reason that essentially means “that’s just how you do it”. Code chunk finished with @

3) Code you want a table out of:

tempmat=matrix(with(mydata, table(Ethnicity, Control)), ncol=2)
row.names(tempmat)=row.names(with(mydata, table(Ethnicity, Control)))

odfTable(tempmat, useRowNames=TRUE, colnames=c("", "Control", "Treatment"))

You’ll use the odfTable command, which does NOT work on table() objects, just turn them into a matrix or a dataframe and they work fine. Notice how I’ve put the row and column names in as well, it does mention it in the help file, but that’s how you do it anyway. Code chunk finished with @
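
The table-to-matrix step can be tried on its own in base R, without odfWeave installed. This is a sketch on made-up data that reuses the column names from the post; whether odfTable accepts the results is not checked here. unclass() and as.data.frame.matrix() are base R, and the object names tab, tabmat and tabdf are invented for the example.

```r
# Made-up data with the same column names as in the post
mydata <- data.frame(
  Ethnicity = c("A", "B", "A", "C", "B", "A"),
  Control   = factor(c(0, 1, 1, 0, 0, 1), labels = c("Control", "Treatment"))
)

tab <- with(mydata, table(Ethnicity, Control))

# unclass() drops the "table" class and leaves a plain integer matrix,
# keeping the row and column names
tabmat <- unclass(tab)

# as.data.frame.matrix() gives a data frame with the same wide layout
tabdf <- as.data.frame.matrix(tab)
```

Either route keeps the row names, which saves the separate row.names() assignment used in the chunk above.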

I think that should be enough to get you started. I wish I’d read this post about 3 months ago, because I’ve been fiddling with odfWeave on and off since then (I started on Windows, I think that was my main mistake really, couldn’t get the zip and unzip commands to work for ages).

Here’s the first bit of the code from a report I’m writing to give you an idea.

This was compared by comparing the distribution and median of the age of individuals in each group.

<<fig=FALSE, echo=FALSE>>=


mydata$Control=factor(mydata$Control, labels=c("Control", "Treatment"))


<<fig=TRUE, echo=FALSE>>=


mydata$DOB2=as.Date(mydata$DOB, format="%Y-%m-%d", origin="1899-12-30")


qplot(DOB2, data=mydata, geom="density", colour= Control, main="Median DOB marked blue for treatment, red for control", xlab="Date of birth", ylab="Density") + 
	geom_vline(xintercept = median(subset(mydata, Control=="Treatment")$DOB-25569), col="blue", lty=2) + 
	geom_vline(xintercept = median(subset(mydata, Control=="Control")$DOB-25569), col="red", lty=2) 


Ethnicity within treatment and control is shown below.

# "Ethnicity"

tempmat=matrix(with(mydata, table(Ethnicity, Control)), ncol=2)
row.names(tempmat)=row.names(with(mydata, table(Ethnicity, Control)))

odfTable(tempmat, useRowNames=TRUE, colnames=c("", "Control", "Treatment"))


useR day three

The last post was perhaps a bit over-long on extraneous detail and bits of R commands that most readers will either know backwards or have no interest in, and today’s learning sheet is over double the length of yesterday’s, so let’s really just do one thing I learned today.

One thing I learned today was about the RTextTools package which sounds like a great way of classifying textual data. I’ve already had a go with some text stuff in a previous post but I hadn’t really thought about the whole data mining/ machine learning thing all that seriously until today. I do wonder whether this would have applications to the patient survey (900 responses a quarter, and actually now I think of it we have about 1500 from early in its life that were never classified). There is a person who codes the responses as they come in, and I don’t need or want to take that job off him, because clearly a person is always going to be better than a machine at classifying text, but if I could prototype it then one intriguing possibility would be raised. It would enable us to change the coding system. We set the whole thing up quite quickly and I did rather think we were stuck with the coding system forever (since we don’t want to recode 6000+ rows) but if the computer can do it reasonably well then we could change it as we saw fit. We could even have different coding systems for different parts of the Trust, all sorts of things. All interesting things to muse on.

It just goes to show what is possible when you use R and keep up to date with the amazing community. One of the presentations suggested that people do not always appreciate the work of the R-core team (who devote many thousands of hours completely gratis to R). For my part, I would say that I certainly do appreciate their efforts, moreover I find the whole story of R quite inspiring. R really was my introduction to the world of open-source, and there are some amazingly generous figures in the whole community, the R-core team not least among them, as well as other heroes from Linux, OpenOffice etc. Not only that but a countless army of minor heroes who share their work and source code every day. I’m rather too much of a novice to make a serious contribution but I hope one day to follow the example of these minor heroes and give something back, however small.

Day 2 of useR

Well, I said I was going to blog at least one thing I learned at the conference each day, today I’ve learned about 50 million things (yesterday was tutorials so I learned two things well; seen 20 presentations today, posters later).

To be honest I think this post is going to be more use to me in remembering everything than it will be to anyone else, but I’ll try to do it properly.

I learned about the stepAIC function (in the MASS package), which is evidently some sort of stepwise regression which uses the AIC criterion. Someone mentioned it in passing but it sounds pretty useful. I also learned about Box-Cox plots (it’s probably really embarrassing that I don’t know this; my PhD was in psychology and I’m afraid I’ve learned a lot about stats after the fact, just getting the research and the thesis done was enough really in four years); they also sound pretty cool and useful.
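
For anyone else meeting these for the first time, here is a small sketch of both on the built-in mtcars data. The choice of dataset and of the predictors wt and hp is arbitrary and only for illustration; stepAIC() and boxcox() are both in the MASS package, which ships with R.

```r
library(MASS)   # provides stepAIC() and boxcox()

# Start from a model with every predictor in the built-in mtcars data
full <- lm(mpg ~ ., data = mtcars)

# Stepwise selection by AIC; trace = FALSE suppresses the step-by-step log
chosen <- stepAIC(full, direction = "both", trace = FALSE)
formula(chosen)

# Box-Cox: profile log-likelihood over power transformations of the response
bc <- boxcox(lm(mpg ~ wt + hp, data = mtcars), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]
```

Setting plotit = TRUE (the default) draws the Box-Cox plot itself, with the log-likelihood on the y axis and the candidate power on the x axis.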

I learned about mosaic plots, the structable command, and relative multiple barcharts (which, from memory, are in the extracat package). If you don’t know, these are useful for looking at 3-way and higher cross-tabs type tables. They look pretty cool.
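
A quick way to get a feel for this kind of display without installing anything is base R’s ftable() and mosaicplot(). Note that this sketch swaps in those base functions rather than the structable command or the extracat package mentioned above, and uses the built-in Titanic table as example data.

```r
# Titanic is a built-in 4-way table: Class x Sex x Age x Survived
ft <- ftable(Titanic, row.vars = c("Class", "Sex"))   # flat view of the cross-tab
ft

# Mosaic plot of three of the four factors; tile area is proportional to count
mosaicplot(~ Class + Sex + Survived, data = Titanic,
           color = TRUE, main = "Survival by class and sex")
```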

I learned about running R in the cloud all sorts of ways, the most interesting sounded like using RGoogleDocs and Apache and I wonder if I could get this working at work, will investigate further.

And last but not least I learned yet another way of automating the patient survey, this time using the Org mode of Emacs. So far I have encountered quite a bewildering array of different ways of achieving the same task:

1) Sweave
2) odfWeave
3) Sweave and brew (this is what I have used so far)
4) Org mode
5) R2wd

Each has various advantages, disadvantages, OS issues, etc. etc. etc. The code I’ve written is hugely long and complex and just generally awful. I’ll put a link to it here so I can confess my sins. I wrote it in one quarterly reporting cycle in a “make it work quickly” type way and it will definitely be more reliable and easier to develop if I re-write it properly this quarter.

The commenting of the code is also awful, I’m not going to go through and comment loads of dreadful code just for the blog, but I will sort that out in V 2.0 as well. Note also that they wanted an editable document, so I wrote the whole thing in Sweave, then converted the pdf to Word, which looked dreadful, and then rapidly re-wrote the whole thing to play nicely with LaTeX2rtf, which was incredibly useful.

The whole thing was brew-ed, then compiled to LaTeX. The graphics were done separately, as you can see. I think I’ll bring them in next time because that seems simpler somehow.

It’s very ugly but it did work. In future posts I’ll show how I’ve improved it and perhaps redeem myself a little bit.

For my sins

useR conference update

I’m at the useR! conference so I’ll be blogging every day with at least one thing that I learned.

The first thing, which I think I half-knew and had also half-learned from bitter experience, is that all the R experts seem to use Linux, Ubuntu in the case of the two people who ran the tutorials I attended today. I already dual-boot Windows and Linux and I think over time I’m going to reduce Windows to the operating system on which I play games and check my work email (which I can only get to run on Windows due to security software requirements).

The second thing is a neat way to plot regression when the outcome is binary. I’ve often wondered how I can visualise what’s going on when you have a horrible graph that looks like this:

It’s very simple, and I now know thanks to Douglas Bates’s excellent lme4 tutorial. Draw a graph like this (points are suppressed, you can include them if you want):

Just a few lines of code for the whole kit and caboodle:


library(lattice) # xyplot()

Outcome=sample(0:1, 100, replace=TRUE) # simulate data

Predictor=runif(100)*100 # simulate data

plot(Predictor, Outcome) # ugly graph

xyplot(Outcome~ Predictor, type = c("g", "smooth"), ylab = "Outcome", xlab = "Predictor") # useful graph

You can simulate the data properly so that there is an actual correlation if you want to (e.g. here) but I thought you’d spare me the bother- you get the idea.
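
For completeness, a sketch of what simulating it "properly" might look like: draw the outcome from a probability that rises with the predictor, then fit a logistic regression and draw the fitted curve. The intercept (-3), slope (0.06), sample size and seed are arbitrary values chosen for illustration, and this uses base graphics with glm() rather than the lattice smoother from the tutorial.

```r
set.seed(42)
n <- 200
Predictor <- runif(n) * 100

# Probability of a 1 rises with the predictor (intercept and slope are arbitrary)
p <- plogis(-3 + 0.06 * Predictor)
Outcome <- rbinom(n, size = 1, prob = p)

# Logistic regression of the binary outcome on the predictor
fit <- glm(Outcome ~ Predictor, family = binomial)

# Points plus the fitted probability curve
plot(Predictor, Outcome)
grid_x <- seq(0, 100, length.out = 101)
lines(grid_x, predict(fit, newdata = data.frame(Predictor = grid_x), type = "response"))
```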

Oh yes, I did learn one other thing today- they’re called packages, not libraries. One for the pedants among you.

How do interest rates affect the way my mortgage is paid off?

With a baby on the way my wife and I have become very interested in the interest rate on our mortgage, and how it might go up or down, and how that will affect whether we overpay to build up some equity, etc. etc. etc. As you will know if you’ve ever thought about your mortgage payments, it’s all quite complicated and difficult to think about.

For a bit of fun I thought I would produce a graph which summarises the value of the mortgage over time as well as the proportion of the money which is spent on capital and interest payments. The repayment is fixed for a given value of interest, so a stacked barchart will have a fixed height, with a different proportion of the bar coming from capital and interest payments over time.

I was going to make it a dynamic graph in which you could change the interest rate with a slider (using the excellent RStudio manipulate library) but having worked through it it’s a bit more complicated than I thought- I’ll do this in a subsequent post because I’m dying to give the manipulate library a try. For now I’ve produced the code and the accompanying graph based on a mortgage of £150,000 with an interest rate of 4%.

The code, which really is very simple and owes a big debt to this wonderful post, follows:

# P = principal, the initial amount of the loan
# I = the annual interest rate (from 1 to 100 percent)
# L = length, the length (in years) of the loan, or at least the length over which the loan is amortized.
# J = monthly interest in decimal form = I / (12 x 100)
# N = number of months over which loan is amortized = L x 12
# Monthly payment = M = P * ( J / (1 - (1 + J) ^ -N))
# Step 1: Calculate H = P x J, this is your current monthly interest
# Step 2: Calculate C = M - H, this is your monthly payment minus your monthly interest, so it is the amount of principal you pay for that month
# Step 3: Calculate Q = P - C, this is the new balance of your principal of your loan.
# Step 4: Set P equal to Q and go back to Step 1: You thusly loop around until the value Q (and hence P) goes to zero. 

# set variables

P = 150000
I = 4
L = 25
J = I / (12 * 100)
N = L * 12
M = P * ( J / (1 - (1 + J) ^ -N))

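# make something to store the values in
# (the storage line and the three assignments at the end of the loop did not
# survive in the post; they are reconstructed from the names used in the plots)

Capital = Interest = Principal = numeric(300)

# loop

for (i in 1:300) {

H= P * J
C= M-H
Q= P- C
P= Q

Interest[i] = H
Capital[i] = C
Principal[i] = Q

}
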

# plot


barplot(matrix(rbind(Capital[seq(1, 300, 12)], Interest[seq(1, 300, 12)]), ncol=25), xaxt="n", yaxt="n", col=c("green", "red"), xlim=c(1, 33), bty="n", main=paste("Monthly payment £", round(M), sep=""))
legend("topright", c("Capital payment", "Interest payment"), fill=c("green", "red"), bty="n")


plot(seq(1, 300, 12), Principal[seq(1, 300, 12)], xlab="Year", ylim=c(0,160000), ylab="Remaining loan amount", xaxt="n", xlim=c(1, 330), bty="n")
axis(1, at=seq(1, 300, 12), labels=1:25)

And here’s the plot (click to enlarge)!
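
As a step towards trying different interest rates, the same calculation can be wrapped in a function. This is a sketch only: the function name amortise, its argument names and the 6% comparison rate are made up for illustration, while the arithmetic follows the steps listed in the comments above. It assumes a rate above zero, since a rate of 0 would divide by zero in the payment formula.

```r
# Amortisation schedule for a repayment mortgage.
# principal: amount borrowed; rate: annual interest in percent; years: term
amortise <- function(principal, rate, years) {
  j <- rate / (12 * 100)                        # monthly rate as a decimal
  n <- years * 12                               # number of monthly payments
  m <- principal * (j / (1 - (1 + j)^-n))       # fixed monthly payment
  interest <- capital <- balance <- numeric(n)
  p <- principal
  for (i in seq_len(n)) {
    interest[i] <- p * j                        # H in the notation above
    capital[i]  <- m - interest[i]              # C
    p           <- p - capital[i]               # Q, the new balance
    balance[i]  <- p
  }
  data.frame(month = seq_len(n), payment = m, interest, capital, balance)
}

at4 <- amortise(150000, 4, 25)
at6 <- amortise(150000, 6, 25)

# Monthly payment and total interest paid at each rate
c(payment_4 = at4$payment[1], payment_6 = at6$payment[1])
c(interest_4 = sum(at4$interest), interest_6 = sum(at6$interest))
```

Returning the whole schedule as a data frame means the stacked barchart and the balance plot can both be drawn from one object, whatever rate is passed in.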

New book- The visual display of quantitative information

Took delivery of a new book today, Tufte’s Visual Display of Quantitative Information. Obviously it’s a classic but I had no idea what a beautiful book it is. Camera doesn’t do it justice:

Full of wonderful illustrations from throughout the ages as well.

I’ll cover this and some of my favourite books past and present in future posts.

Fun with wordclouds

As always, I’m late to this party, and wordclouds have come under fire in recent times, e.g. here. From my point of view they’re eye-catching, and I hope that by putting them up on a website or in a report they might cause people to linger and look in more detail at other pieces of data and visualisation. That’s all I’m going to say for now. I’m sure I’ll talk again about what’s attractive to data scientists and statisticians and what’s attractive to the general public, but let’s leave it for now.

I am looking at interesting ways of looking at the patient survey (see previous post) at the moment and I thought I would have a go at a wordcloud. Thanks to the wonderful people producing packages for R (the tm and wordcloud packages, many thanks to both!), it’s easy. I nicked a bit of code from another blog (thanks One R tip a day!) and pretty soon I had my own. It’s from two areas of the Trust, featuring the things people like about our services.

Here’s the code:

library(tm)
library(wordcloud)

mydata=subset(mydatafirst, Var1==0) # slice the data by area

mycorpus=Corpus(DataframeSource(data.frame(mydata[!is.na(mydata$Best),34]))) # make a text corpus for the tm package out of a data frame, removing missing values at the same time

mycorpus <- tm_map(mycorpus, removePunctuation) # remove all the punctuation
mycorpus <- tm_map(mycorpus, tolower) # make everything lower case
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, c(stopwords("english"), "none", "ive", "dont", "etc"))) # remove words without meaning, many of which are provided by the stopwords("english") function

# these next steps take the corpus and turn it into a dataframe ready for the wordcloud function

tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)


wordcloud(d$word, d$freq, scale=c(3.5,.5), min.freq=2, max.words=100, random.order=TRUE, rot.per=.15, colors=terrain.colors(5), vfont=c("sans serif","plain"))

# ...same steps again for the other area

wordcloud(d2$word, d2$freq, scale=c(4,.5), min.freq=2, max.words=100, random.order=TRUE, rot.per=.15, colors=terrain.colors(5), vfont=c("sans serif","plain"))

I adjusted by eye the scale which you can see in the 3rd parameter of the wordcloud function (making the second area a bit larger). There’s probably a better way, I will investigate further as I look more at what to do with all this data.
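
If you want to see what the frequency table behind the cloud looks like without installing tm, the counting steps can be sketched in base R. The comments and the tiny stop-word list below are made up for illustration, and this is a simplified stand-in for the tm pipeline above rather than a replacement for it.

```r
# Made-up comments standing in for the survey responses
comments <- c("The staff were friendly and helpful.",
              "Friendly staff, good food",
              "Staff listened to me",
              NA)

stop_words <- c("the", "were", "and", "to", "me")   # tiny illustrative stop list

text  <- tolower(comments[!is.na(comments)])         # drop missing, lower-case
text  <- gsub("[[:punct:]]", "", text)               # strip punctuation
words <- unlist(strsplit(text, "[[:space:]]+"))      # split into words
words <- words[!words %in% stop_words & nzchar(words)]

# Word counts, most frequent first, in the word/freq layout wordcloud() expects
v <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(v), freq = as.vector(v), stringsAsFactors = FALSE)
```

Looking at d directly is also a handy check on the scale: the ratio between the largest and smallest freq values gives a sense of how wide a range of font sizes the cloud needs to cover.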

Here’s the word cloud:


Not bad for a few hours’ work! I’m hoping it will draw people in to look at more of the reporting that we do at any rate. Let me know what you think of it, and word clouds generally, in the comments.