useR conference update

I’m at the useR! conference so I’ll be blogging every day with at least one thing that I learned.

The first thing, which I think I half-knew and had also half-learned from bitter experience, is that all the R experts seem to use Linux: Ubuntu, in the case of the two people who ran the tutorials I attended today. I already dual-boot Windows and Linux, and I think over time I’m going to reduce Windows to the operating system on which I play games and check my work email (which I can only get to run on Windows because of security software requirements).

The second thing is a neat way to plot a regression when the outcome is binary. I’ve often wondered how to visualise what’s going on when you have a horrible graph that looks like this:

It’s very simple, and I now know thanks to Douglas Bates’s excellent lme4 tutorial. Draw a graph like this instead (the points are suppressed here, but you can include them if you want):

Just a few lines of code for the whole kit and caboodle:

library(lattice)

Outcome = sample(0:1, 100, replace=TRUE) # simulate data
Predictor = runif(100) * 100 # simulate data

plot(Predictor, Outcome) # ugly graph

xyplot(Outcome ~ Predictor, type = c("g", "smooth"), ylab = "Outcome", xlab = "Predictor") # useful graph

You can simulate the data properly so that there is an actual correlation if you want to (e.g. here), but I thought I’d spare myself the bother; you get the idea.
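If you do want a genuine relationship in there, here’s a minimal sketch; the intercept and slope inside plogis() are just numbers I’ve picked to give a visible effect:

Predictor = runif(100) * 100
Outcome = rbinom(100, 1, plogis(-3 + 0.06 * Predictor)) # probability of a 1 rises with the predictor
xyplot(Outcome ~ Predictor, type = c("g", "smooth"), ylab = "Outcome", xlab = "Predictor")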

Oh yes, I did learn one other thing today: they’re called packages, not libraries. One for the pedants among you.

How do interest rates affect the way my mortgage is paid off?

With a baby on the way my wife and I have become very interested in the interest rate on our mortgage, and how it might go up or down, and how that will affect whether we overpay to build up some equity, etc. etc. etc. As you will know if you’ve ever thought about your mortgage payments, it’s all quite complicated and difficult to think about.

For a bit of fun I thought I would produce a graph which summarises the value of the mortgage over time as well as the proportion of the money which is spent on capital and interest payments. The repayment is fixed for a given value of interest, so a stacked barchart will have a fixed height, with a different proportion of the bar coming from capital and interest payments over time.

I was going to make it a dynamic graph in which you could change the interest rate with a slider (using RStudio’s excellent manipulate package), but having worked through it, it’s a bit more complicated than I thought. I’ll do that in a subsequent post, because I’m dying to give manipulate a try. For now I’ve produced the code and the accompanying graph based on a mortgage of £150,000 with an interest rate of 4%.

The code, which really is very simple and owes a big debt to this wonderful post, follows:

# P = principal, the initial amount of the loan
# I = the annual interest rate (from 1 to 100 percent)
# L = length, the length (in years) of the loan, or at least the length over which the loan is amortized.
# 
# J = monthly interest in decimal form = I / (12 x 100)
# N = number of months over which loan is amortized = L x 12
# Monthly payment = M = P * ( J / (1 - (1 + J) ^ -N))
# 
# Step 1: Calculate H = P x J, this is your current monthly interest
# Step 2: Calculate C = M - H, this is your monthly payment minus your monthly interest, so it is the amount of principal you pay for that month
# Step 3: Calculate Q = P - C, this is the new balance of your principal of your loan.
# Step 4: Set P equal to Q and go back to Step 1. Loop around like this until the value Q (and hence P) goes to zero.
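# Worked example, as a sanity check (my arithmetic, not from the original post):
# with P = 150000, I = 4 and L = 25, J = 0.04 / 12 = 0.00333... and N = 300,
# so M = 150000 * (0.00333... / (1 - 1.00333...^-300)) = roughly 791.75, i.e. about £792 a month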

# set variables

P = 150000
I = 4
L = 25
J = I / (12 * 100)
N = L * 12
M = P * ( J / (1 - (1 + J) ^ -N))

# make something to store the values in

Capital = numeric(N)
Interest = numeric(N)
Principal = numeric(N)

# loop

for (i in 1:N) {
  H = P * J # interest owed this month
  C = M - H # capital repaid this month
  Q = P - C # new balance of the loan
  P = Q

  Capital[i] = C
  Interest[i] = H
  Principal[i] = Q
}

# plot

par(cex=.7)

barplot(rbind(Capital[seq(1, 300, 12)], Interest[seq(1, 300, 12)]), xaxt="n", yaxt="n", col=c("green", "red"), xlim=c(1, 33), main=paste("Monthly payment £", round(M), sep=""))
legend("topright", c("Capital payment", "Interest payment"), fill=c("green", "red"), bty="n")

par(new=TRUE)

plot(seq(1, 300, 12), Principal[seq(1, 300, 12)], xlab="Year", ylim=c(0,160000), ylab="Remaining loan amount", xaxt="n", xlim=c(1, 330), bty="n")
axis(1, at=seq(1, 300, 12), labels=1:25)

And here’s the plot!
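For the impatient, here’s roughly the shape the manipulate version will take. This is only a sketch: mortgagePlot is a hypothetical function, not written yet, that would wrap the calculation and plotting above with the interest rate as its argument.

library(manipulate)

manipulate(mortgagePlot(I), I = slider(1, 10, initial = 4, step = 0.25)) # drag the slider to change the rate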

New book: The Visual Display of Quantitative Information

Took delivery of a new book today: Tufte’s The Visual Display of Quantitative Information. Obviously it’s a classic, but I had no idea what a beautiful book it is. My camera doesn’t do it justice:

Full of wonderful illustrations from throughout the ages as well.

I’ll cover this and some of my favourite books past and present in future posts.

Fun with wordclouds

As always, I’m late to this party, and wordclouds have come under fire in recent times (e.g. here: drewconway.com). From my point of view they’re eye-catching, and I hope that by putting them up on a website or in a report they might cause people to linger and look in more detail at other pieces of data and visualisation. That’s all I’m going to say for now; I’m sure I’ll talk again about what’s attractive to data scientists and statisticians versus what’s attractive to the general public, but let’s leave it there.

I am looking at interesting ways of presenting the patient survey (see previous post) at the moment, and I thought I would have a go at a wordcloud. Thanks to the wonderful people producing packages for R (the tm and wordcloud packages, many thanks to both!), it’s easy. I nicked a bit of code from another blog (thanks One R tip a day!) and pretty soon I had my own. It’s from two areas of the Trust, featuring the things people like about our services.

Here’s the code:


library(tm)
library(wordcloud)

mydata = subset(mydatafirst, Var1==0) # slice the data by area

mycorpus = Corpus(DataframeSource(data.frame(mydata[!is.na(mydata$Best), 34]))) # make a text corpus for the tm package out of a data frame, removing missing values at the same time

mycorpus <- tm_map(mycorpus, removePunctuation) # remove all the punctuation
mycorpus <- tm_map(mycorpus, tolower) # make everything lower case
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, c(stopwords("english"), "none", "ive", "dont", "etc"))) # remove words without meaning, many of which are provided by the stopwords("english") function

# these next steps take the corpus and turn it into a dataframe ready for the wordcloud function

tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

par(mfrow=c(1,2))

wordcloud(d$word, d$freq, scale=c(3.5, .5), min.freq=2, max.words=100, random.order=TRUE, rot.per=.15, colors=terrain.colors(5), vfont=c("sans serif", "plain"))

# ...same steps again for the other area

wordcloud(d2$word, d2$freq, scale=c(4, .5), min.freq=2, max.words=100, random.order=TRUE, rot.per=.15, colors=terrain.colors(5), vfont=c("sans serif", "plain"))

I adjusted the scale argument of the wordcloud function by eye (making the second area a bit larger). There’s probably a better way; I will investigate further as I look more at what to do with all this data.

Here’s the word cloud:


Not bad for a few hours’ work! At any rate, I’m hoping it will draw people in to look at more of the reporting that we do. Let me know what you think of it, and of word clouds generally, in the comments.

The patient survey: past, present and future

Within my organisation we have something known as the Service User and Carer Experience Survey, often abbreviated to the SUCE, or in more natural spoken English, the patient survey. It’s a chance for the users of our services to tell us about our services, and includes Likert-type “ticky box” questions about specific aspects of care as well as an open section in which they can give an area for improvement and something they like about the service. Results are published quarterly, and reporting represents a major challenge: we report on about 100 teams, all organised into larger directorates and divisions, with each category having its own type of report. I’ve been involved right from the start, and the process has gone nicely from a How-Not-To-Do-It guide to a How-To guide.

At first I was responsible for all data entry and reporting, and in my naïveté I did everything manually. I made a page in Microsoft Excel which automatically generated frequency tables from raw data, and then fed those values into pie charts (yes, I know: juiceanalytics.com/writing/the-problem-with-pie-charts, but I’m afraid many find them accessible, possibly due to familiarity, or just their overall “friendly” appearance). I made the bar graphs in SPSS. The whole thing took ages and was done quarterly. The first step on the road to getting the computer to shoulder this massive burden was writing some R (The R project) code which produced the graphs. It was very easy to do (I was a real novice back then), and here it is:


# select the current team, denoted x, and select the most recent data for the pie charts (put into Subset.P). Put data for the team from all points of time into Subset.B, which will be used to draw the barcharts

Subset.P = subset(mydata, Code==x & Time==8)
Subset.B = subset(mydata, Code==x)

# add labels to each variable, and calculate the percentages in each category

Service.P=table(factor(Subset.P$Service, levels=1:6, labels=c("Very poor", "Poor", "Fair", "Good", "Very good", "Excellent")))
Service.P=Service.P/sum(Service.P)
names(Service.P) <- paste(names(Service.P), "-", round(Service.P*100), "%", sep="")

# this bit fixes the colours and steps through each category removing the colours from the empty categories. Without this step the colours aren't consistent if there are any empty categories

attributes(Service.P)$cols <- rainbow(6)
if(0 %in% Service.P){
    ind <- which(Service.P == 0)
    Service.P <- Service.P[-ind]
    attributes(Service.P)$cols <- rainbow(6)[-ind]
}

# and so on for each question

Comm.P=table(factor(Subset.P$Communication, levels=1:6, labels=c("Very poor", "Poor", "Fair", "Good", "Very good", "Excellent")))
Comm.P=Comm.P/sum(Comm.P)
names(Comm.P) <- paste(names(Comm.P), "-", round(Comm.P*100), "%", sep="")
attributes(Comm.P)$cols <- rainbow(6)
if(0 %in% Comm.P){
    ind <- which(Comm.P == 0)
    Comm.P <- Comm.P[-ind]
    attributes(Comm.P)$cols <- rainbow(6)[-ind]
}

#... I've edited the repetitions of the above out to save space

# draw all the charts in a 3 by 2 grid ready for copying straight into the report (the last graph is drawn later)

par(mfrow=c(3,2))

pie(Service.P, clockwise=TRUE, main = c("Service quality"), radius=1, cex=1.1, init.angle=45, col=attr(Service.P, "cols"))
pie(Comm.P, clockwise=TRUE, main = c("Communication"), radius=1, cex=1.3, init.angle=45, col=attr(Comm.P, "cols"))
pie(Ldign.P, clockwise=TRUE, main = c("Dignity and respect"), radius=1, cex=1.3, init.angle=45, col=attr(Ldign.P, "cols"))
pie(Involved.P, clockwise=TRUE, main = c("Involved in care"), radius=1, cex=1.3, init.angle=45, col=attr(Involved.P, "cols"))
pie(Improved.P, clockwise=TRUE, main = c("Improved life"), radius=1, cex=1.3, init.angle=45, col=attr(Improved.P, "cols"))

# take the Subset.B data, which contains all data for each team across each time point, as opposed to the Subset.P which is just this quarter's data for the pie charts

# also rescale the figures so that they all have a maximum of 5

Subset.B$Service=Subset.B$Service*5/6
Subset.B$Communication=Subset.B$Communication*5/6
Subset.B$Ldign=Subset.B$Ldign*5/3
Subset.B$Involved=Subset.B$Involved*5/3

# produce a data frame, ybar, which will hold the responses for each
# question at 5 time points: last year and the next 4 quarters

ybar=as.data.frame(matrix(data=NA, nrow=5, ncol=5))
names(ybar)=c("Apr - Mar 10", "Apr - Jun 10", "Jul - Sept 10", "Oct - Dec 10", "Jan - Mar 11")

# construct each row from the mean at Time < 5, Time == 5, Time == 6, and so on

ybar[1,] = c(mean(subset(Subset.B, Time<5)$Service, na.rm=TRUE),
             mean(subset(Subset.B, Time==5)$Service, na.rm=TRUE),
             mean(subset(Subset.B, Time==6)$Service, na.rm=TRUE),
             mean(subset(Subset.B, Time==7)$Service, na.rm=TRUE),
             mean(subset(Subset.B, Time==8)$Service, na.rm=TRUE))

ybar[2,] = c(mean(subset(Subset.B, Time<5)$Communication, na.rm=TRUE),
             mean(subset(Subset.B, Time==5)$Communication, na.rm=TRUE),
             mean(subset(Subset.B, Time==6)$Communication, na.rm=TRUE),
             mean(subset(Subset.B, Time==7)$Communication, na.rm=TRUE),
             mean(subset(Subset.B, Time==8)$Communication, na.rm=TRUE))

ybar[3,] = c(mean(subset(Subset.B, Time<5)$Ldign, na.rm=TRUE),
             mean(subset(Subset.B, Time==5)$Ldign, na.rm=TRUE),
             mean(subset(Subset.B, Time==6)$Ldign, na.rm=TRUE),
             mean(subset(Subset.B, Time==7)$Ldign, na.rm=TRUE),
             mean(subset(Subset.B, Time==8)$Ldign, na.rm=TRUE))

ybar[4,] = c(mean(subset(Subset.B, Time<5)$Involved, na.rm=TRUE),
             mean(subset(Subset.B, Time==5)$Involved, na.rm=TRUE),
             mean(subset(Subset.B, Time==6)$Involved, na.rm=TRUE),
             mean(subset(Subset.B, Time==7)$Involved, na.rm=TRUE),
             mean(subset(Subset.B, Time==8)$Involved, na.rm=TRUE))

ybar[5,] = c(mean(subset(Subset.B, Time<5)$Improved, na.rm=TRUE),
             mean(subset(Subset.B, Time==5)$Improved, na.rm=TRUE),
             mean(subset(Subset.B, Time==6)$Improved, na.rm=TRUE),
             mean(subset(Subset.B, Time==7)$Improved, na.rm=TRUE),
             mean(subset(Subset.B, Time==8)$Improved, na.rm=TRUE))
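
# (looking back, the whole ybar block above could be one sapply call,
# something like this sketch, but I didn't know that back then)
# ybar <- as.data.frame(t(sapply(
#     c("Service", "Communication", "Ldign", "Involved", "Improved"),
#     function(v) c(mean(Subset.B[[v]][Subset.B$Time < 5], na.rm = TRUE),
#                   sapply(5:8, function(tt)
#                       mean(Subset.B[[v]][Subset.B$Time == tt], na.rm = TRUE))))))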

# remove all the empty bits, to prevent the graph containing ugly holes

if(any(is.nan(unlist(ybar[1,])))){
    ybar = ybar[, !is.nan(unlist(ybar[1,]))]
}

# draw the graph, the enormous ylim value is to give room at the top for the legend 

barplot(as.matrix(ybar), beside=T, col=rainbow(5), ylim=c(0,9), yaxt = "n")
axis(2, at = 0:5)
legend("top", legend=c("Service quality", "Communication", "Dignity and respect",
                       "Involved with care", "Improved life"), fill=rainbow(5), bty="n", cex=1, ncol = 2 )

# now manually change the value of x to go to the next team and re-run the code

This greatly sped up the process of producing the graphs, but left a lot of the process in the hands of a human: figuring out which teams were reporting that quarter, running the code, pasting the graphs, counting the responses, etc. etc. As I’ve grown more confident I’ve given more and more of these tasks to the computer and saved more and more time. I’ll show some more of the steps in subsequent posts, leading up to where we are now and plans for the future.
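
As a taste of the sort of tidying I mean, here’s a sketch of a helper function that would replace all those repeated table-and-colour blocks. It isn’t part of the original report code, just how I’d write it now:

makePie = function(x, levels, labels) {
    # tabulate one question, convert to percentages and label the slices
    tab = table(factor(x, levels=levels, labels=labels))
    tab = tab / sum(tab)
    names(tab) = paste(names(tab), "-", round(tab * 100), "%", sep="")
    # fix the colours, dropping those belonging to empty categories
    cols = rainbow(length(levels))
    if(0 %in% tab){
        ind = which(tab == 0)
        tab = tab[-ind]
        cols = cols[-ind]
    }
    attributes(tab)$cols = cols
    tab
}

# e.g. Service.P = makePie(Subset.P$Service, 1:6, c("Very poor", "Poor", "Fair", "Good", "Very good", "Excellent"))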

Welcome to the blog

Inspired by the many excellent blogs about research, statistics, and computing, I have finally decided to set up my own. My work for Nottinghamshire Healthcare NHS Trust and the Institute of Mental Health comprises advice and consultancy, research and evaluation, and data visualisation. I like to use R for my statistics, and have recently started producing reproducible research using Sweave and odfWeave. Over time I have found I need to write more and more code, and I am starting to study Java in October to help me on the way to becoming a proper programmer.

Expect posts about all of these things, and miscellany relating to statistics, data science and healthcare. To get started I’m going to do some retrospective posts over the next few weeks about what brought me here, some of my project work, and where I’m going over the next year.

In this and all subsequent posts the content reflects my own beliefs and opinions, not those of Nottinghamshire Healthcare NHS Trust, the NHS, or the Institute of Mental Health or any of its partners.