useR day four and beginner’s odfWeave

I’m a bit late with the report from the last day of the useR conference, I was very busy Thursday getting home and catching up with the housework on Friday. I once again favoured sessions about graphics and best in show must go to Easy interactive ggplots, particularly for a bit of a coding novice like myself. I’m going to take some of the ideas from this talk and use them at work, it will blow everyone’s mind (well, doing it for free with blow everyone’s mind, anyway) so a big thank you to Richie for sharing his ideas and his code.

Since I got home I’ve been manfully struggling with odfWeave again, so I’m going to give a quick guide for new starters like myself in the hope that someone’s Google search for “odfWeave problems” or the suchlike will point them here. It’s for beginners, so if you’re not a beginner then you may as well skip this bit and go for a walk or something.

1) DON’T use windows. Even if you can get the zip/ unzip bit working, which I did eventually, you’ll hit a problem when you want to use an older version of the XML library (which you have to, see below)- unless you’re going to compile it yourself. If you’re going to compile it yourself, you clearly have a different definition of the word “beginner” than I do, and congratulations you on being so modest.

2) DO downgrade your XML version (or if you haven’t installed it yet, never install 3.4 in the first place). It’s dead easy to do, just go here and download the 3.2 version, which works beautifully. (use the install.packages() command with a pointer to the file, or if you use the fabulous RStudio, which I deeply love, then they have an option to use a downloaded file after their install packages button).

I think you should be pretty okay after that. I still got confused by some of the code snippets, so let’s just have a quick round-up of stuff you might want to do and how to do it.

Inline code is just like Sweave, so it’s

Sexpr{version$version.string}

Blocks of code, also, are just like Sweave. Just in case you get in a muddle, which to be honest I did:

1) Code you want to print (for presentations etc.:

<<echo=TRUE>=
rnorm(100)
@

Code chunk finished with @

2) Code you want a graphic out of:

<<fig=TRUE, echo=FALSE>>=
hist(rnorm(100))
@

DON’T forget to encapsulate ggplot2 commands like ggplot and qplot in a print() because of some technical reason that essentially means “that’s just how you do it”. Code chunk finished with @

3) Code you want a table out of:

<<Table1,echo=FALSE,results=xml>>=
tempmat=matrix(with(mydata, table(Ethnicity, Control)), ncol=2)
row.names(tempmat)=row.names(with(mydata, table(Ethnicity, Control)))

odfTable(tempmat, useRowNames=TRUE, colnames=c("", "Control", "Treatment"))
@

You’ll use the odfTable command, which does NOT work on table() objects, just turn them into a matrix or a dataframe and they work fine. Notice how I’ve put the row and column names in as well, it does mention it in the help file, but that’s how you do it anyway. Code chunk finished with @

I think that should be enough to get you started. I wish I’d read this post about 3 months ago, because I’ve been fiddling with odfWeave on and off since then (I started on Windows, I think that was my main mistake really, couldn’t get the zip and unzip commands to work for ages).

Here’s the first bit of the code from a report I’m writing to give you an idea.

Age
This was compared by comparing the distribution and median of the age of individuals in each group.

<<fig=FALSE, echo=FALSE>>=

rm(list=ls())
library(ggplot2)
library(car)

mydata=read.csv("/home/chris/Dropbox/R-files/GCAMT/TCComparison.csv")
mydata$Control=factor(mydata$Control, labels=c("Control", "Treatment"))

@

<<fig=TRUE, echo=FALSE>>=

# DOB

mydata$DOB2=as.Date(mydata$DOB, format="%Y-%m-%d", origin="1899-12-30")

print(

qplot(DOB2, data=mydata, geom="density", colour= Control, main="Median DOB marked blue for treatment, red for control", xlab="Date of birth", ylab="Density") + 
	geom_vline(xintercept = median(subset(mydata, Control=="Treatment")$DOB-25569), col="blue", lty=2) + 
	geom_vline(xintercept = median(subset(mydata, Control=="Control")$DOB-25569), col="red", lty=2) 

	)


@
Ethnicity
Ethnicity within treatment and control is shown below.

<<Table1,echo=FALSE,results=xml>>=
# "Ethnicity"

tempmat=matrix(with(mydata, table(Ethnicity, Control)), ncol=2)
row.names(tempmat)=row.names(with(mydata, table(Ethnicity, Control)))

odfTable(tempmat, useRowNames=TRUE, colnames=c("", "Control", "Treatment"))

@

useR day three

The last post perhaps was a bit over-long on extraneous detail and bits of R-commands that most readers will either know backwards or have no interest in, and today’s learning sheet is over double the length of yesterday’s, so let’s really do one thing I learned today.

One thing I learned today was about the RTextTools package which sounds like a great way of classifying textual data. I’ve already had a go with some text stuff in a previous post but I hadn’t really thought about the whole data mining/ machine learning thing all that seriously until today. I do wonder whether this would have applications to the patient survey (900 responses a quarter, and actually now I think of it we have about 1500 from early in its life that were never classified). There is a person who codes the responses as they come in, and I don’t need or want to take that job off him, because clearly a person is always going to be better than a machine at classifying text, but if I could prototype it then one intriguing possibility would be raised. It would enable us to change the coding system. We set the whole thing up quite quickly and I did rather think we were stuck with the coding system forever (since we don’t want to recode 6000+ rows) but if the computer can do it reasonably well then we could change it as we saw fit. We could even have different coding systems for different parts of the Trust, all sorts of things. All interesting things to muse on.

It just goes to show what is possible when you use R and keep up to date with the amazing community. One of the presentations suggested that people do not always appreciate the work of the R-core team (who devote many thousands of hours completely gratis to R). For my part, I would say that I certainly do appreciate their efforts, moreover I find the whole story of R quite inspiring. R really was my introduction to the world of open-source, and there are some amazingly generous figures in the whole community, the R-core team not least among them, as well as other heroes from Linux, OpenOffice etc. Not only that but a countless army of minor heroes who share their work and source code every day. I’m rather too much of a novice to make a serious contribution but I hope one day to follow the example of these minor heroes and give something back, however small.

Day 2 of useR

Well, I said I was going to blog at least one thing I learned at the conference each day, today I’ve learned about 50 million things (yesterday was tutorials so I learned two things well; seen 20 presentations today, posters later).

To be honest I think this post is going to be more use to me in remembering everything than it will be to anyone else, but I’ll try to do it properly.

I learned about the StepAIC function, which is evidently some sort of stepwise regression which uses the AIC criterion. Someone mentioned it in passing but it sounds pretty useful. I also learned about Box-Cox plots (this is probably really embarrassing that I don’t know this, my PhD was psychology, I’m afraid I’ve learned a lot about stats after the fact, just getting the research and the thesis done was enough really in four years); they also sound pretty cool and useful.

I learned about Mosaic plots, the structable command, and relative multiple barcharts (which, from memory, are in the extracat package). If you don’t know these are useful for looking at 3-way and higher cross-tabs type tables. Looks pretty cool.

I learned about running R in the cloud all sorts of ways, the most interesting sounded like using RGoogleDocs and Apache and I wonder if I could get this working at work, will investigate further.

And last but not least I learned yet another way of automating the patient survey, this time using the Org mode of EMacs. So far I have encountered quite a bewildering array of different ways of achieving the same task:

1) Sweave
2) odfWeave
3) Sweave and brew (this is what I have used so far)
4) Org mode
5) R2wd

Each has various advantages, disadvantages, OS issues, etc. etc. etc. The code I’ve written is hugely long and complex and just generally awful. I’ll put a link to it here so I can confess my sins. I wrote it in one quarterly reporting cycle in a “make it work quickly” type way and it will definitely be more reliable and easier to develop if I re-write it properly this quarter.

The commenting of the code is also awful, I’m not going to go through and comment loads of dreadful code just for the blog, but I will sort that out in V 2.0 as well. Note also that they wanted an editable document, so I wrote the whole thing in Sweave, then converted the pdf to Word, which looked dreadful, and then rapidly re-wrote the whole thing to play nicely with LaTeX2rtf, which was incredibly useful.

The whole thing was brew-ed, then compiled to LaTeX. The graphics were done separately, as you can see. I think I’ll bring them in next time because that seems simpler somehow.

It’s very ugly but it did work. In future posts I’ll show how I’ve improved it and perhaps redeem myself a little bit.

For my sins

useR conference update

I’m at the useR! conference so I’ll be blogging every day with at least one thing that I learned.

The first thing, which I think I half-knew and had also half-learned from bitter experience, is that all the R experts seem to use Linux, Ubuntu in the case of the two people who ran the tutorials I attended today. I already dual-boot Windows and Linux and I think over time I’m going to reduce Windows to the operating system on which I play games and check my work email (which I can only get to run on Windows due to security software requirements).

The second thing is a neat way to plot regression when the outcome is binary. I’ve often wondered how I can visualise what’s going on when you have a horrible graph that looks like this:

It’s very simple, and I now know thanks to Douglas Bates’s excellent lme4 tutorial. Draw a graph like this (points are suppressed, you can include them if you want):

Just a few lines of code for the whole kit and caboodle:

library(lattice)

Outcome=sample(0:1, 100, replace=TRUE) # simulate data

Predictor=runif(100)*100 # simulate data

plot(Predictor, Outcome) # ugly graph

xyplot(Outcome~ Predictor, type = c("g", "smooth"), ylab = "Outcome", xlab = "Predictor") # useful graph

You can simulate the data properly so that there is an actual correlation if you want to (e.g. here) but I thought you’d spare me the bother- you get the idea.

Oh yes, I did learn one other thing today- they’re called packages, not libraries. One for the pedants among you.

How do interest rates affect the way my mortgage is paid off?

With a baby on the way my wife and I have become very interested in the interest rate on our mortgage, and how it might go up or down, and how that will affect whether we overpay to build up some equity, etc. etc. etc. As you will know if you’ve ever thought about your mortgage payments, it’s all quite complicated and difficult to think about.

For a bit of fun I thought I would produce a graph which summarises the value of the mortgage over time as well as the proportion of the money which is spent on capital and interest payments. The repayment is fixed for a given value of interest, so a stacked barchart will have a fixed height, with a different proportion of the bar coming from capital and interest payments over time.

I was going to make it a dynamic graph in which you could change the interest rate with a slider (using the excellent RStudio manipulate library) but having worked through it it’s a bit more complicated than I thought- I’ll do this in a subsequent post because I’m dying to give the manipulate library a try. For now I’ve produced the code and the accompanying graph based on a mortgage of £150,000 with an interest rate of 4%.

The code, which really is very simple and owes a big debt to this wonderful post, follows:

# P = principal, the initial amount of the loan
# I = the annual interest rate (from 1 to 100 percent)
# L = length, the length (in years) of the loan, or at least the length over which the loan is amortized.
# 
# J = monthly interest in decimal form = I / (12 x 100)
# N = number of months over which loan is amortized = L x 12
# Monthly payment = M = P * ( J / (1 - (1 + J) ^ -N))
# 
# Step 1: Calculate H = P x J, this is your current monthly interest
# Step 2: Calculate C = M - H, this is your monthly payment minus your monthly interest, so it is the amount of principal you pay for that month
# Step 3: Calculate Q = P - C, this is the new balance of your principal of your loan.
# Step 4: Set P equal to Q and go back to Step 1: You thusly loop around until the value Q (and hence P) goes to zero. 

# set variables

P = 150000
I = 4
L = 25
J = I / (12 * 100)
N = L * 12
M = P * ( J / (1 - (1 + J) ^ -N))

# make something to store the values in

Capital=numeric(300)
Interest=numeric(300)
Principal=numeric(300)

# loop

for (i in 1:300) {

H= P * J
C= M-H
Q= P- C
P= Q

Capital[i]=C
Interest[i]=H
Principal[i]=Q

}

# plot

par(cex=.7)

barplot(matrix(rbind(Capital[seq(1, 300, 12)], Interest[seq(1, 300, 12)]), ncol=25), xaxt="n", yaxt="n", col=c("green", "red"), xlim=c(1, 33), bty="n", main=paste("Monthly payment £", round(M), sep=""))
legend("topright", c("Capital payment", "Interest payment"), fill=c("green", "red"), bty="n")

par(new=TRUE)

plot(seq(1, 300, 12), Principal[seq(1, 300, 12)], xlab="Year", ylim=c(0,160000), ylab="Remaining loan amount", xaxt="n", xlim=c(1, 330), bty="n")
axis(1, at=seq(1, 300, 12), labels=1:25)

And here’s the plot (click to enlarge)!

Fun with wordclouds

As always, I’m late to this party, and wordclouds have come under fire in recent times, e.g. here: drewconway.com. From my point of view they’re eye-catching, and I hope that by putting them up on a website or in a report they might cause people to linger and look in more detail at other pieces of data and visualisation. That’s all I’m going to say for now, I’m sure I’ll talk again about what’s attractive to data scientists and statisticians and what’s attractive to the general public, but let’s leave it for now.

I am looking at interesting ways of looking at the patient survey (see previous post) at the moment and I thought I would have a go at a wordcloud. Thanks to the wonderful people producing packages for R (the tm and wordcloud packages, many thanks to both!), it’s easy. I nicked a bit of code from another blog (thanks One R tip a day!) and pretty soon I had my own. It’s from two areas of the Trust, featuring the things people like about our services.

Here’s the code:


mydata=subset(mydatafirst, Var1==0) # slice the data by area

mycorpus=Corpus(DataframeSource(data.frame(mydata[-is.na(mydata$Best),34]))) # make a text corpus for the tm package out of a data frame, removing missing values at the same time

mycorpus <- tm_map(mycorpus, removePunctuation) # remove all the puncuation
mycorpus <- tm_map(mycorpus, tolower) # make everything lower case
mycorpus <- tm_map(mycorpus, function(x) removeWords(x, c(stopwords("english"), "none", "ive", "dont", "etc"))) # remove words without meaning, many of which are provided by the stopwords("english") function

# these next steps take the corpus and turn it into a dataframe ready for the wordcloud function

tdm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

par(mfrow=c(1,2))

wordcloud(d$word,d$freq,c(3.5,.5),2,100,TRUE,.15, terrain.colors(5),vfont=c("sans serif","plain"))

# ...same steps again for the other area

wordcloud(d2$word,d2$freq,scale=c(4,.5),2,100,TRUE,.15, terrain.colors(5),vfont=c("sans serif","plain"))

I adjusted by eye the scale which you can see in the 3rd parameter of the wordcloud function (making the second area a bit larger). There’s probably a better way, I will investigate further as I look more at what to do with all this data.

Here’s the word cloud:

Rplot02

Not bad for a few hours’ work! I’m hoping it will draw people in to look at more of the reporting that we do at any rate. Let me know what you think of it, and word clouds generally, in the comments.