A joke about R

I’m so sorry, I’ve come across an R joke I must share. If you don’t care about R, look away now because this post is definitely not for you.

A scientist and a statistician are working together in the office. The scientist says to the statistician “Which is the fastest way to identify non-zero elements in an array using R?”.

The statistician replies “Yes.”

“No, which is the fastest way?”

“Yes, it is.”

“No, I’m asking you, which is the fastest way?”

“Yes, you’re right, it is.”

Etc…

[if you don’t get it and you’re still reading, here].

Thanks to Hadley Wickham, progenitor of the first R joke I’ve come across.

EDIT: I thought I stole this joke structure from the “Hu is the President of China” bit (I forget where I heard it from) but thanks to the magic of Twitter I now know that they stole it from Abbott and Costello:

Parallelising with R

I’ve been fitting some generalised linear mixed effects models with 400,000 rows or so and it takes a while even on my quad core 3.4Ghz behemoth. The time has come for me to learn how to use multiple cores with R. The parallel library which comes with the new-ish R 2.14 makes it pretty easy. For my purposes, all I needed to do was pop all of my calculations in a list and then call mclapply on them.

This is my first time parallelising so this might not be the best way to do it, if you know better please say in the comments, but if you are interested to give it a try then this should get you started.


mymodels=list("glmer(AllViol~Clinical*Dir+(1|ID),
              family=binomial, data=testF)",
              "glmer(AllViol~Risk*Dir+(1|ID), 
              family=binomial, data=testF)",
              "glmer(AllViol~Historical*Dir+(1|ID), 
              family=binomial, data=testF)",
              "glmer(SerViol~Clinical*Dir+(1|ID), 
              family=binomial, data=testSer)",
              "glmer(SerViol~Risk*Dir+(1|ID), 
              family=binomial, data=testSer)",
              "glmer(SerViol~Historical*Dir+(1|ID),
              family=binomial, data=testSer)"
              )
  
myfinal=mclapply(mymodels, function(x) eval(parse(text=x)))

One thing I would add is that this by no means gets the full speed out of my CPU, which only ran at about 30% and not even all of the time. I think there’s probably more that I could do to get a speed boost, but this has got this job done and got me thinking about parallel processing.

Quick R tip- making scripts work across OS’s

Quick thought for newbies to R. I have spent four years loading in data like this

mydatafirst=read.csv("D:\Dropbox\R-files\Project1\etc.")

on my Windows machine and like this

mydatafirst=read.csv("/home/chris/Dropbox/R-files/Project1/etc.")

on my Linux machine. I would comment out the one I wasn’t using at the time.

It’s fine for a while but once you start reading multiple dataframes (particularly if they’re not all loaded at the start) it gets very fiddly and annoying. Don’t do this. Just set your working directory. You can do it manually:

setwd(“D:\Dropbox\R-files\etc.”)

Or, even better, the seamless option:

setwd(ifelse(Sys.info()['sysname']=="Windows";,
             "D:\Dropbox\R-files\Project1\",
             "~/Dropbox/R-files/Project1/"))

Mmm, I can almost taste the hours I’ll save in the course of a year.

Because it’s Friday- animated graphs

I’ve been analysing some data and we’re a bit unsure, as we often are, as to its quality. We have a variety of clinicians rating C.tot and R.tot over a four year period for a set of patients. To summarise, there was some concern that people are rating C.tot consistently and well but not so with R.tot (I’m not going to explain the variables or where they come from because it’s not relevant and it’s an internal dataset we’re not ready to share yet).

I’m sure there is a statistical approach (essentially we need to check to see if the R.tot scores are jumping around) but I thought a graphical approach might suffice for now.

So I talk all the individuals who had been in the dataset for long enough, and put each of their score trajectories on a graph, and then animated so you see the first patient, then the second and the first, then the first second and third, etc.

Actually the R.tot scores look pretty sensible looking at the animation. Have a look and see what you think (click to animate).

So easy to achieve with the mighty ggplot and animation packages from R, here’s all the code


library(animation)
library(ggplot2)

mydf=melt(mytemp, measure.vars=c("R.tot", "C.tot"), id.vars=c("ID", "Week"))

for (i in unique(mydf$ID)) {
  
  png(filename=paste(i, ".png", sep=""))
  
  print(
    ggplot(subset(mydf, ID <= i), aes(x=Week, y=value)) + 
      geom_line(aes(group=ID)) +
      xlab("Week") + 
      opts(title = "Clinical and risk score over time") +
      facet_wrap(~variable, ncol=1)
    )
  dev.off()
  
}

## change outdir to the current working directory
ani.options(outdir = getwd(), interval=0.05)

im.convert(paste(unique(mydf$ID), ".png", sep=""), output = "animation.gif")

Amazing video- Inventing on principle

I’ve just seen an amazing video, Brett Victor talking about Inventing on Principle.

It should be of great interest to programmers and creatives and indeed anyone who needs or wants to come up with new and exciting ideas (everyone, in other words). If you’re not a programmer, stick with the video because he broadens out the discussion a lot a third of the way in. And if you are, just get stuck in and watch. You won’t regret it.