Over at this excellent blog there is an interesting discussion about times when means and medians can be deceptive, particularly when two variables with equal means have very different distributions. I chimed in myself and mentioned some of the examples that I come across in my work. Here is a particularly egregious example, a measurement of self-esteem in patients on psychiatric wards in England and Belgium:
          England   Belgium
Mean      3.4       3.2
Median    3.3       3.2
SD        1.2       0.9
Looks pretty similar on the face of it. Let’s have a look at the actual distributions.
Pretty different. Quite interesting to consider why the two are so different. It would appear on the face of it that the measure works better in Belgium, producing a nice normal distribution, and not so well in England, where many individuals are selecting the maximum response across all the items in the scale.
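One quick way to put a number on that pile-up at the top of the scale is to compute the share of respondents in each country who score the maximum. This is only a minimal sketch: it assumes a data frame laid out like `mydata` (a `country` column coded 1 for England and 2 for Belgium) and a scale maximum of 5, which I am inferring from the histogram breaks. `ceiling_share` is a made-up helper name, and the toy data frame exists purely so the example runs.

```r
# Hypothetical helper: share of non-missing scores at the scale maximum, by group.
# max_score = 5 is an assumption based on the histogram breaks, not a known fact.
ceiling_share <- function(data, score, group, max_score = 5) {
  at_max <- data[[score]] >= max_score
  tapply(at_max, data[[group]], mean, na.rm = TRUE)
}

# Toy data for illustration only (not the real ward data).
toy <- data.frame(
  country    = c(1, 1, 1, 1, 2, 2, 2, 2),
  Selfesteem = c(5, 5, 3.2, NA, 3.0, 3.4, 5, 2.8)
)

ceiling_share(toy, "Selfesteem", "country")
```

On the real data the call would be something like `ceiling_share(mydata, "Selfesteem", "country")`. A large gap between the two countries would back up what the histograms suggest.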
Too often, I think, we describe non-normal distributions by their median alone when, as you can see here, many sins can be hidden that way. I don’t know why the self-esteem measure is behaving like this in England, but we haven’t finished with these data, so look out for more on the blog as we take a more thorough look.
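To illustrate the general point, here is a small simulation. The data are entirely made up and are not the ward data. Two groups are built to have roughly similar means and medians, but one has a block of respondents sitting at the scale maximum. The helper name `describe`, the 0 to 5 range, and all the simulation parameters are my own assumptions, chosen just to show that a few extra quantiles can reveal what the mean and median hide.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated, hypothetical scores clamped to a 0-5 scale.
# Group A: roughly bell-shaped.
a <- pmin(pmax(rnorm(500, mean = 3.3, sd = 0.9), 0), 5)
# Group B: a lower bell-shaped bulk plus 100 respondents at the maximum.
b <- c(pmin(pmax(rnorm(400, mean = 2.9, sd = 0.8), 0), 5), rep(5, 100))

# Hypothetical helper: mean, median and sd alongside a spread of quantiles.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    quantile(x, c(0.10, 0.25, 0.75, 0.90)))
}

round(rbind(A = describe(a), B = describe(b)), 2)
```

The central summaries should land fairly close together, while the upper quantiles for group B are pinned at the maximum. That is the kind of difference a median on its own will never show.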
R code:
# Print summary statistics and draw side-by-side histograms for column x of mydata.
# country is coded 1 = England, 2 = Belgium.
mygraph = function(x) {
  par(mfrow = c(1, 2))  # two plots side by side

  a = subset(mydata, country == 1)  # England
  b = subset(mydata, country == 2)  # Belgium

  print("England mean")
  print(mean(a[, x], na.rm = TRUE))
  print("Belgium mean")
  print(mean(b[, x], na.rm = TRUE))

  print("England median")
  print(median(a[, x], na.rm = TRUE))
  print("Belgium median")
  print(median(b[, x], na.rm = TRUE))

  print("England sd")
  print(sd(a[, x], na.rm = TRUE))
  print("Belgium sd")
  print(sd(b[, x], na.rm = TRUE))

  # Histograms on the density scale, each with a kernel density estimate overlaid
  hist(a[, x], main = "England", xlab = "Self esteem score",
       breaks = seq(0, 5, by = .2), freq = FALSE)
  lines(density(a[, x], na.rm = TRUE))

  hist(b[, x], main = "Belgium", xlab = "Self esteem score",
       breaks = seq(0, 5, by = .2), freq = FALSE)
  lines(density(b[, x], na.rm = TRUE))
}

mygraph("Selfesteem")