One weird trick for getting column types right with read_csv

Using read_csv from the tidyverse is so easy that I didn’t bother to look at the readr documentation for a long time. However, I’m glad I did, because there is, as they say in the clickbait world, one weird trick to get your column types right with read_csv. read_csv (or the other delimited file reading functions like read_tsv) does a brilliant job of guessing column types, but by default it only looks at the first 1000 rows. That’s fine for most datasets, but I have more than one dataset where a column’s values are missing for the first 1000 rows, which doesn’t help the parser at all. The answer is to specify the column types manually and get them right. But what a pain, all that typing, right? Wrong. Just do this:


library(readr)
testSpec = read_csv("masterTest.csv")

And you’ll get this output automatically:


Parsed with column specification:
cols(
  TeamN = col_character(),
  Time = col_integer(),
  TeamC = col_double(),
  Division = col_integer(),
  Directorate = col_integer(),
  Contacts = col_integer(),
  HIS = col_character(),
  Inpatient = col_character(),
  District = col_character(),
  SubDistrict = col_character(),
  fftCategory = col_character()
)

You’re supposed to copy and paste that into a new call, putting right any mistakes. And in fact there is one in this very spreadsheet: the parser incorrectly guesses that Inpatient is character when it is in fact integer, because its values are missing for the first 1000 rows.

So just copy all that into a new call and fix the mistake, like this:


testSpec = read_csv("masterTest.csv", 
                    col_types = 
                      cols(TeamN = col_character(),
                           Time = col_integer(),
                           TeamC = col_double(),
                           Division = col_integer(),
                           Directorate = col_integer(),
                           Contacts = col_integer(),
                           HIS = col_character(),
                           Inpatient = col_integer(),
                           District = col_character(),
                           SubDistrict = col_character(),
                           fftCategory = col_character()
                      ))
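If only one column is guessed wrongly, you may not need to paste the whole specification. The sketch below is not from the spreadsheet above: it writes a small made-up file to a temporary path (the file, its two columns and the names demo, demoFile, fixedOne and guessedMore are all hypothetical) and tries two options that readr offers, namely naming just the problem column in cols() and raising guess_max so the parser looks at more rows. Exactly what gets guessed by default varies between readr versions, so treat this as an illustration rather than a guarantee.

```r
library(readr)

# Hypothetical demo data: Inpatient is missing for the first 1500 rows,
# so a guess based on the first 1000 rows never sees an actual number.
demo <- data.frame(
  TeamN = rep("A", 2000),
  Inpatient = c(rep(NA, 1500), rep(1L, 500))
)
demoFile <- tempfile(fileext = ".csv")
write_csv(demo, demoFile)

# Option 1: specify only the column you care about; the rest are still guessed
fixedOne <- read_csv(demoFile, col_types = cols(Inpatient = col_integer()))

# Option 2: let the parser look at more rows before it guesses
guessedMore <- read_csv(demoFile, guess_max = 2000)

# Check what each approach produced
class(fixedOne$Inpatient)
class(guessedMore$Inpatient)
```

The full pasted specification is still the safer choice for a file you read regularly, because it will complain if a column changes type later on.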

If you’re still having problems, you can have a look using problems(testSpec).
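To give a feel for what problems() reports, here is a minimal sketch using a made-up three-row file (badFile, badSpec and the data are hypothetical, not part of masterTest.csv). One value in the Contacts column cannot be parsed as an integer, so readr turns it into NA and records the failure.

```r
library(readr)

# Hypothetical file with one value that is not an integer
badFile <- tempfile(fileext = ".csv")
writeLines(c("TeamN,Contacts", "A,12", "B,7", "C,unknown"), badFile)

badSpec <- read_csv(
  badFile,
  col_types = cols(
    TeamN = col_character(),
    Contacts = col_integer()
  )
)

# One row per value that could not be parsed as the requested type
problems(badSpec)

# The offending value has been replaced with NA in the data itself
badSpec$Contacts
```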

Absolute pure genius. The more I use the tidyverse, the more I know about it, and the more I know about it, the more I love it.

Analysing runs from the Polar Flow web service

Well, we’re still in New Year’s resolutions territory, so what better time to have a look at using R to analyse data collected from a run? For this analysis I have used the Polar Flow web service to download two attempts at the same Parkrun, recorded on a Polar M600 (which I love, by the way, if you’re looking for a running/smartwatch recommendation).

The background to the analysis is that in the second of the two runs I thought I was doing really well and was going to crush my PB. It ended up being exactly the same as the previous run in terms of total time taken, but with my heart rate a lot lower.

But I didn’t really feel like I wasn’t pushing myself hard enough, so I can’t really explain why my heart rate has dropped so much without a corresponding increase in performance. One possible explanation is that I have moved from being bottlenecked by the performance of my cardiovascular system to being bottlenecked by the performance of my legs, and that these two bottlenecks are very similar in terms of where they cap my pace.

It was pretty fun having a look in R. Here’s a link to the analysis as it stands.

I thought I would look at my race strategy in terms of how fast I went at each point, reasoning that maybe I let myself down on the hills or the straights or something in the second attempt. However, as you can see the pace is absolutely identical the whole way in both runs. The heart rate, as you can see, is consistently lower in the second run, and it only creeps up at the end for the sprint finish (which makes me wonder if I really was pushing myself hard enough).

I need to do more analysis. My next idea is to look at the relationship between incline, heart rate, and pace (the route is pretty hilly so this is quite important).