Analysis tools for Manager Joe

I’m using someone else’s data today. It’s absolutely hideously laid out. I could munge it into R but it would take absolutely ages and I’m just not doing enough with it for that to be worth doing.

So I need to have a look at about 30 survey questions using the tools available to the Average Manager Joe: a spreadsheet and the “graph” button.

It’s a real eye opener. Everything takes ages, for one thing, and everything is so janky that I’m not even really sure if I’m drawing the right conclusion. I think the most worrying thing is that the effort involved is so high that I’m losing my curiosity; I’m just trying to get it done. I’m just churning out all this rubbish, giving it a quick eyeball and crashing on.

Why does that seem so familiar? Oh yes, that’s what I’ve always assumed people have done when I read their reports. It’s a big problem, and we all know it is: data is too difficult to make sense of, so people do it quickly, and wrongly. We all know this. But I’m living it right now. And I have renewed purpose to make all MY data applications beautifully easy to use. Stay tuned…

[… time passes]

I’ve come back to this post. It’s no good. I can’t do it. I’m munging the data into R, even if it will take a little while. It just goes to show, it’s really hard to get away with not doing it properly.

Failure to produce pdf with RMarkdown tidyverse

I’m using the tidyverse for everything now, as I’ve mentioned in previous posts. When I want a cup of tea I just run:

house %>%
  filter(kitchen == 1) %>%
  select(tea, kettle)

I just ran the following code in a vanilla RStudio setup with pdflatex installed:

---
title: "Test document"
author: "Chris Beeley"
date: "20 December 2017"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(tidyverse)

# let's do some data stuff here...
```

This is the code that you get if you set up an RMarkdown project in RStudio and select “compile to LaTeX”, and you want to do some data stuff with the tidyverse package.

And it produced the following error message:

! Package inputenc Error: Unicode char √ (U+221A)
(inputenc) not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type H for immediate help.

l.145 \end{verbatim}

Try running pandoc with --latex-engine=xelatex.
pandoc: Error producing PDF
Error: pandoc document conversion failed with error 43
Execution halted

I was a bit confused by this for quite a while. The answer, of course, turns out to be the lovely messages, complete with little ticks, which the tidyverse produces on loading.

With the default message = TRUE behaviour in the code chunk, those little ticks end up in the LaTeX that pandoc generates, and pdflatex evidently can’t handle that Unicode character.

So the document fails, and it’s hard to understand why until you knit to HTML and see the little ticks.

Changing the knitr::opts_chunk$set(echo = TRUE) line to knitr::opts_chunk$set(echo = TRUE, message = FALSE) fixes the problem.

I can’t help but think that this is a rare example of R getting harder to use. When I started with R 10 years ago it was much more difficult to do even simple things like load a csv file or work with dates. These days there are lots of lovely packages to help, and of course RStudio itself makes using R much more intuitive. But this is going to confuse newbies, I think, which is a bit of a shame.

There are several obvious fixes and I won’t bother to list them all. Making message = FALSE the default in RMarkdown documents in RStudio seems like the best one, but maybe there’s some reason they don’t want to do that.
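For what it’s worth, here is a rough sketch of one of the other fixes: silencing only the loading messages rather than every message in the document. It assumes the ticks arrive as package startup messages, which is what base R’s suppressPackageStartupMessages() is designed to catch. The chatty_load() function is a made-up stand-in so the effect can be seen without the tidyverse installed.

```r
# Sketch: silence only the startup banner, leaving other messages alone.
# Guarded so it also runs on a machine without the tidyverse installed.
if (requireNamespace("tidyverse", quietly = TRUE)) {
  suppressPackageStartupMessages(library(tidyverse))
}

# Hypothetical stand-in for a chatty package (not a real function):
# it emits a startup-style message containing a tick, then returns a value
chatty_load <- function() {
  packageStartupMessage("\u2714 pretend package attached")
  "loaded"
}

# The message is swallowed, the return value comes through as normal
status <- suppressPackageStartupMessages(chatty_load())
```

In an RMarkdown document the library() line would go in whichever chunk loads your packages, and the global message = TRUE default can then be left alone.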

Font size of code in .Rpres presentations

I don’t know if I even knew about the .Rpres presentation feature in RStudio v. 0.98 and above. As I think I mentioned I’ve been rather ill for the last couple of years and I’m afraid I kind of fell out of touch with things a bit. Anyway, I’m all better now and I’m going to be giving a talk at the R User Group in Nottingham (which I love profoundly) so I thought I’d do it this new sexy way.

It seems pretty handy. I haven’t made the whole presentation yet so I’m sure there’s more to come, but the first thing is, dang! The code in an echo = TRUE chunk is really large! I can’t fit any output on the page!

So I found this guide to making it smaller, and lots of other nice tweaks, too.

Better Git commit messages

Something else I’m trying to be better at is using Git. I did use it, briefly, a few years back but I never quite got the hang of it and I’ve reverted to the bad habit of having MainCode/ and TestingCode/ and TryNewFunction/ folders filled with near identical code.

So I’m back on the Git wagon again. Atom (see my previous blog post) has beautiful Git integration, as you’d expect since it was built by the GitHub people. It also enforces a couple of conventions for writing Git commit messages, which inspired a Google search, which led me to this: a guide to writing better commit messages.

I never even thought about the art of it, but, of course, like code comments, good commit messages are essential for collaborating with anyone, even your future self.

Ellen Townsend: Small talk saves lives — IMH Blog (Nottingham)

It sounds much too simple doesn’t it? Making small talk could save a life. But the truth is, it really could. Today SHRG is supporting the campaign launched by the Samaritans. They are asking us all to be courageous and strike up a conversation with someone if we are worried about them at a railway […]


Filtering data straight into a plot with tidyverse

I’m still trying to go full tidyverse, as I believe I mentioned a while back. It’s clearly a highly useful approach, but on top of this I see a load of code in blogs and tutorials that uses a tidy approach. So unless I learn it I’m not going to have a lot of luck reading it. I saw somebody do the following a little while back and I really like it so I thought I’d share it.

In days gone by I would draw lots of graphs in an RMarkdown document like this:

firstFilteredDataset = subset(wholeData, 
  Date > as.Date("2017-04-01"))

ggplot(firstFilteredDataset, 
  aes(x = X1, y = y)) + geom_... etc.

secondFilteredDataset = subset(wholeData, 
  Date > as.Date("2015-01-01"))

ggplot(secondFilteredDataset, 
  aes(x = X1, y = y)) + geom_... etc.

thirdFilteredDataset = ... etc.

It’s fine, there’s nothing wrong with doing that, really. The two drawbacks are firstly that the code looks a bit ungainly, creating lots of objects that are used once and then forgotten about, and secondly that it fills your RAM with data. Not really a problem on my main box, which has 16GB of RAM, but it’s a bad habit and you may come unstuck somewhere else where RAM is more limited, for example when you’re running code on a server.

So I saw some code on the internet the other day where they just piped data straight from a dplyr filter statement into a ggplot call. No muss, no fuss: the data is defined in the same pipeline in which it’s used, and you’re not making loads of objects and leaving them lying around. So here’s an example:


library(tidyverse)

mpg %>% 
  filter(cyl == 4) %>%
  group_by(drv) %>%
  summarise(highwayMilesPG = mean(hwy)) %>%
  ggplot(aes(x = drv, y = highwayMilesPG)) +
  geom_bar(stat = "identity")

There’s only one word for it: it’s tidy! I like it!

One editor to rule them all- Atom

I’m very happy using RStudio for all my R code. It goes without saying that the support for R coding built into RStudio is phenomenal. If you don’t know loads of cool stuff RStudio does, you’re missing out, but that’s a blog post on its own.

I’ve never quite been happy with my choice for other general editing, though. Sometimes I write PHP, HTML, markdown, Python, or something else, and I’ve never really found an editor that I love. Geany is pretty good and that’s what I have been using when I write PHP or HTML. I tended to write markdown in RStudio, which is kind of stupid, since RStudio is an awfully big hammer to crack that nut, but it does support markdown and I’m familiar with RStudio, so I was happy enough doing that. I never really found a Python IDE that I loved. As far as I can tell there isn’t really an RStudio equivalent in the Python world, something so well featured and brilliant that it’s really the only choice unless you have a very particular reason to use something else.

So about a year ago I gave Atom a try. It had been out of beta for about a year by that point. I don’t really remember it clearly now but it seemed a bit clunky and I just rapidly gave up (to be fair, this may have just been me being thick, I’ve no idea how much it has really improved since). It just didn’t grab me. I keep seeing it mentioned everywhere and I thought I would give it another go.

This time I was hooked straight away. It’s described as a “hackable editor for the 21st Century” and that’s the real strength of it. The actual interface is very clean and simple, no bells and whistles, but it comes bundled with some plugins and there is a thriving ecosystem of user contributed packages that can make Atom, it seems so far, anything you want it to be.

I think I love Atom for the same reason I love R. It has a big ecosystem of packages around it, and whatever problem you want to solve, as Apple almost said of the iPhone, “there’s a package for that”.

Your needs will be different from mine, of course, but I recommend you give Atom a try if you haven’t already. It supports Markdown preview out of the box. So far I have installed two packages: platformio-ide-terminal and script. Platformio-ide-terminal allows you to spawn a terminal underneath your code window, which I have been mainly using to run pandoc on my markdown files. Script will run your code for you (sections, the whole thing, etc.), all with a shortcut key. So far I’ve been using that for testing Python scripts. Oh yes, and the markdown editor supports word completion out of the box too, not code, just normal words, which is more useful than it sounds.

While I’ve been Googling Atom to find the links to put in this post I have found two really cool things that I didn’t know about. Firstly, there is the Hydrogen package, which allows Jupyter-like functionality in Atom. If you don’t know what Jupyter is, you should find out, but essentially it allows you to weave together your code and output, just like you can with RMarkdown.

And secondly, Atom themselves have just released teletype, which is a tool that allows collaboration on code files right inside the Atom editor. I don’t really need to do that, not that I can think of anyway, but you have to admit it’s pretty awesome. They’ve solved a lot of the problems with code collaboration elsewhere as well. Have a look at the blog post for more details.

So go give Atom a try. I’ll try to post any more Atom-related awesomeness that I see on my travels.

Lazy tables with R and pander

One of the many things I love about R is how gloriously lazy it can help you to be. I’m writing a report at the moment and I need to make lots of tables in R Markdown. I need them to be proportions, expressed as a percentage, rounded to 0 decimal places, and I need to add (%) to each label on the table. That’s a lot of code when you’ve got 8 or 10 tables to draw, so I just made a function that does it. It takes two arguments, the variable you want tabulated, and the order in which you want the table. I need to specify the order manually because the default alphabetical ordering doesn’t work with all of the data that I want, as in the example here. Without ordering manually, the “Less than one year” category appears at the end.

Here’s a minimal example:


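library(pander)

niceTable = function(x, y) {
  # proportions of x as whole-number percentages, in the order given by y
  tempTable = round(
    prop.table(
      table(factor(x, levels = y))
    ) * 100, 0)
  names(tempTable) = paste0(names(tempTable), " (%)")
  pander(tempTable)
}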

a = c(rep("Less than one year", 3), rep("1 - 5 years", 4), 
    rep("5 - 10 years", 2))

niceTable(a, c("Less than one year", 
    "1 - 5 years", "5 - 10 years"))

Boom! Instant laziness. Unless my boss is reading, in which case it’s efficiency 🙂

Quarters and modulo arithmetic

This is another post that’s mainly for my benefit when I inevitably forget. I’m working with dates in PHP which, unlike MySQL, does not have a built-in quarter function for extracting the quarter from a date.

Even if it did, one would have to be very careful with it because quarters are actually defined differently in different countries. In the UK, where I am, April to June is the first quarter (and January to March the last). In other countries, including (I think) the US, the first quarter is January to March, and October to December the last.

With no quarter function, and the hypothetical threat of an unpredictable one, my next recourse is to divide the month by 3 in order to give me a quarter number. This is done very simply as:

echo ceil(date("m") / 3);

This will return 1 when the month is 1 to 3 (January to March), 2 when it’s 4 to 6 (April to June) etc. But as I mentioned before UK quarters don’t work like this. We don’t have quarters in a sequence 1, 2, 3, 4. They go in a sequence like 4, 1, 2, 3. Now I could write code that deals with each of those cases individually, converting 1 to 4, 2 to 1, 3 to 2, etc., but I thought it was better to do it properly (and store up generalisability for the future) by converting the 1, 2, 3, 4 sequence with modulo arithmetic. I confess, I still can’t work out how to do this other than by trial and error but it definitely involves modulo 4 because the sequence is of period 4.

To convert 1, 2, 3, 4 to 4, 1, 2, 3 one need only do the following, where x is the input value 1, 2, 3, 4:

4 - (5 - x) %% 4

As far as I can figure it, the first constant (4) sets the largest value in the output (4, 1, 2, 3 or 10, 7, 8, 9) and the second constant (5) sets where the sequence flips back to the start (4, 1, 2, 3 or 3, 4, 1, 2).
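A quick way to check all of that is to run the formula over the four inputs. The sketch below does it in R, where %% is the modulo operator and binds more tightly than the minus sign (in PHP the operator is % instead). The second form, (x + 2) %% 4 + 1, is my own rearrangement rather than something I found by trial and error: move to zero-based quarters, step back one, wrap round with modulo 4, then return to one-based. Adding 2 instead of subtracting 2 is the same thing modulo 4, and it keeps everything positive.

```r
# x is the calendar quarter: 1 = Jan-Mar, 2 = Apr-Jun, 3 = Jul-Sep, 4 = Oct-Dec
x <- 1:4

# The formula from the post; evaluated as 4 - ((5 - x) %% 4)
uk_quarter <- 4 - (5 - x) %% 4
print(uk_quarter)       # 4 1 2 3

# Equivalent rearrangement (an assumption of mine, not from the original code)
uk_quarter_alt <- (x + 2) %% 4 + 1
print(uk_quarter_alt)   # 4 1 2 3

# Changing the first constant changes the largest value in the output
bigger <- 10 - (5 - x) %% 4
print(bigger)           # 10 7 8 9

# Changing the second constant changes where the sequence wraps
shifted <- 4 - (6 - x) %% 4
print(shifted)          # 3 4 1 2
```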

That’s all I know, I’m really busy and don’t have time to think about it any more. Hopefully this will help someone/ me in two years’ time.

Consuming REST APIs with PHP and CURL

I wasted such a lot of time on this that I must commit it to the internet on the off chance that it helps someone else in the same situation.

If you are using PHP to consume a RESTful API via CURL and you want to manipulate the data you get back it’s very important that you set CURLOPT_RETURNTRANSFER to true. This allows you to collect the response from the server in a variable. If you don’t set this option it will just echo the return to the screen, which is obviously of no use whatsoever.

While I’m here I may as well mention that if you want json_decode to return an array you need to use json_decode($result, true); otherwise you get an object back. The final code I wrote looks like this:

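$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of echoing it
curl_setopt($ch, CURLOPT_URL, "http://YOUR_URL_HERE");
$result = curl_exec($ch); // response body as a string, or false on failure

if ($result === false) {
    die("cURL error: " . curl_error($ch)); // stop here rather than decoding nothing
}

$result = json_decode($result, true); // giving true to json_decode returns array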