Recoding to NA with dplyr

Just very quickly again, still horribly busy, but this has been annoying me for ages and I finally figured it out. When you’re using recode() from dplyr, you can’t recode to NA if you’re recoding other things to strings, because a bare NA is logical and recode() complains that the types aren’t compatible. So don’t do this:


  mutate(Relationship = recode(Relationship, "Co-Habiting" = "Co",
    "Divorced" = "Di", "Married" = "Ma", "Civil partnership" = "CI",
    "Single" = "Si", "Widowed" = "Wi", "Separated" = "Se",
    "Prefer not to say" = NA))

Use na_if() instead:


  mutate(Relationship = recode(Relationship, "Co-Habiting" = "Co", 
    "Divorced" = "Di", "Married" = "Ma", "Single" = "Si", 
    "Widowed" = "Wi", "Separated" = "Se", "Civil partnership" = "CI")) %>%
  mutate(Relationship = na_if(Relationship, "Prefer not to say"))
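
Here’s the whole pattern as a minimal, self-contained sketch (the data frame and its values are made up):


library(dplyr)

# made-up example data
relationshipData = tibble(
  Relationship = c("Married", "Single", "Prefer not to say"))

relationshipData %>%
  # recode() leaves anything it doesn't match ("Prefer not to say") alone
  mutate(Relationship = recode(Relationship, "Married" = "Ma",
                               "Single" = "Si")) %>%
  # na_if() then turns the remaining value into a proper NA
  mutate(Relationship = na_if(Relationship, "Prefer not to say"))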

Checking memory usage in R

Wow, I cannot believe I didn’t blog this. I just tried to find it on my own blog and I must have forgotten. Anyway, it’s an answer I gave on Stack Overflow a little while ago, to do with measuring how much memory your R objects are using.


install.packages("pryr")

library(pryr)

object_size(1:10)
## 88 B

object_size(mean)
## 832 B

object_size(mtcars)
## 6.74 kB

This is from Hadley Wickham’s Advanced R.
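
While you’re in pryr, mem_used() and mem_change() are worth knowing about too; a quick sketch (the numbers you get back will obviously vary from session to session):


library(pryr)

mem_used()               # total memory used by R objects in this session

mem_change(x <- 1:1e6)   # memory added by allocating a biggish vector
mem_change(rm(x))        # memory freed by removing it again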

Convert icons to different colours

I’m horribly, horribly busy (stay tuned until mid May-ish to hear about what I’ve been up to) but I’ve just done something so quickly and easily that I couldn’t resist sharing it.

So I’ve downloaded a little man icon, which I’m going to use in a pictogram to show the percentage of people who agree with something. Just a bit of fun for a presentation. The man is black, but I thought it would be nice to have green men saying happy things (“I felt supported”, that kind of thing) and red men saying unhappy things (“I felt isolated”, for example).

I was a bit worried it would eat up too much time, but then I read this article about ImageMagick and typed this into my Linux terminal:


convert blackicon.png -fuzz 40% -fill green -opaque black greenicon.png

Boom. Finished. God bless you, Linux.
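
For the red men it should just be the same one-liner with a different fill colour; I haven’t run this exact line, but the pattern is identical:


convert blackicon.png -fuzz 40% -fill red -opaque black redicon.png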

Securing Shiny Server over SSL with Apache on Ubuntu

Well, I couldn’t find a guide for precisely this on the internet (securing your Shiny Server over SSL with Apache), but I did find the following links, which were very useful and which I only adapted slightly.

Do have a read of the following if you’re so inclined; they’re both very useful, but neither told me exactly what to do.

https://ipub.com/shiny-https/
https://support.rstudio.com/hc/en-us/articles/213733868-Running-Shiny-Server-with-a-Proxy

This guide is for Ubuntu but I imagine the configuration will work on any Apache server. You’ll just need to figure out what the equivalent terminal commands are for whatever OS you’re using.

Even though the changes I’ve made to the config files are quite small, I knew nothing about Apache configuration at all, so it’s been a bit of a trial. I’ll try to give a bit of explanation in case it helps you to know what you’re actually doing, and show the changes to the config files needed to make it all work.

I will start with a disclaimer that I am just a dumb guy on the internet. I make no claims as to exactly how secure this is, whether it’s possible to bypass it, any of that. If you’re really interested in security and you don’t know what you’re doing, I suggest you pay a security expert to help you. I am many things, but a security expert is not one of them. I suppose I should say also that this might not even be the “correct” way to configure Apache. There are lots of different rules of thumb and general guidelines as to what goes in which config file, and I’ve ignored some of them to make this simpler (it all still works either way; the conventions are just about organising things and having a generalisable approach). I’m not a sysadmin, so I’ve rather gone with what works than with what the industry standard is. Again, if that bothers you, you’d better pay someone cleverer than me to tell you.

With all that said, let’s begin. The principle of what you’re doing is pretty simple. When you set up your Shiny Server you set it listening on port 3838 by default, or some other port. So when people go to, say, chrisbeeley.net:3838, Shiny hears them and delivers whatever it is they ask for. You don’t want that: Shiny isn’t secure over port 3838. In the meantime, Apache is listening on port 80. Every time someone points a web browser at chrisbeeley.net it hollers on port 80, and Apache hears it and returns a web page. What you want is a secure connection with Apache, which then forwards requests on to Shiny.

What we’re going to do is very simple. We’re going to have Apache continue to listen on port 80. However, when someone does go to port 80 Apache will send them to 443, which is secure. Apache already serves my blog with WordPress on chrisbeeley.net/ and my website (which needs an update, I know *embarrassed emoji*) on chrisbeeley.net/website. We’re going to have Apache proxy for Shiny Server whenever anyone goes to chrisbeeley.net/shinyapps.

Whenever someone goes to /shinyapps, Apache will say to them “Oh, you want the other guy, sure”. Apache then shouts over to Shiny Server (on port 3838/shinyapps) “Hey! Shiny Server! This guy wants some graphs and whatnot!”. And Shiny Server hears, because they’re listening on 3838/shinyapps, and shouts back “Sure! Here you go!”. And back and forth they go as you click around the application.

We’re going to serve Shiny from 3838/shinyapps, rather than from plain 3838, because if you don’t, the index page of apps (which you can optionally display when someone navigates to plain old 3838, instead of 3838/your-app-here) doesn’t work properly: all the links to applications are 3838/the-application rather than 3838/shinyapps/the-application. For all I know there’s a better way of dealing with this problem, but this works fine and, as I already mentioned, I’m no sysadmin.
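
For reference, the matching bit of /etc/shiny-server/shiny-server.conf looks something like this (a sketch only: your site_dir and log_dir may well differ):


server {
  # Shiny Server carries on listening on 3838; Apache will proxy to it
  listen 3838;

  # serve the apps under /shinyapps so the proxied paths match
  location /shinyapps {
    site_dir /srv/shiny-server;
    log_dir /var/log/shiny-server;
    directory_index on;
  }
}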

The big difference here from one of the blog posts that helped me to do this is that in their case the whole Apache server was just proxying for Shiny Server: they weren’t serving any web pages with Apache, so every request Apache heard on any port could be passed straight to Shiny. I use my server as a web server as well, so I can’t do that.

So, to summarise, we’re going to have Apache listen on port 80. We’re going to redirect all requests to port 80 to port 443 (HTTPS). And we’re going to have Apache proxy for (i.e. shout over to) the Shiny Server on this port. We’ll close the 3838 port on the firewall. Shiny Server will still listen on this port, but nobody outside the server can get to it. Only Apache can, and it will shout instructions to Shiny Server on this port, receiving graphs and buttons back which it’ll show to the user.

The first job is to get Apache running. On Ubuntu you’re looking at:


sudo apt-get install apache2
sudo apt-get install -y build-essential libxml2-dev

Then run a2enmod:


sudo a2enmod

This brings up a prompt asking which modules you want to enable, to which you respond:


ssl proxy proxy_ajp proxy_http rewrite deflate headers proxy_balancer proxy_connect proxy_html
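
Equivalently, you can skip the prompt and pass the module names straight to a2enmod:


sudo a2enmod ssl proxy proxy_ajp proxy_http rewrite deflate headers proxy_balancer proxy_connect proxy_html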

That’s Apache taken care of, and the proxying modules set up.

Now you’ve done that, the next job is to get an SSL certificate for your site. This used to be difficult and/or cost money, but now it is/does neither thanks to the wonderful folks at Let’s Encrypt. Their site is incredibly easy to use, so I won’t bother telling you what to do: just shell into your server and follow the instructions. You will be asked if you wish to redirect all HTTP to HTTPS. Given that Google Chrome is going to start giving warning messages for all HTTP sites, not just ones with passwords/credit cards (https://developers.google.com/web/updates/2016/10/avoid-not-secure-warn), now is a good time to encrypt all your traffic. Let’s do that; the rest of this guide assumes that you do.
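
On Ubuntu with Apache, that boils down to something like the following (the exact package names vary between releases, so do check the current Let’s Encrypt/certbot instructions for your setup):


sudo apt-get install certbot python3-certbot-apache
sudo certbot --apache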

If you used to have a Shiny Server that listened on port 3838 and so had that port open, you can now close it. Unless, that is, you want to leave your users the option of plain HTTP alongside HTTPS; in that case, leave the port open. I won’t go into the firewall stuff here because there are so many ways of configuring firewalls, and if you press the wrong button you’ll end up blocking ssh into your server, and I don’t want to be responsible for that.

Now for the real magic. We’re going to set up a virtual host with Apache on port 443 which, whenever someone goes to chrisbeeley.net/shinyapps (over HTTP they’ll be automagically redirected to HTTPS), will act as a proxy server for Shiny Server.

Before we start, it’s worth saying that you may wish to back up your config files, which is very easy. The file we’re going to edit is /etc/apache2/sites-enabled/000-default-le-ssl.conf. To back it up, just:


sudo cp /etc/apache2/sites-enabled/000-default-le-ssl.conf /etc/apache2/sites-enabled/000-default-le-ssl.conf_BACKUP

That way, if you mess everything up, just:


sudo mv /etc/apache2/sites-enabled/000-default-le-ssl.conf_BACKUP /etc/apache2/sites-enabled/000-default-le-ssl.conf

And everything is back the way it started. Phew!

Right, on to business.


sudo nano /etc/apache2/sites-enabled/000-default-le-ssl.conf

This will bring up your virtual host definition for port 443.

There’ll already be loads of stuff at the top about port 443. Leave that alone; it knows what it’s doing. We’re going to add our lines at the bottom, just before the closing </VirtualHost> tag:

# ...the existing stuff at the top, which we leave alone...

ProxyPreserveHost On
ProxyPass /shinyapps http://0.0.0.0:3838/shinyapps
ProxyPassReverse /shinyapps http://0.0.0.0:3838/shinyapps
ServerName localhost

</VirtualHost>

As you can see, we route everything for /shinyapps to port 3838/shinyapps.
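
Save the file, then restart Apache so it picks up the new configuration (on older Ubuntus it’s sudo service apache2 restart):


sudo systemctl restart apache2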

You’re done! All HTTP traffic is now routed to HTTPS. If it’s /shinyapps, it goes to Shiny Server. If not, it goes to Apache for a normal web request.

Well done, have a cup of tea to celebrate.

Scraping the RStudio webinar list

I only just found this list of RStudio webinars; there’s loads of stuff on there and I really need to plough through a lot of it. What I really wanted was a list of them with links that I could archive, edit, and rearrange, so I could show which ones I’m interested in, which I’ve already watched, and so on.

Well, if you’ve got a problem, and no-one else can help, then maybe you need… The R Team.

Anyway, that’s enough nostalgia. So all we need is the mighty rvest package and just a little sprinkling of paste0() and we’re away.

Oh yes, and you’ll also need SelectorGadget, which is described brilliantly in the SelectorGadget vignette.

Once you’ve got all that, the code writes itself. The only wrinkle I ironed out was that some of the HTML paths were relative, not absolute, so I paste http://blah on the front of those ones, as you’ll see.


library(rvest)

rstudio = read_html("https://www.rstudio.com/resources/webinars/")

linkText = rstudio %>%
  html_nodes('.toggle-content a') %>%
  html_text()

linkURL = rstudio %>%
  html_nodes(".toggle-content a") %>%
  html_attr("href")

linkURL[substr(linkURL, 1, 4) != "http"] = 
  paste0("https://www.rstudio.com", 
         linkURL[substr(linkURL, 1, 4) != "http"])

cat(paste0("<a href = \"", linkURL, "\">", 
           linkText, "</a><br>"), file = "webinar.html")

Done! Now all I did was open the resulting file and paste it into Evernote, which kept the links and text together, as you’d expect, and I can now cut and paste and mark up to my heart’s desire.
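
If you’d rather keep the list in R to sort and filter there, the same two vectors drop straight into a data frame; a quick sketch:


library(tibble)

webinars = tibble(title = linkText, url = linkURL)
webinars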

I love it when a plan comes together.

One weird trick to getting column types right with read_csv

Using read_csv from the tidyverse is so easy that I didn’t bother to look at the readr documentation for a long time. However, I’m glad I did, because there is, as they say in the clickbait world, one weird trick to get your column types right with read_csv. read_csv (or the other delimited file reading functions, like read_tsv) does a brilliant job of guessing what type each column is, but by default it only looks at the first 1000 rows. Fine for most datasets, but I have more than one dataset where the first 1000 rows of a column are all missing, which doesn’t help the parser at all. So specify the column types manually and get it right. But what a pain, all that typing, right? Wrong. Just do this:


testSpec = read_csv("masterTest.csv")

And you’ll get this output automatically:


Parsed with column specification:
cols(
  TeamN = col_character(),
  Time = col_integer(),
  TeamC = col_double(),
  Division = col_integer(),
  Directorate = col_integer(),
  Contacts = col_integer(),
  HIS = col_character(),
  Inpatient = col_character(),
  District = col_character(),
  SubDistrict = col_character(),
  fftCategory = col_character()
)

You’re supposed to copy and paste that into a new call, putting right any mistakes. And in fact there is one in this very spreadsheet: the parser incorrectly guesses that Inpatient is character when it is in fact integer, because the first 1000 rows are missing.

So just copy all that into a new call and fix the mistake, like this:


testSpec = read_csv("masterTest.csv", 
                    col_types = 
                      cols(TeamN = col_character(),
                           Time = col_integer(),
                           TeamC = col_double(),
                           Division = col_integer(),
                           Directorate = col_integer(),
                           Contacts = col_integer(),
                           HIS = col_character(),
                           Inpatient = col_integer(),
                           District = col_character(),
                           SubDistrict = col_character(),
                           fftCategory = col_character()
                      ))

If you’re still having problems, you can have a look using problems(testSpec).
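
problems() returns the parsing failures as a data frame (one row per value that failed to parse), so you can filter and count it like anything else; a quick sketch, assuming dplyr is loaded:


library(readr)
library(dplyr)

parseProblems = problems(testSpec)

parseProblems %>%
  count(col, expected)   # which columns went wrong, and how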

Absolute pure genius. The more I use the tidyverse, the more I know about it, and the more I know about it, the more I love it.

Analysing runs from the Polar web flow service

Well, we’re still in New Year’s resolutions territory, so what better time to have a look at using R to analyse data collected from a run? For this analysis I have used the Polar Flow web service to download two attempts at the same parkrun, recorded on a Polar M600 (which I love, by the way, if you’re looking for a running/smartwatch recommendation).

The background to the analysis is that in the second of the two runs I thought I was doing really well and was going to crush my PB, yet it ended up being exactly the same as the previous run in terms of total time taken, but with my heart rate a lot lower.

But it didn’t feel like I was failing to push myself hard enough, so I can’t really explain why my heart rate dropped so much without a corresponding increase in performance. One possible explanation is that I have moved from being bottlenecked by the performance of my cardiovascular system to being bottlenecked by the performance of my legs, but that these two bottlenecks cap my pace at a very similar level.

It was pretty fun having a look in R. Here’s a link to the analysis as it stands.

I thought I would look at my race strategy in terms of how fast I went at each point, reasoning that maybe I let myself down on the hills or the straights or something in the second attempt. However, as you can see in the analysis, the pace is absolutely identical the whole way in both runs. The heart rate is consistently lower in the second run, and it only creeps up at the end for the sprint finish (which makes me wonder whether I really was pushing myself hard enough).

I need to do more analysis. My next idea is to look at the relationship between incline, heart rate, and pace (the route is pretty hilly so this is quite important).

New Year blog post

So my favourite productivity guru, David Allen, he of Getting Things Done fame, has a suggestion for a New Year review that you should carry out each year. His suggested structure can be found here.

I think this is an excellent idea so I have set myself a recurring reminder to do it each year, starting now. And I may as well blog it so I can find it again and anyone else who’s interested can have a read and maybe be inspired to do their own.

I had a bit of a rough year in 2017 and required surgery to remove my colon and spleen after a long battle with ulcerative colitis, so the 2017 bit is going to be a bit shorter than it will be next year. Still, I did achieve some stuff so let’s have a look.

The most visible thing I did this year was to make some videos about Shiny. There are two courses, and they’re available to buy (or you can watch them with a subscription) at the links Getting Started with Shiny and UI Development with Shiny. There are some angry people on Amazon reviewing my book who clearly wanted it to include more advanced applications, so I’ll warn you here: the videos don’t feature any highly complex applications either. They’re more for people who are starting out. I have another one coming out soon called Advanced Shiny, which I can’t link to yet because it doesn’t exist. Again, that’s only moderately advanced: interacting with databases, JavaScript, adding a password, stuff like that. So don’t buy them or subscribe to the video service hoping for some real high-level stuff, because it’s not in there. I’d hate for you to feel like you wasted your money.

Anyway, I also started doing a lot of work with Shiny at work, where it’s taking off in a big way; when I talk about 2018 I’ll be saying that I’ve got big plans there, with my role developing quite a lot based on what we’ve achieved already in 2017. I’ve also learned quite a lot about text analytics, with the hope of relaunching a Shiny application I wrote a few years back and currently maintain, adding a lot of tools to allow the user to explore text-based data. I’ll say more about that next year when I’ve actually done it.

I also learned how to implement an API in PHP, which is pretty easy really when you know how but it was cool to learn anyway.

I was in the right place at the right time, really, at work, in terms of having the experience with Shiny that I do and being able to help with this project that’s developing, so I’ve been quite lucky to be on board but I also give myself some credit for reading the runes and training myself up in Shiny to be ready to help with something like this. The skills I’ve acquired both in Shiny programming and in running Linux servers to run Shiny Server and relevant databases on are suddenly very valuable to my organisation, so I feel I called this one correctly. My next prediction is text analysis, I feel like if I can learn that and do it really well over the next five years then there could be opportunities there.

In terms of my personal life, really it feels like all I did was be very, very ill with severe inflammatory bowel disease and then have surgery to correct it. That consumed a lot of my energy, really. I’ve now recovered and I’m back doing 10 mile runs at the weekend, which is great. I’ve actually very tentatively started writing a book about my experiences being ill, which I’ll talk about more if I actually get anywhere with it next year.

So in 2018 I’ll be developing my role at work, and having much more input more widely across the organisation both in terms of Shiny, live analytics, dashboards and all that stuff, but also in terms of statistics and analysing real datasets from healthcare services, which is where I started out, really, finishing my psychology PhD in 2008 looking at routinely collected data in psychiatric hospitals with mixed effects models. And as I mentioned I’ll be improving one of my Shiny applications, adding a lot of tools to help users explore the text.

In my personal life I want to be a really good dad, since my kids got a bit of a raw deal with my being sick in 2017, and I’m going to run a marathon in less than four hours. I didn’t run for three years or so because of having liver disease and then bowel disease, so I really owe it to myself to get a nice marathon PB under my belt (current PB is 4:14). And I’m going to try to clear the decks a bit and get writing a book about my experiences being ill; a lot of people have told me that they think it would be good, so I’m going to have a go.

David Allen has some questions, I missed out the ones for the year just gone because it was such a strange year, but let’s look at some for 2018.

What would you like to be your biggest triumph in 2018?
If I can make this new job role work then I’ll be really pleased because it’s a big important step up for me and I think it will be really valuable to the Trust. And I really want to run a sub 4 marathon. If I can do those two things, I’m happy.

What is the major effort you are planning to improve your financial results in 2018?
I’ve actually got a savings account now, after being hopeless with money in the past. I’m really trying to save up for things instead of impulsively buying them, and with a bit of luck I can treat myself to a beautiful Entroware laptop in 2018.

What major indulgence are you willing to experience in 2018?
Now the kids are a bit older I’d love to go on a skiing holiday with them. I love snowboarding and haven’t been for ages.

What would you most like to change about yourself in 2018?
Definitely better organised and less forgetful. I’m really trying hard at the moment to put reminders in my phone for everything, work stuff, buying presents, stuff for my kids, everything.

What is one as yet undeveloped talent you are willing to explore in 2018?
I’m really going to have to get into big-picture mode at work. I’ve had my head down learning the stack for processing patient experience data, but in 2018 I need to work much more widely: with, say, finance data, or HR, or clinical outcomes. I’m really looking forward to getting my teeth into that.

What brings you the most joy and how are you going to do or have more of that in 2018?
Running. Lots and lots and lots of lovely running. I’m making time to do that, already am.

Who or what, other than yourself, are you most committed to loving and serving in 2018?
My kids deserve a really good dad who’s not ill all the time, and that’s exactly what they’re going to get.

What one word would you like to have as your theme in 2018?
Health. Beautiful, glorious, wonderful health.

Gathering data (wide to long) in the tidyverse

I seem to have a pretty severe mental block about the gather() function from tidyr, so this is yet another post that, to be honest, is basically for me to refer to in six months when I forget all this stuff. So I’m going to address the mental block I have very specifically and show some code; hopefully it will help someone else out there.

So whenever I use gather I put the whole dataframe in. Say I’ve got ten variables. I whack the whole dataframe in and try to pull out just the ones I want using the select() notation at the end of the list of arguments. This DOES NOT MAKE ANY SENSE. You can’t do this:


library(tidyverse)

testData = tibble(ID = 1:10, Q1 = runif(10),
  Q2 = runif(10),
  Q3 = runif(10),
  Q4 = runif(10),
  Q5 = runif(10))

gather(testData, key = Question, value = Score, Q1, Q2, Q3)

This does not do what I want! I don’t know why I think it does! What do I think is going to happen to the Q4 and Q5 columns? They’re just going to magically go away?

I DON’T KNOW WHY I’M SO BAD AT THIS.

It gathers the columns you name, sure, but it keeps everything else as id columns, so Q4 and Q5 get duplicated down all 30 rows and you just end up with a huge mess. The other thing to say, and I have started to get the hang of this, but just in case: THE KEY AND VALUE ARGUMENTS YOU JUST MAKE UP. THEY ARE *NOT* RELATED TO THE NAMES OF THE DATAFRAME AT ALL.
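
Just to prove that point, here’s the same call again with names I made up on the spot; gather() is perfectly happy:


# identical to the call above, just with silly key/value names
gather(testData, key = Banana, value = Cheese, Q1, Q2, Q3)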

What you actually do is select JUST THE VARIABLES YOU WANT, and then decide whether there are any variables you want to keep but not gather. So as a concrete example, let’s say you want to gather Q1 – Q3 and keep the ID column. You want to put the ID column in, but you don’t want to GATHER it. So you put it in the select statement, but use -ID in the gather statement:


testData %>%
  select(ID : Q3) %>%
  gather(key = Question, value = Score, -ID)

# A tibble: 30 x 3
      ID Question      Score
   <int>    <chr>      <dbl>
 1     1       Q1 0.26001265
 2     2       Q1 0.34674771
 3     3       Q1 0.43080742
 4     4       Q1 0.28397929
 5     5       Q1 0.14545496
 6     6       Q1 0.63496928
 7     7       Q1 0.78777785
 8     8       Q1 0.44622476
 9     9       Q1 0.86785324
10    10       Q1 0.02611436
# ... with 20 more rows

Or if you don’t want the ID column (it’s not doing anything useful in this particular, made-up, case):


testData %>%
  select(Q1 : Q3) %>%
  gather(key = Question, value = Score, Q1 : Q3)

# A tibble: 30 x 2
   Question      Score
      <chr>      <dbl>
 1       Q1 0.26001265
 2       Q1 0.34674771
 3       Q1 0.43080742
 4       Q1 0.28397929
 5       Q1 0.14545496
 6       Q1 0.63496928
 7       Q1 0.78777785
 8       Q1 0.44622476
 9       Q1 0.86785324
10       Q1 0.02611436
# ... with 20 more rows

Note that by default it will gather ALL the variables anyway, so this is totally equivalent to:


testData %>%
  select(Q1 : Q3) %>%
  gather(key = Question, value = Score)

That’s it! As I said at the beginning of the post, I have no idea why I have such a ridiculous mental block about it; it’s all in the documentation, I just get the column references and the minus notation and all that stuff mixed up (I think partly because using -ID KEEPS the ID variable, it just doesn’t GATHER it). It’s my fault for being an idiot, but the next time I get stuck I’ll read this and understand clearly 🙂

Oh yes, last thing: Q1 : Q3 is just “from Q1 to Q3”, meaning Q1, Q2, and Q3, and Q3 : Q5 would be Q3, Q4, Q5, etc. There are lots of ways to select the variables. See more at ?gather and ?select (which uses the same variable name rules).

One neat trick is num_range(), which is a shortcut for selecting ranges of things like Q1, Q2, Q3 or X1, X2, X3 and so on. You just give the prefix and the numbers you want:


testData %>%
  select(num_range("Q", 1:3)) %>%
  gather(key = Question, value = Score)

Right, I’ll stop now, this post is getting too long.

Analysis tools for Manager Joe

I’m using someone else’s data today. It’s absolutely hideously laid out. I could munge it into R but it would take absolutely ages and I’m just not doing enough with it for that to be worth doing.

So I need to have a look at about 30 survey questions using the tools available to the average Manager Joe: a spreadsheet and the “graph” button.

It’s a real eye opener. Everything takes ages, for one thing, and everything is so janky that I’m not even really sure if I’m drawing the right conclusions. The most worrying thing is that the effort involved is so high that I’m losing my curiosity: I’m just trying to get it done, churning out all this rubbish, giving it a quick eyeball and cracking on.

Why does that seem so familiar? Oh yes, that’s what I’ve always assumed people have done when I read their reports. It’s a big problem, we all know it is: data is too difficult to make sense of, so people do it quickly, and wrongly. We all know this, but now I’m living it, and I have renewed purpose to make all MY data applications beautifully easy to use. Stay tuned…

[… time passes]

I’ve come back to this post. It’s no good. I can’t do it. I’m munging the data into R, even if it will take a little while. It just goes to show, it’s really hard to get away with not doing it properly.