Shiny Pro in a Windows/ NHS environment, part 1: The curse of Kerberos

Right, here goes. Let’s tell the story. I’ve been working on this for months, not because it’s particularly hard, but because I really didn’t know what I was doing. We are a health trust in the UK (NHS), and we’ve been using Shiny on a privately hosted Linode server (Ubuntu) in the cloud for a long time. You can see it here. We’ve done this (hosted it in the wild) because none of the data is private. It’s patient experience data. We want it to be public. If anything, we want people to steal it. To be honest it’s looking a bit tatty. I was proud of it in 2012, now not so much. I’m working on a new version.

Anyway, we now want to use Shiny with actual patient data. This means two things. Firstly, we have to host the data behind our firewall. And secondly, we need to authenticate users. We don’t want just anybody in the Trust getting access to it. I should say that in truth, the information governance requirements are fairly mild, since all the data is aggregated and/ or pseudonymised. There is no personal data in the application anywhere. Anyway, we want to authenticate users.

So we went out and bought Shiny Server Pro. As a keen user of R and Shiny, I’m very excited about this. I think it very likely that we are the first Trust in the country to buy a Pro licence. There’s some debate in the Trust about R’s place in our BI ecosystem, which I won’t go into here, but nonetheless, we’re doing it, now, and it’s very exciting for me.

So it was my job to make it all work. This means two things. Firstly, we need the server to authenticate against MS SQL Server so we can fetch the data nightly from a data warehouse. If you use passwords, obviously, it’s easy, but my Trust likes to be more secure than that so Kerberos it is. And secondly we need to use the LDAP features of Pro to authenticate users against the Active Directory user database which the Trust maintains. This post will just be about Kerberos. I’ll write up the LDAP stuff another day (and add a link here once I do). The actual server, in case you’re wondering, is Ubuntu 16.04 running as a virtual machine on the Trust’s completely Windows setup.

Now, if you’re as ignorant as I was at the start it will help you to understand what Kerberos is. Basically, it’s a little program that lives on the Trust’s servers and it issues tickets if you’re allowed to do something. So all you need to do is install a client program which goes to Kerberos and says “Hello, I’m so and so, and my password is such and such, can I have a ticket please?” and Kerberos looks it up and if the server’s name is, as they say, “down”, it gives the server a ticket. This ticket lasts for a certain period of time (I think the Kerberos server gets to configure this) and it will let the client server do all the stuff on the ticket. So I need a ticket that says “let this guy into this bit of the data warehouse”, and Kerberos gives me a ticket, and then I can go to SQL Server, show the ticket, and be let in.

I’m going to do my best to explain how to configure it all. I’m no expert (obviously!) so I will probably miss bits out but I personally would have found a blog post like this that gave a practical sequence of what to do useful when I was learning, so I hope you will find it useful. And if you don’t, as usual it will help me in 6 months when I forget it all :-).

First things first, note that the clocks on the client and server must match. Mine did, so I didn’t worry about this. Right, let’s look first at where we’re trying to get to. Right at the top of my R code I want to run this:

odbcDriverConnect(connection =
  "Driver={SQLServer};
  server=serverName;
  database=dbName;
  trusted_connection=yes;")

This will run fine if you have the right Kerberos ticket, and not at all if you don’t. Next let’s make sure we’ve got some ODBC drivers. As a user of Shiny Server Pro, I get access to RStudio’s paid drivers, so I used them. Microsoft have some for download, too. I also get access to RStudio support, which I have to say is excellent. Install them. You’ll need to configure the name and file path in here: /etc/odbcinst.ini (the name of the driver in the odbcDriverConnect above is defined in this file).

Okay. So far so good. Now we’re getting to the more tricky parts. Next you’ll need a Kerberos client:

sudo apt-get install krb5-user

Ask your IT department about the location of their Kerberos server. You’ll need to put it in the [realms] section of /etc/krb5.conf, like so:

[realms]

EXAMPLE.ORG = {
kdc = example.org:88
}

Obviously replace example.org with whatever your IT department tell you. Note you put it in twice, the first time in capital letters. I’m not going to insult your intelligence by pretending I understand why that is.

And also here:

[libdefaults]

default_realm = EXAMPLE.ORG

Okay, getting there now. You’ll need a username and password from IT. To test Kerberos just run

kinit -p username@EXAMPLE.ORG

Again, the server name is in capitals. Again, I do not know why. You’ll be prompted for your password. Hopefully that should work. If it does, you now have a ticket. You can test your connection to SQL Server using isql (https://docs.oracle.com/cd/E88353_01/html/E37839/isql-1.html).

Wait, you’re still not done! I’m assuming, like me, you want to run the whole thing in a script, either because you want to run overnight as a cron job (as I did) or just because you don’t want to faff around with terminal commands every time you run an R script. Your credentials will expire, so we need to be able to get them again in the script. You can save your password in a script BUT DON’T. It’s bad practice, and I feel sure that your IT department will get cross. You need a keyfile. I’m not going to go through all of that, there’s a fair bit to it, but there is an excellent guide to using a keyfile here, go and follow it.

That’s it! You’ve done it! Now all you need to do is run the following at the terminal, replacing the /home/chrisbeeley bit with the name of your keyfile, and the scriptName.R with your script.

Before I go, there’s quite a useful page here, which goes into a bit more detail than I have https://www.easysoft.com/products/data_access/odbc-sql-server-driver/kerberos.html.

This took me WEEKS, because I DID NOT KNOW WHAT I WAS DOING. It’s my sincere wish that it helps you if you want to do it. I’m more than happy to talk it through with anyone out there, as I mentioned I’m not an expert but I might have had a similar problem to you. Find me @ChrisBeeley on Twitter. And that goes double for anybody who works in the NHS. It’s one of my career goals to support the adoption of Linux/ R/ Shiny in the NHS and if I can help at all, I will.

Tune in next time for the LDAP story, which was even more hideously difficult (at least, for my tiny brain).

Passing strings as variables from a Shiny app into dplyr

I’ve been sort of waiting until I understood this thoroughly, and I was going to write a very detailed blog post about it, and although I do understand it a lot better now than I did, I’m still not at the point where I would write an authoritative blog post about it confidently because there are too many things that I don’t understand.

However, I’m aware that there are people out there right now who are trying to write dplyr code that takes strings as variable inputs, passed in from a Shiny interface, and I know how difficult it is to Google, because I Googled it myself. So I’m going to at least show you what to do, and talk a bit about it, and then maybe I’ll come back to it when I’ve understood it better myself.

So you quite often have a combo box in Shiny applications that returns a variable name, and you want to ask dplyr to filter/ select/ whatever on that name. So you have the variable name in the form of a string, and you want to pass it into dplyr. The old way, which still works, but is now deprecated, uses the filter_ function which evaluates a whole string as if it were dplyr code. So, for example, you might do this:

library(tidyverse)

input = data.frame("variable" = "cty")

mpg %>%
  filter_(paste(input$variable, " > 18"))

input = data.frame("variable" = "hwy")

mpg %>%
  filter_(paste(input$variable, " > 18"))

I’ve given the dataframe the name “input” to make it look like a Shiny application. So imagine your user clicks on “cty” in your app, which makes input$variable equal to “cty”. Now you just paste that together with a filter condition (“> 18”) and pass the whole thing to filter_(). Now your user perhaps sets the variable to “hwy”, and the calculation can be done with the new value.

So what’s the new way? Well, as I mentioned I don’t understand it deeply, so read the following at your own risk, but essentially what you’re doing is described in this blog post about dplyr programming. You need to quote the variable names which sort of makes them variable-y (told you I didn’t really understand) and then you need to unquote them. Normally you would be doing this:


input = quo(cty)

mpg %>%
  filter(!!(input) > 18)

input = quo(hwy)

mpg %>%
  filter(!!(input) > 18)

So (I think!) you’re saying to R- hey, listen, R, input is cty as a variable name. Wherever you see input, pretend it’s cty but as a variable name. This is called quoting. Then you have to unquote, using !!(). So you’re saying- that thing I mentioned about input being cty quoted? Well, I want it now, unquote input (using !!()) and then use that variable name in the following.

So you’re basically doing that except! Except you’re quoting a string, not a bare variable name. To quote strings, use sym. Other than that, it’s the same. So do this:


input = sym("cty")

mpg %>%
  filter(!!(input) > 18)

input = sym("hwy")

mpg %>%
  filter(!!(input) > 18)

So now you can pass variable names from comboboxes in Shiny applications using dplyr.

I’m sorry I don’t have the detail as nailed as I would like, I wouldn’t advice sitting any exams after reading this, but it should get you coding, at least, which is the first step. If the fog ever truly clears for me I promise I’ll come back and write more.

Selectively hide text with JavaScript in RMarkdown

I guess this is one of those where I kind of did know it was possible, really. If I’d thought it through. But I’ve always not been sure how easy it would be and I’ve been in too much of a rush.

So I’m trying to move my organisation onto HTML instead of Word. HTML is easier to output and parse, and it’s interactive with a bit of JavaScript. I used the DT package in R to put some tables into an RMarkdown document and of course that’s very nice because they’re pageable, orderable, and searchable out of the box.

But there’s often a lot of text that I’m not sure everyone will be interested in. We collect a lot of comments and some people want every single one. Some people just want the tables. So it would be nice to be able to selectively show and hide the text in each section.

And I did know this, really. But RMarkdown documents accept pure HTML. And pure JavaScript. So it’s embarrassingly easy. Ridiculously easy. Here’s one. I half stole it from w3chools.com


---
title: "JavaScript test"
author: "Chris Beeley"
date: "11 May 2018"
output: html_document
---

<script>
function myFunction() {
    var x = document.getElementById("myDIV");
    if (x.style.display === "none") {
        x.style.display = "block";
    } else {
        x.style.display = "none";
    }
}
</script>

<button onclick="myFunction()">Show/ hide</button>

<div id="myDIV">

Some comments here.

There are lots so it's nice.

To be able to hide them.

</div>

That’s it. Boom. Done. If you’re not all doing that by the middle of next week then you need to be asking yourselves why.

Recoding to NA with dplyr

Just very quickly again, still horribly busy, but this has been annoying me for ages and I finally figured it out. When you’re using recode from dplyr, you can’t recode to NA if you’re recoding to other things that aren’t NA, because it complains that the types aren’t compatible. So don’t do this:


  mutate(Relationship = recode(Relationship, "Co-Habiting" = "Co", 
    "Divorced" = "Di", "Married" = "Ma", "Civil partnership" = "CI"
    "Single" = "Si", "Widowed" = "Wi", "Separated" = "Se", 
    "Prefer not to say" = NA))

Use na_if() instead:


  mutate(Relationship = recode(Relationship, "Co-Habiting" = "Co", 
    "Divorced" = "Di", "Married" = "Ma", "Single" = "Si", 
    "Widowed" = "Wi", "Separated" = "Se", "Civil partnership" = "CI")) %>%
  mutate(Relationship = na_if(Relationship, "Prefer not to say", NA))

Checking memory usage in R

Wow, I cannot believe I didn’t blog this. I just tried to find it on my own blog and I must have forgotten. Anyway, it’s an answer I gave on Stack Overflow a little while ago, to do with measuring how much memory your R objects are using.


install.packages("pryr")

library(pryr)

object_size(1:10)
## 88 B

object_size(mean)
## 832 B

object_size(mtcars)
## 6.74 kB

This is from Hadley Wickham’s advanced R.

Convert icons to different colours

I’m horribly, horribly busy (stay tuned until mid May-ish to hear about what I’ve been up to) but I’ve just done something so quickly and easily that I couldn’t resist sharing it.

So I’ve downloaded a little man icon, I’m going to use it in a pictogram to show the percentage of people who agree with something. Just a bit of fun for a presentation. The man is black, but I thought it would be nice to have green men who are saying happy things (“I felt supported”, that kind of thing) and red men who are saying unhappy things (“I felt isolated”, e.g.).

I was a bit worried it would eat up too much time. I read this article about Imagemagick. I typed this into my Linux terminal:


convert blackicon.png -fuzz 40% -fill green -opaque black greenicon.png

Boom. Finished. God bless you, Linux.

Securing Shiny Server over SSL with Apache on Ubuntu

Well, I couldn’t find a guide for precisely this on the Internet, securing your Shiny Server over SSL (with Apache), but I did find the following links which were very useful and which I only adapted slightly.

Do have a read of the following if you’re so inclined, they’re both very useful, but neither told me exactly what to do.

https://ipub.com/shiny-https/
https://support.rstudio.com/hc/en-us/articles/213733868-Running-Shiny-Server-with-a-Proxy

This guide is for Ubuntu but I imagine the configuration will work on any Apache server. You’ll just need to figure out what the equivalent terminal commands are for whatever OS you’re using.

Even though the changes I’ve made to the config files were quite small I really knew nothing about Apache configuration at all so it’s been a bit of a trial. I’ll try to give a bit of explanation in case it helps you to know what you’re actually doing and show the changes to the config files to make it all work.

I will start with a disclaimer that I am just a dumb guy on the internet. I make no claims as to exactly how secure this is, whether it’s possible to bypass it, any of that. If you’re really interested in security and you don’t know what you’re doing I suggest you pay a security expert to help you. I am many things but a security expert is not one of them. I suppose I should say also that this might not even be the “correct” way to configure Apache. There are lots of different rules of thumb and general guidelines as to what goes where (it still works the other way, it’s just a way of organising things and having a generalisable approach) and I’ve ignored some of them to make this simpler. I’m not a sysadmin so I’ve rather gone with what works rather than what the industry standard is. Again, if that bothers you you’d better pay someone cleverer than me to tell you.

With all that said, let’s begin. The principle of what you’re doing is pretty simple. When you set up your Shiny Server you set it listening on port 3838, by default, or some other port. So now, when people go to, say, chrisbeeley.net:3838 Shiny hears them and delivers whatever it is they ask for. In the meantime Apache is listening on port 80. Every time someone points a web browser at chrisbeeley.net it hollers on port 80 and Apache hears it, and returns a web page. You don’t want that. Shiny isn’t secure over port 3838. What you want is a secure connection with Apache, that then forwards a request on to Shiny.

What we’re going to do is very simple. We’re going to have Apache continue to listen on port 80. However, when someone does go to port 80 Apache will send them to 443, which is secure. Apache already serves my blog with WordPress on chrisbeeley.net/ and my website (which needs an update, I know *embarrassed emoji*) on chrisbeeley.net/website. We’re going to have Apache proxy for Shiny Server whenever anyone goes to chrisbeeley.net/shinyapps.

Whenever someone goes to /shinyapps, Apache will say to them “Oh, you want the other guy, sure”. Apache then shouts over to Shiny Server (on port 3838/shinyapps) “Hey! Shiny Server! This guy wants some graphs and whatnot!”. And Shiny Server hears, because they’re listening on 3838/shinyapps, and shouts back “Sure! Here you go!”. And back and forth they go as you click around the application.

We’re going to listen on port 3838/shinyapps, rather than just on plain 3838, because if you don’t, the index page of apps you can optionally display when someone navigates to plain old 3838, instead of 3838/your-app-here doesn’t work properly- all the links to applications are 3838/the-application rather than 3838/shinyapps/the-application. For all I know there’s a better way of dealing with this problem, but this works fine and, as I already mentioned, I’m no sysadmin.

The big difference here from one of the blog posts that helped me to do this is that in their case the whole Apache server was just proxying for Shiny Server. They weren’t listening on port 80 for HTML, because they weren’t serving HTML with Apache. Apache listened on all the ports. I use my server as a web server, so I can’t do that.

So, to summarise, we’re going to have Apache listen on port 80. We’re going to redirect all requests to port 80 to port 443 (HTTPS). And we’re going to have Apache *proxy for (i.e. shout over to) the Shiny Server on this port. We’ll close the 3838 port on the firewall. Shiny Server will still listen on this port, but nobody outside the server can get to it. Only Apache can, and they will shout instructions to Shiny Server on this port, receiving graphs and buttons back which they’ll show to the user.

The first job is get Apache running. On Ubuntu you’re looking at:


sudo apt-get install apache2
sudo apt-get install -y build-essential libxml2-dev

Then run a2enmod

This creates a dialog, to which you respond:

ssl proxy proxy_ajp proxy_http rewrite deflate headers proxy_balancer proxy_connect proxy_html

That’s Apache taken care of, and the proxying modules set up.

Now you’ve done that the next job is to get an SSL certificate for your site. This used to be difficult and/ or cost money but now it is/ does neither thanks to the wonderful folks at let’s encrypt. This site is incredibly easy to use, I won’t bother telling you what to do, just shell into your server and follow the instructions. You will be asked if you wish to direct all HTTP to HTTPS. Given that Google Chrome is going to start giving warning messages for *all HTTP sites, not just ones with passwords/ credit cards, now is a good time to encrypt all your traffic (https://developers.google.com/web/updates/2016/10/avoid-not-secure-warn). Let’s do that, the rest of this guide assumes that you do.

If you used to have a Shiny Server that listened on port 3838 and so had that port open, you can now close it. Unless you only want to give your users the option to use HTTPS, and leave the HTTP there for those who want it. You can do that if you want, leave the port open. I won’t go into the firewall stuff here because there are so many ways of configuring firewalls and if you press the wrong button you’ll end up blocking ssh into your server and I don’t want to be responsible for that.

Now for the real magic. We’re going to set up a Virtual host with Apache, on port 443, but only when someone goes to chrisbeeley.net/shinyapps (which, if they use HTTP, will be automagically redirected to HTTPS), which will then act as a proxy server for Shiny Server.

Before we start, it’s worth saying that you may wish to back up your config files. This is very easy. So the next file is /etc/apache2/sites-enabled/000-default-le-ssl.conf. To back it up, just:


sudo cp /etc/apache2/sites-enabled/000-default-le-ssl.conf /etc/apache2/sites-enabled/000-default-le-ssl.conf_BACKUP

That way, if you mess everything up, just:


sudo mv /etc/apache2/sites-enabled/000-default-le-ssl.conf_BACKUP /etc/apache2/sites-enabled/000-default-le-ssl.conf

And everything is back the way it started. Phew!

Right, on to business.


sudo nano /etc/apache2/sites-enabled/000-default-le-ssl.conf

This will bring up your host definition on port 443

There’ll already be loads of stuff at the top about port 443. Leave that alone. It knows what it’s doing. We’re going to add some stuff at the bottom.

Stuff at the top...

ProxyPreserveHost On
ProxyPass /shinyapps http://0.0.0.0:3838/shinyapps
ProxyPassReverse /shinyapps http://0.0.0.0:3838/shinyapps
ServerName localhost

</VirtualHost>

As you can see, we route everything for /shinyapps to port 3838/shinyapps.

You’ve done! All HTTP traffic is now routed to HTTPS. If it’s /shinyapps, it goes to Shiny Server. If not, it goes to Apache for a normal web request.

Well done, have a cup of tea to celebrate.

Scraping the RStudio webinar list

I only just found this list of RStudio webinars, there’s loads of stuff on there, I really need to plow through a lot of it. What I really wanted was a list of them with links that I could archive and edit and rearrange so I could show which ones I am interested in, which I’ve already watched, and so on.

Well, if you’ve got a problem, and no-one else can help, then maybe you need… The R Team.

Anyway, that’s enough nostalgia. So all we need is the mighty rvest package and just a little sprinkling of paste0() and we’re away.

Oh yes, and you’ll also need selector gadget, which is described brilliantly in this selector gadget vignette.

Once you’ve got all that, the code writes itself. The only wrinkle I ironed out was that some of the HTML paths were relative, not absolute, so I paste http://blah on the front of those ones, as you’ll see.


library(rvest)

rstudio = read_html("https://www.rstudio.com/resources/webinars/")

linkText = rstudio %>%
  html_nodes('.toggle-content a') %>%
  html_text()

linkURL = rstudio %>%
  html_nodes(".toggle-content a") %>%
  html_attr("href")

linkURL[substr(linkURL, 1, 4) != "http"] = 
  paste0("https://www.rstudio.com", 
         linkURL[substr(linkURL, 1, 4) != "http"])

cat(paste0("<a href = ", linkURL, ">", 
           linkText, "</a><br>"), file = "webinar.html")

Done! Now all I did was open the resulting file and paste it into Evernote, which kept the links and text together, as you’d expect, and I can now cut and paste and markup to my heart’s desire.

I love it when a plan comes together.

One weird trick to getting column types right with read_csv

Using read_csv from the tidyverse is so easy that I didn’t bother to look at the readr documentation for a long time. However, I’m glad I did, because there is, as they say in the click bait world, one weird trick to get your column types right with read_csv. read_csv (or the other delimited file reading functions like read_tsv) does a brilliant job guessing what column types things are but by default it only looks at 1000 rows. Fine for most datasets, but actually I have more than one dataset where the first 1000 rows are missing, which doesn’t help the parser at all. So do it manually and get it right. But what a pain, all that typing, right? Wrong. Just do this:


testSpec = read_csv("masterTest.csv")

And you’ll get this output automatically:


Parsed with column specification:
cols(
  TeamN = col_character(),
  Time = col_integer(),
  TeamC = col_double(),
  Division = col_integer(),
  Directorate = col_integer(),
  Contacts = col_integer(),
  HIS = col_character(),
  Inpatient = col_character(),
  District = col_character(),
  SubDistrict = col_character(),
  fftCategory = col_character()
)

You’re supposed to copy and paste that into a new call, putting right any mistakes. And in fact there is one, in this very spreadsheet, the parser incorrectly guesses that Inpatient is character when it is in fact integer- because the first 1000 rows are missing.

So just copy all that into a new call and fix the mistake, like this:


testSpec = read_csv("masterTest.csv", 
                    col_types = 
                      cols(TeamN = col_character(),
                           Time = col_integer(),
                           TeamC = col_double(),
                           Division = col_integer(),
                           Directorate = col_integer(),
                           Contacts = col_integer(),
                           HIS = col_character(),
                           Inpatient = col_integer(),
                           District = col_character(),
                           SubDistrict = col_character(),
                           fftCategory = col_character()
                      ))

If you’re still having problems, you can have a look using problems(testSpec).

Absolute pure genius. The more I use the tidyverse, the more I know about it, and the more I know about it, the more I love it.

Analysing runs from the Polar web flow service

Well, we’re still in New Year’s resolutions territory, so what better time to have a look at using R to analyse data collected from a run? For this analysis I have used the Polar Flow web service to download two attempts at the same Parkrun, recorded on a Polar M600 (which I love, by the way, if you’re looking for a running/ smartwatch recommendation).

The background to the analysis is in the second of the two runs I thought I was doing really well and was going to crush my PB and it ended up being exactly the same as the previous run, in terms of total time taken, but with my heart rate a lot lower.

But I didn’t really feel like I wasn’t pushing myself hard enough, so I can’t really explain why my heart rate has dropped so much without a corresponding increase in performance. One possible explanation is I have moved from being bottlenecked by the performance of my cardiovascular system to being bottlenecked by the performance of my legs, but that these two bottlenecks are very similar in terms of where they cap my pace.

It was pretty fun having a look in R. Here’s a link to the analysis as it stands.

I thought I would look at my race strategy in terms of how fast I went at each point, reasoning that maybe I let myself down on the hills or the straights or something in the second attempt. However, as you can see the pace is absolutely identical the whole way in both runs. The heart rate, as you can see, is consistently lower in the second run, and it only creeps up at the end for the sprint finish (which makes me wonder if I really was pushing myself hard enough).

I need to do more analysis. My next idea is to look at the relationship between incline, heart rate, and pace (the route is pretty hilly so this is quite important).