Shiny modules for beginners

I’ve seen some discussion on the Internet about whether people learning Shiny should start with modules or whether they’re an advanced topic. I’m not going to link to any examples, because this post is definitely not me saying “hey, look at this rubbish over here, here’s why it’s wrong”. I’ve just seen a couple of posts talking about it and thought I would chip in my perspective.

I really love modules- they solve a lot of problems- and I think they could be a great thing for a beginner to learn. There’s no reason at all why beginners can’t learn them. The only caution I would give those learners is that some things can be weird or fiddly with modules. So if you’re learning at your own pace, and want to do everything really neatly and use industry standards, great, go for it. But if you want to throw up a Shiny application to impress your boss (for example) and learn by doing, then you may find yourself going down some very long, dark rabbit holes trying to get the particular thing you want working (and yes, I am writing this post now because I have been down a few myself recently 😀).
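In case you haven’t met them, a module is just a paired UI function and server function that share a namespace, so the same inputs and outputs can be reused several times in one app without their ids colliding. Here’s a minimal sketch- the counter example that does the rounds, written in the moduleServer() style that needs Shiny 1.5 or later, with names of my own invention:

library(shiny)

# UI half of the module: NS(id) namespaces every input and output id
counterUI <- function(id, label = "Click me") {
  ns <- NS(id)
  tagList(
    actionButton(ns("button"), label),
    textOutput(ns("count"))
  )
}

# Server half of the module: works with the namespaced ids transparently
counterServer <- function(id) {
  moduleServer(id, function(input, output, session) {
    output$count <- renderText(input$button)
  })
}

ui <- fluidPage(
  counterUI("counter1"),
  counterUI("counter2")
)

server <- function(input, output, session) {
  counterServer("counter1")
  counterServer("counter2")
}

shinyApp(ui, server)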

As I say, this is purely my perspective, generated from sitting at my keyboard and interacting regularly with maybe five people learning Shiny. Your mileage may vary.

Getting started in data science for healthcare

I just recruited to a data science post at Nottinghamshire Healthcare NHS Trust and I’ve been asked for feedback by several people about how to get started in data science in healthcare, and indeed how to get a job in data science in healthcare. I will also be recruiting another data scientist for our team in a few months. This post is designed to give you a feel for what we were looking for in the job that I just recruited to, as well as to the job that is coming up. They’re slightly different skill sets as I will discuss.

I should say at the top that I am pretty much a person on the Internet, there are people much better qualified than I am to answer this question, and there are much better jobs out there than the ones I have advertised and will advertise. My trust is a great place to work, and we have a lot of freedom to innovate, and we are big believers in using and publishing open source code. I hope we’re a friendly and supportive team too. But I’m sure there are many out there that do all this and more, and better. With all that said, let’s discuss what I was looking for last time, what I’m looking for next time, and what you can do to get the skills for these or any other similar jobs.

The recent post I advertised was for a band 7 data scientist, one year contract, starting salary of £38K. It’s for a funded piece of work designing algorithms that can read patient feedback and tell you what it’s about (parking, food, communication…) and how positive or negative it is (you saved my life, this service is a disgrace, the ward was dirty, the food was cold…), plus building a dashboard so people can interrogate their data. I don’t want to say too much about it here, otherwise this post will get way too long.

The first thing to say is that the field was VERY strong. I didn’t expect we would get as much interest from as many talented people as we did. We were definitely spoiled for choice, but the areas that ended up being very important were:

  1. Machine learning. Because it’s a one year project, with a specific goal, we were looking for someone who could just run with a machine learning project and could build a pipeline with real data
  2. Team working. The ability to help more junior staff, and to work with non-technical staff, is absolutely vital in such a small team. We were looking for people skills as well as “technical” team working skills- version control, package management, documentation. Avoiding “it works on my machine”, basically
  3. Communication. Some people are completely on board with data science and trust experts to just get on with it and can engage them about what they want and ask the right questions. And some people will sit with their arms folded (literally or figuratively) and wait to be convinced. And a strong organisation needs both. Cheerleaders and critics. And we were looking for someone who could navigate a meeting with a healthy dose of each, as well as someone who could engage the straight up disengaged

If you want to get good at this stuff, there are lots of things you can do. This book is excellent if you want to go the Python route. Kaggle is a great resource for datasets and how-to’s. Try to get some real data in your workplace and do some ML with it. Be a mentee. Be a mentor. Work with your team to get better practice around version control and code style. Sell it to your bosses. Even if they’re idiots and don’t listen, it’s all experience. Come to the interview and say that. “I identified several areas of data and analytic practice that could be improved by x, y, and z. I wrote them a five point plan explaining the training their staff would need, the likely benefits and pitfalls. They are idiots and didn’t listen.”

If you already work in the public sector I highly recommend the data science accelerator. Twelve days, one a week, with a data science mentor. I was a mentee a year and a half ago. Then I was a mentor. Doing one or both is a really good way of getting out of your silo, seeing new datasets and challenges, and learning to ask and answer the right questions. Besides my PhD it’s the best thing I ever did, career wise.

The next job is going to be a bit more general purpose. The team has quite a few pieces of smaller paid work coming in and I need somebody who can help out as well as work in the Trust building on the work we’re already doing: statistics, machine learning, and publishing either as a document or a Shiny powered dashboard. We’re looking for similar stuff as last time, but not as focused on machine learning, perhaps someone with some stats knowledge- regression, experimental design, measures of association, all that stuff. I feel like R would be a big advantage, but if somebody can do all that in Python then that’s good too, and ditto other things like Julia. They would probably end up learning some R just to read the stuff we’re all writing, and we’d all end up learning Python/ Julia just to read their stuff. All to the good.

Again, we need someone who can work in a team (no hero coders) and communicate at all levels of the Trust and beyond. We need someone who can teach, and learn, and someone who can convince the sceptical and engage the disengaged. Something I would absolutely love to be able to recruit for is Linux server maintenance experience. I’m in charge of two Linux servers that do our work for us, one in the cloud, one behind the firewall (see this blog, passim) and I would love to be able to give the keys to somebody who knows what they’re doing. Even more, I’d love for them to do the next upgrade. Going to RStudio Connect with Ubuntu 20.04 was hideous (not the fault of RStudio or Ubuntu, obviously, purely my fault 😀) and having someone be able to worry about that would have been wonderful.

Some people have told me I’m wasting my time managing my own servers as a data scientist and I really need to get proper IT support, and maybe they’re right, but I’ve come this far, even if I am doing it wrong. I’ve been learning Linux server stuff on my own server for 7 years. If you want to learn this way without buying your own cloud server then just buy a Raspberry Pi and ssh into it from behind your firewall at home. Set up a LAMP stack, set up WordPress, set up the free versions of RStudio Server and Shiny Server, run plumber APIs, run PHP, write a Django application, whatever. You’ll get in the most hideous mess and tear your hair out for entire weekends, and you’ll look back on those weekends fondly and be glad of all the learning you did 😀. And if you get in a complete mess just wipe the SD card and start again.

As I say, this is all caveat emptor. Your mileage may vary. But if you want my opinions then these are they. Keep an eye on my Twitter if you’re interested in the job, and feel free to @ or DM me, or indeed send an email, my address is in my Twitter profile.

RStudio Connect behind the firewall

This is part II of what would otherwise have been a far-too-long post about configuring RStudio Connect. A bit of back story, particularly for those of you who might have hit this from a Google search (which does happen, JetPack tells me) and don’t know who the heck I am and what I do all day. Here’s what I said in part I:

I’ve been using RStudio stuff on the server for a long time. I started using Shiny community edition back in 2013 for an application that is totally open and so doesn’t need authenticating. Then two years ago I started deploying Shiny applications that people authenticated to behind our Trust firewall using Shiny Pro. I have wanted to use RStudio Connect for a long time but it was hard to get the funding together for it given how things are with austerity since the banking crisis.

[also, I say later on in the first post- I am NOT DevOps. I’m just a random data scientist trying to get on with his job. So if this is all hilariously wrong/ dangerous/ time consuming, don’t say you weren’t warned].

I work for the NHS in the UK and I have an installation of Shiny Pro behind our firewall. It’s running Ubuntu 20.04 on a VM hosted in a Microsoft environment. It authenticates against Active Directory using the LDAP feature of Shiny Pro and uses HTTPS and LDAPS (of course). Authentication to the MS SQL server is done with Kerberos. A cronjob runs overnight pulling data from the data warehouse ready to be loaded into Shiny applications the next day.

The script is simple: move from Shiny Pro to Connect. The first bit was easy. The LDAP/ AD configuration file looks a little different, but the nice man from IT and I got it working on the second try. HTTPS and LDAPS were also pretty much cut and paste. So far so good. I haven’t configured the email bit yet, partly because we don’t really need it and partly because I’m not sure whether my Trust will see a Linux server under my control firing off emails on a schedule as an IG problem. They would all be routed through the mail servers of my Trust, so it’s no different to me just sitting and sending out emails, but you know what they say: to err is human, but if you really want to mess things up use a computer. You don’t need to set up email as long as your authentication is all sorted, and the authentication works beautifully.

When you publish something you can very easily add users and groups to the people who can view it, and it even does autocomplete. So for example all the relevant groups for the main suite of applications start with “RS_T”. So if you type “RS_T” into the “who can view” box it automatically shows you all the groups that you can add. And you can add people from the staff directory who have never logged in, so you can just add them to everything, send them a link, and they’re off, all using their network password. Beautiful. And as I said in my previous post, that isn’t just ME doing that, it’s any publisher. So the other data scientists can just publish stuff, and add people and groups, and use their exact version of R and their exact version of packages. Compare that with me Filezilla’ing application folders onto a Linux server, testing them, finding they don’t work, and then emailing the person who wrote them saying “it doesn’t work, any idea why?” and them saying “oh I’m using such and such version of tidyr” and me saying “oh I’m not running that on the server, hang on” and back and forwards and… you get the idea.

So that’s what you gain. You can hand off all that responsibility to other people and just do your own thing. But that does come at a price (I’m not saying the price isn’t worth paying. I’m not saying it IS worth paying. I’m just helping you get your head around the migration). There are two things that I got really used to doing with a Shiny Pro installation that you cannot do on a Connect installation, and they can give you a headache. It’s just my perspective, obviously- if I’d never used Shiny Pro I wouldn’t have this perspective- but I think it can help you understand how Connect works.

The first and most obvious thing that you will miss (and also not miss and think “good riddance”) is the file system. As far as I can tell you have NO access to the file system. You can’t pop .Rdata objects in /srv/shiny-server/applications/data_store and then load them from several Shiny applications. You can deploy them straight from your hard drive via the RStudio IDE to the server, for example if you’re writing a quarterly RMarkdown report and you have the data sitting on your computer. Or you can use the pins package (https://github.com/rstudio/pins). The pins package allows anybody with publishing rights to create a data object on Connect that they and other publishers can point to from their own documents/ applications. Again, it’s a nice way of allowing people to deploy stuff (data in this case) without touching the terminal or using Filezilla or whatever.

The related thing that you do not have is cron. Well, you do have cron, but without a file system you can’t do anything with it (🙃). So for my use case, where I have a cronjob that processes data and places it somewhere in the file system, the conversion is a scheduled RMarkdown report that does the processing and then uses the pins package to place the result on the server, where everyone with the right access level can see it. This is nicer, really, because it democratises the sharing of data; otherwise it all has to be placed manually onto a Linux file system. In my case I’m using Kerberos, which means that the top of the script has to get a Kerberos ticket:

system("kinit USERNAME -k -y KEYTAB.FILE")

Which is totally fine, it just feels really weird, and it didn’t occur to me to do it until I asked somebody at RStudio. But note that the keytab file can’t live on the file system either- you have to deploy it with your application, otherwise it won’t be found. You can’t just pop it somewhere safe on the server and forget about it. As I say, there’s nothing wrong with this, it just feels like putting my jacket sleeves on in the opposite order to how I normally do- fine but weird.
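Going back to pins, the scheduled report that replaces my old cronjob ends up looking roughly like this (a sketch using the newer pins API- the processing function and pin name are made up, so treat it as the shape of the thing rather than working code for your setup):

library(pins)

# Authenticate the Kerberos session first (keytab deployed alongside the report)
system("kinit USERNAME -k -t KEYTAB.FILE")

# Do whatever overnight processing used to happen in the cronjob
processed <- process_warehouse_extract()   # stand-in for the real processing

# Write the result to Connect as a pin that other content can read
board <- board_connect()
pin_write(board, processed, "processed_extract")

# Then any Shiny app or report on the same server can do:
# processed <- pin_read(board, "USERNAME/processed_extract")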

The second thing that you will miss (and, again, think “good riddance”) is the ability to launch R in the terminal on the server, load your packages, run some code, see that it works, shut down the terminal, publish the same code via Connect and have it work every time. The specific set of packages that you have installed on your server, whether you’re running R as the root user or as yourself, bears no relation at all to the packages that will run on the Connect server. Which sounds totally fine, but I got myself in a position where the server could run the code quite happily via the terminal, and so could my computer, but it failed when I published it to Connect. This was because the package version that was running on my computer, even though it worked on my computer, wouldn’t install on the server, and although I could install a different, older version of the package on the server, that didn’t help, because Connect doesn’t care what’s on the server- it wants the version that’s on your computer. There is actually a way round this, which RStudio encourage you to avoid where possible, and that is to define some packages as using the server version. You can read about it here: https://docs.rstudio.com/connect/admin/r/package-management/. It looks like this in the config:

; /etc/rstudio-connect/rstudio-connect.gcfg
[Packages]
External = ROracle
External = rJava

Just be warned that, even though they caution against it, I had to use it to resolve some conflicts which I think were caused by different versions being available and current on Windows (where the IDE is) and Linux (on which Connect runs).

Phew! That was a long post. I think that’s all I know about that. Feel free to find me on Twitter or email or (in happier times) at a conference and have a chat about it if you’re interested.

RStudio Connect in the cloud

I’ve been using RStudio stuff on the server for a long time. I started using Shiny community edition back in 2013 for an application that is totally open and so doesn’t need authenticating. Then two years ago I started deploying Shiny applications that people authenticated to behind our Trust firewall using Shiny Pro. I have wanted to use RStudio Connect for a long time but it was hard to get the funding together for it given how things are with austerity since the banking crisis.

NHS organisations can now get a discount for RStudio Connect and I have finally obtained a multi server licence. It runs in the cloud, and I have set up a password-based login for this server to allow us to serve applications to people in our region without their needing to be behind our firewall and without opening up the data for everyone to see (it’s not row level patient data, it’s only summaries, but everyone is more comfortable with it being behind a password). I also have it running on my firewalled server doing the work that Shiny Pro used to do.

I didn’t think deeply about it before I started (more fool me, really), and so I was surprised how different Connect is to Shiny Pro. I’m writing this to give some of my experiences to smooth out the learning curve for others.

I’m going to slightly artificially split this blog post into Stuff I Did In The Cloud and Stuff I Did Behind The Firewall, even though a lot of it could be written under either, just to stop this from being The Longest Blog Post Ever Written About R. If you’re interested in what I’m saying (make up your own mind about that 🙂) then you should read both.

So far all I’ve done on the cloud server is take a Shiny application that works on my machine, deploy it to Connect, and then share it with authenticated users. I will be doing more stuff with MySQL integration and cron, and some other stuff, but I’m going to talk more about databases when I talk about the firewalled one (which is doing it on hard mode, basically).

I wanted to authenticate people without the help of preexisting corporate resources (like Active Directory or LDAP). There are two main ways of doing this: use a cloud-based authentication system (like OAuth with Google), or just get Connect to handle it all. I didn’t really want to force people to use their Google accounts (if they even have them) so I got Connect to deal with all the passwords. This works fine. I think it’s a bit odd that you have to use a terminal based tool to delete users and can’t just do it in the GUI, but no big deal. However, it’s worth noting that in order to get this to work you will need an email server on the host machine. As will be clear to anybody reading, I am not DevOps. I’m just a data scientist who wants to spin up the server and then forget about it and crack on with putting Shiny applications up. I managed to get Shiny Server community edition working pretty easily some years back and, with quite a lot of hard work and help, also got Pro running behind our firewall (with Kerberos, LDAPS, all that).
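For what it’s worth, the authentication part of the Connect configuration ends up being tiny- this is the gist of it, from memory, so check the admin guide for anything beyond the provider line:

; /etc/rstudio-connect/rstudio-connect.gcfg
[Authentication]
Provider = password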

With Connect and needing a mail server that’s another leap forward in terms of your Linux skills. It’s not so much getting postfix working, that’s actually fairly simple, it’s all the configuration necessary to convince other people’s mail servers that you’re not sending them spam. Fortunately I actually did that a few years back implementing a laughably shonky cron job that picked up important pieces of data and emailed them to managers overnight if there was anything they needed to see. I needed to add the server to a special allow list to get through NHS email filters, even. It’s lucky for me that this work was already done because with my pretty limited knowledge this would have been a stumbling block.

I don’t even know if this stuff about Linux skills is important. I’ve been doing this work with servers for some years now, cheap and cheerful, no DevOps support, just to get things working. Some people on Twitter seem to think that it’s a fool’s errand and you absolutely need proper Linux support in your organisation to use RStudio products. It has not been easy for me to learn and do all this stuff, and if I’m honest this isn’t really what I dreamed about when I was doing my psychology PhD. But the fact remains that Linux support is rare to non-existent in the NHS and this whole journey has been either me doing it or nobody doing it. I feel that my organisation has made great strides with R, partly because of the work I have put in messing around with the servers, but mainly because of the enthusiasm and talent of the people writing the R. I’ve never been able to look these people in the eye and say “I’m sorry, our organisation just doesn’t support that and neither do I” and so here we are. Draw your own conclusions. I claim no expertise in any of this. Our servers definitely do what they say on the tin. That’s as far as I’ve got.

Once I’d got all the stuff with the email sorted it was plain sailing. It’s worth knowing that Connect can store multiple versions of R and multiple versions of packages, so the people deploying to the server can deploy their exact configuration. The big change for me is that other people can put their own stuff up and (if they’re an admin) they can also invite other people to Connect and give them access to applications. So it’s sort of fire and forget, which is wonderful, obviously.

It’s worth noting also that if you’re collaborating on an application, everyone needs the rsconnect/ folder, which contains details of where and how to deploy the application. You can put that in a shared area or just check it into GitHub, and anyone with publisher rights can update one of your applications.

Just as RStudio have been telling me, Connect is really a platform for data scientists to publish their own work. It allows you to give all of your data scientists the tools they need to deploy reports and dashboards and to sign people up and authenticate them, all without touching the server. And that is very liberating for me.

There’s a mental handbrake turn when we get to working with databases and data, which is relevant in the cloud too, but I’m going to talk about it in the next blog post, because that’s where I encountered it first and where I did the maximum amount of head scratching.

Productionising R at Nottinghamshire Healthcare

I’m hopeful we’re moving into a bit of a new phase with using R in my Trust so I thought I’d outline the direction of travel, to see if it chimes with anyone else and just to keep people up to date about what we’re doing.

We’ve used Shiny for some years now, maybe 7, and we have applications behind the firewall and in the cloud which are well used by the staff who need them. We’ve also been building our skills (R and data ops, which is a whole other post) and I’m hopeful that we’re ready for the next phase, which I would call the “productionise” phase. There are two main tasks to achieve before we’ve really got the work that we’re doing nicely embedded in everyday practice at Nottinghamshire Healthcare.

The first thing we’re looking at at the moment is using the data warehouse better. We have a very large and well featured data warehouse and it does loads of really great stuff, but we (and I’m the worst offender for this) are still relying too much on pulling stuff out of it to, say, Excel, and then reading that into R and off we go with an analysis. It’s a lot of work getting it out, and you can’t easily communicate with the data warehouse people about what you did, because you’ve introduced this whole other layer of messing around with Excel sheets on top. So the first bit for me (and this is not revolutionary in any way, we’re just building up to something) is to use the {odbc} and {pool} packages, with Shiny, to interface directly with the data warehouse, do the analysis, deploy it as a Shiny application, and just have it living live on the data warehouse. It’s striking that when you do that you’re suddenly talking the same language as your BI team and you can communicate to them what you’ve done and give them the tools to reimplement it themselves if they want to.
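The shape of that is roughly as follows (purely a sketch- the DSN, table, and column names are invented, so it won’t run against anything real):

library(shiny)
library(odbc)
library(pool)
library(dplyr)

# One pool of database connections shared by every session,
# closed when the application shuts down
pool <- dbPool(odbc::odbc(), dsn = "DataWarehouse")
onStop(function() poolClose(pool))

ui <- fluidPage(
  selectInput("team", "Team", choices = NULL),
  tableOutput("summary")
)

server <- function(input, output, session) {

  # Populate the dropdown from the warehouse itself
  updateSelectInput(session, "team",
                    choices = pool %>% tbl("Referrals") %>%
                      distinct(TeamName) %>% pull())

  # The query is translated to SQL and runs in the warehouse, not in R
  output$summary <- renderTable({
    pool %>%
      tbl("Referrals") %>%
      filter(TeamName == !!input$team) %>%
      count(ReferralSource) %>%
      collect()
  })
}

shinyApp(ui, server)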

And the second bit we need to get right is productionising the Shiny applications themselves. I’ve been looking at the {golem} package to do this. Golem really appeals to me, first and foremost because it provides a robust framework for modularising code. I’m starting to realise that I write the same Shiny code over and over again. Something in particular that I have done a lot is write code that makes sense of a spreadsheet. Ultimately I would like to write a module that can take an incredibly messy spreadsheet and, with a few clicks from the user, tidy it up ready for further processing. With the right level of modularity I could use that over and over again. I did start writing one a little while ago but I got distracted.
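Golem makes that sort of module easy to scaffold- something like this (sketched from memory; the package and module names are invented, and the generated files still need filling in):

# Run once to scaffold a golem package
golem::create_golem("tidyspreadsheet")

# Inside the new project, scaffold a module for the spreadsheet-cleaning step
golem::add_module(name = "clean_spreadsheet")

# This creates R/mod_clean_spreadsheet.R, containing
# mod_clean_spreadsheet_ui() and mod_clean_spreadsheet_server(),
# which then get wired up in app_ui.R and app_server.R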

I’ve got a few projects coming up that exemplify the process, and an abstract submitted to the R in Medicine conference, so hopefully I’ll be back to say more about all this in due course with some real examples. In the meantime there’s some bare bones golem code here and here, but it’s early days so please don’t judge me 🙂

app.R and global.R

I’m doing some Shiny training this year and I want to teach whatever the new thinking is, so I’ve been reading Hadley Wickham’s online book Mastering Shiny. There are a couple of things I’ve noticed where Shiny is moving on, so if you want to keep up to date I suggest you have a look. I’m going to pick out a few here. Firstly, note that in Shiny 1.5 (which is not released at the time of writing) all code in the R/ directory will be sourced automatically. This is a very good idea- I’ve got loads of source("useful_code.R", local = TRUE) lines in some of my applications, so it gets rid of all that.

Something else that’s changing is that modules work slightly differently, which I haven’t absorbed yet- I’ve got enough on my plate with {golem}- but if you’re using modules I suggest you keep abreast and have a look at the section on modules.

The thing that has just pricked up various ears on Twitter, though, is the lack of an option in the RStudio new-application wizard to create separate server.R and ui.R files. I was very surprised when I noticed this. And indeed if you look at Hadley’s book there is no mention of server.R and ui.R- it’s all just app.R. He suggests that if you are building a large application you hive off the logic (in Shiny 1.5, into the R/ folder, or for now by calling source("file.R", local = TRUE)).

But then straight away you’re wondering where that leaves global.R. For the uninitiated, global.R is a separate file that’s really useful when you’re deploying to a server, because it is run only once, when the application is first started, and its contents are then available to every session. Its contents are also available to ui.R, which can be helpful when setting up the UI based on what’s in the data.

As I mentioned, I want to teach where the world is going, so I’m trying to do things the new way (I have never taught app.R because, honestly, I hate it, but I guess with the new automatically sourced R/ folder I can see the reasoning, and I’m not going to argue with the folks down at RStudio anyway 🙂)

I wasn’t sure how app.R fitted in with global.R and running on a server, so I have written a test application. You can see the code here. I’m sorry it won’t run, because you don’t have the data and, being honest, I can’t be bothered to deploy it properly somewhere you can get it, but you get the idea.

The first time the code runs, the Sys.sleep(10) runs and you get a big pause. But, sure enough, when you go back it doesn’t run and you get straight in. You can see also that the contents of the data file are available to the UI (choices = unique(ae_attendances$Name)). Lastly, take my word for it that if you add a file called restart.txt to the folder, it will rerun (and generate the 10 second pause) the next time you go to the application, just as I used to do with global.R.
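If it helps, the shape of the test app is roughly this (a reconstruction rather than the real code- the .rds file here is a hypothetical stand-in for my data):

# app.R
library(shiny)

# Everything at the top level runs once per R process on the server,
# just like global.R used to- hence the pause only on first load
Sys.sleep(10)
ae_attendances <- readRDS("ae_attendances.rds")   # hypothetical data file

ui <- fluidPage(
  selectInput("name", "Organisation",
              choices = unique(ae_attendances$Name)),
  plotOutput("plot")
)

server <- function(input, output, session) {
  output$plot <- renderPlot({
    plot(ae_attendances[ae_attendances$Name == input$name, ]$attendances)
  })
}

shinyApp(ui, server)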

That’s all I know at the moment. I hope this is clear and useful (and correct!), it’s all based on stuff I have cobbled together today looking at Hadley’s book and messing around.

Data science for human beings

Someone just emailed to ask me about getting into data science. They knew about all the usual things- linear algebra, Python, and so on- so I thought I’d talk about the other side of data science. It’s all stuff I say whenever I talk about data science, but I’ve never written it down, so I thought I may as well blog it.

There are three things that are probably harder to learn but that will make you stand out at interview and make you a better data scientist.

First. Understand your data. I work in healthcare. Some of the data I work with is inaccurate, or out of date, or missing, or misspelled, or just plain wrong. It’s my job to understand these processes. There’s a saying in healthcare that goes something like “60-80% of apparent differences between healthcare providers are related to different practices with data”.

Second. You work in a team. With other data scientists, and with the wider organisation. Don’t be a hero coder, going off to your bunker to write some piece of genius that nobody understands and nobody wants. Work agile. Get buy in as you go. Mentor people. Be humble. Listen to what people want. Don’t do analysis because it’s flashy and cool. Build what your users want. Get to know them. Understand UX and UI. There’s a saying I like that goes “We’re all smart. Distinguish yourself by being kind”. Be a team player. Share the glory, and the blame.

Third. Have an opinion. Don’t just learn every method going and apply them according to whatever medium.com posts are saying this month. Scan the horizon. Find new stuff. Dig out old stuff. Think critically about the work that you and others are doing. And sometimes, when you can, go in large. So far in my career I’ve bet large on Shiny and text mining, and they’ve both paid off for the organisation I work for and for me. My latest pick is {golem}. I think it’s going to be massive and I want to be near the front of the pack if and when it is. Trust yourself. It’s your job to support your organisation with their priorities, but it’s your job to know stuff they don’t know and to push your organisation along a bit. I’ve never done anything really significant in my career that somebody asked me to do. I’ve pitched all my really significant projects, although obviously I spend most of my time building stuff people want and ask for (see point two).

NHS data science and software licensing

I’m writing something about software licensing and IP in NHS data science projects at the moment. I don’t think I ever dreamed about doing this, but I’ve noticed that a lot of people working in data science and related fields are confused about some of the issues and I would like to produce a set of facts (and opinions) which are based on a thorough reading of the subject and share them with interested parties. It’s a big job but I thought I’d trail a bit of it here and there as I go. Here’s the summary at the end of the licences section.

The best software licence for a data science project will vary case by case, but there are some broad things to consider when choosing one. The most important decision to make is between permissive and copyleft licences. Permissive licences are useful when you want to maximise the impact of something and are not worried about what proprietary vendors might do with your code. Releasing code under, for example, an MIT licence gives everybody, including people working on proprietary code, the chance to use the code under that licence.

Using a copyleft licence is useful when there is concern about what proprietary vendors might do with a piece of code. For example, vendors could use some functionality from an open source project to make their own product more appealing, use that functionality to sell their product to more customers, and then use that market leverage to help them achieve vendor lock-in. Vendor lock-in is the state in which using a company’s products makes it very difficult to move to another company’s products. An example might be using a proprietary statistical software package and saving data in its proprietary format, making it difficult to transfer that data to another piece of software. If proprietary software companies can use code to make the world worse in this way, then choosing a copyleft licence is an excellent way of sharing your code without allowing anybody to incorporate it into a proprietary codebase. Proprietary software companies are free to use copyleft licensed code, but are highly unlikely to do so, since it means releasing all of the code that incorporates it.

The Great Survey Munge

As I mentioned on Twitter the other day, I have this rather ugly spreadsheet, from some online survey software, that requires quite a lot of cleaning before it can be uploaded to the database. I had an old version of the cleaning script written in base R, but the survey has changed, so I’ve updated it to the tidyverse.

And this is where tidyverse absolutely shines. Using it for this job really made me realise how much help it gives you when you’ve got a big mess and you want to rapidly turn it into a proper dataset, renaming, recoding, and generally cleaning.

It must be half the keystrokes of the base R script it replaces, or even fewer. There are some quite long strings in there, which come from the survey spreadsheet, but those are all just cut and paste- I didn’t write anything for them.

I love it profoundly, and I bet if I was more experienced I could cut it down even more. Anyway, here it is for your idle curiosity. It is obviously of no use to anybody who isn’t working on this data, and to be honest there are bits that you probably won’t even understand, but a quick look should show you just how much help I had from the wonderful tidyverse maintainers.

https://gist.github.com/ChrisBeeley/488c3a8fa35b57d8b40232d70e1dfdc9

Rapidly find the mean of survey questions

Following on from the last blog post, I’ve got quite a nice way of generating lots of means from a survey dataset. This one relies on the fact that I’m averaging questions that go 2.1, 2.2, 2.3, and 3.1, 3.2, 3.3, so I can look for all questions that start with “2.”, “3.”, etc.

library(tidyverse)

# For each question number, average all the sub-questions (e.g. 2.1, 2.2, 2.3)
# and add the result as a new column on the original dataset

survey <- survey %>% bind_cols(

  map(paste0(2 : 9, "."), function(x) {

    # all columns whose names start with "2.", "3.", etc.
    return_value <- survey %>%
      select(starts_with(x)) %>%
      rowMeans(na.rm = TRUE) %>%
      as.data.frame()

    # label the new column "Question 2", "Question 3", and so on
    names(return_value) <- paste0("Question ", sub("\\.$", "", x))

    return(return_value)
  })
)

This returns the original dataset plus the averages of each question, labelled “Question 2”, “Question 3”, etc.