SOF newbie here, so please bear with me and some daft questions!
Can anyone please tell me what the most efficient way would be to read in ~50 excel files with multiple tabs into R, select a specific tab from all 50, and combine those tabs by date to create a large dataframe?
For example, let's assume I want to see what GP activity has been like in a specific healthcare region here in the UK for the last 4 years. I currently have to download many many monthly reports like this one Appointments in General Practice, March 2022
I then have to go to table 3a, scroll down to the NHS area code "07k", and then paste the activity from that row into a separate excel sheet.
I have to do this a lot, and it takes hours, but am not coding literate enough to figure out a better way of doing this.
From the guidance of others, my attempts to date have involved downloading many months' data to a desktop folder. I set my working drive to that folder, and then attempt the following:
files <- list.files(pattern = "*.xls", full.names = TRUE)
for(i in files){
filepath <- file.path("C:/Users/myname/Desktop/recordsfolder", i)
assign(i, readxl::read_xls(file path))
}
But the honest truth is I don't really understand this, nor what the next step would be.
If anyone could point me in a better direction I'd be very grateful.
CodePudding user response:
Okay so there is a function readData that reads and filters each of the table 3a's from the files in list.files. You then apply this function to all of the files using map and then bind_rows to reduce the list of datasets to a single dataset.
I dont know how standardised your reports are so you may need to adjust the read function accordingly
library(readxl)
library(tidyverse)
readData <- function(path){
colnames <- read_excel(path, sheet = "Table 3a", skip = 10, n_max = 1) %>%
gather(header1, header2) %>%
mutate(across(everything(), ~gsub("\\.\\.\\.\\d ", "", .))) %>%
mutate(colname = ifelse(header1=="", header2, paste0(header1, "_", header2))) %>%
pull(colname)
read_excel(path, sheet = "Table 3a", skip = 13, col_names = colnames) %>%
filter(grepl("07k", `NHS Area Code`, ignore.case = TRUE))
}
files <- list.files("C:/Users/myname/Desktop/recordsfolder", full.name = TRUE)
files %>%
map(readData) %>%
bind_rows()
CodePudding user response:
One approach is improve your workflow is to move away from excel files, whenever possible, especially if CSV files are already available.
For example, instead download the zip file https://files.digital.nhs.uk/EA/072466/Appointments_GP_Daily_Mar_22.zip and then read in the Mar 2022 csv, using data.table::fread()
library(data.table)
mar22 <- fread(cmd = 'unzip -p Appointments_GP_Daily_Mar_22.zip CCG_CSV_Mar_22.csv')
You can then construct information for Table 3a. For example, appointment counts by status and CCG:
dcast(
mar22[, sum(COUNT_OF_APPOINTMENTS), .(CCG_NAME,APPT_STATUS,CCG_ONS_CODE)],
CCG_ONS_CODE CCG_NAME~APPT_STATUS, value.var="V1"
)
Output:
CCG_ONS_CODE CCG_NAME Attended DNA Unknown
1: E38000006 NHS Barnsley CCG 118066 5545 5595
2: E38000007 NHS Basildon and Brentwood CCG 121998 3572 4626
3: E38000008 NHS Bassetlaw CCG 61864 2293 1874
4: E38000014 NHS Blackburn with Darwen CCG 76233 4228 3243
5: E38000015 NHS Blackpool CCG 88336 5267 3362
---
102: E38000253 NHS Hampshire Southampton and Isle of Wight CCG 763293 36026 30755
103: E38000254 NHS Kirklees CCG 219937 7859 7637
104: E38000255 NHS North East London CCG 865775 57636 42749
105: E38000256 NHS North West London CCG 1127325 59500 45858
106: E38000257 NHS Shropshire and Telford and Wrekin CCG 236456 9836 9884
