I have several .txt files that I would like to read and then rbind in R. I expect each .txt file to yield 1 row and 115 columns. First problem: I get the following warning message: “incomplete final line found by readTableHeader on…”. I have many files, so I can't open each one, go to the last line and press Enter. Some solutions I found on the Internet didn't work because of the second problem below.
Second problem: the column names and the column contents are not split by a regular table separator. The .txt files look like this: "DIARREIA":1,"DISPNEIA":2, where "DIARREIA" and "DISPNEIA" are column names and 1 and 2 are the column contents. There is a colon (:) between each column name and its content.
Here is my code; 2 example files are available at https://drive.google.com/drive/folders/16U8J12Ld7PI5DI-ph_2QCysTxFGKZ-QP?usp=share_link.
```
setwd("C:/User/BOX")
unzip("C:/User/BOX/data.zip")
list.files()
temp = list.files(pattern = "*.txt")
df = do.call("rbind", lapply(temp, function(x) read.table(x, stringsAsFactors = T, header = TRUE)))
```
Any help, please? Thanks in advance!
CodePudding user response:
Hello Baptista: install jsonlite if you haven't already, then try this:
```
# this line installs jsonlite
if(!("jsonlite" %in% installed.packages())) install.packages("jsonlite")
setwd("C:/User/BOX")
unzip("C:/User/BOX/data.zip")
temp <- list.files(pattern = "*.txt")
df <- do.call("rbind", lapply(temp, jsonlite::read_json))
```
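One caveat: read_json returns a list per file, so rbind on those lists gives a list matrix rather than a regular data frame. Here is a minimal sketch of one way around that, assuming each file holds a single flat JSON object with no null values and that all files share the same keys. The two sample files written below are hypothetical stand-ins for the real ones, and read_one is just an illustrative helper name.

```r
library(jsonlite)

# Hypothetical stand-ins for the real files: two tiny one-object JSON files.
tmp_dir <- tempfile("records_")
dir.create(tmp_dir)
writeLines('{"DIARREIA":1,"DISPNEIA":2,"TEMPERATURA":"36,9"}',
           file.path(tmp_dir, "subject1.txt"))
writeLines('{"DIARREIA":2,"DISPNEIA":1,"TEMPERATURA":"36,5"}',
           file.path(tmp_dir, "subject2.txt"))

# "\\.txt$" is a regular expression; list.files() does not take shell globs.
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Read one file into a one-row data frame and keep the file name,
# since the subject id is not stored inside the record itself.
read_one <- function(path) {
  rec <- jsonlite::fromJSON(path)                     # named list of scalars
  row <- as.data.frame(rec, stringsAsFactors = FALSE)
  row$subject <- tools::file_path_sans_ext(basename(path))
  row
}

df <- do.call(rbind, lapply(files, read_one))
print(df)
```

With the real data, point list.files() at the unzipped folder instead of tmp_dir. If some files have keys that others lack, plain rbind will fail and the rows would need to be aligned first.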
CodePudding user response:
You've found yourself some medical records that are close to Debian Control File (DCF) format. See ?read.dcf for the explanation of a properly formed .dcf file. You can get this result:
subject1_2_4
subject PERDADEPALADAR1 PERDADEPALADAR ALTOFLUXOCATETERNASAL
1 1 false, false 1 false
2 2 NA 2 false
INSUFICINCIARENAL1 DATADEALTADAUTI DATADEADMISSOUTI
1 false
2 false 9\\/17\\/2020 12:00:00 AM 9\\/12\\/2020 12:00:00 AM
IMUNOMODULADORQUAIS DATADAALTA SITUAODOCASODESRAG DIARREIA
1 10\\/6\\/2020 12:00:00 AM 0 1
2 9\\/19\\/2020 12:00:00 AM 1 2
DESFECHODOPARTO CLOROQUINAHIDROXICLOROQUINA LINFOCITOPENIA1
1 -1 false false
2 -1 false false
OUTROSSINTOMASPERSISTENTES PO2 DISPNEIA OXIGENOTERAPIA
1 Ansiedade false 2 true
2 false 1 true
INSUFICINCIARESPIRATRIA PROFISSIONALDESADE TRIGLICRIDES FERRITINA1
1 0 2 false false
2 1 0 false false
DATAADMISSAO TOSSE1 DOENAHEMATOLGICACRNICA DDIMERO1 PARTO
1 9\\/24\\/2020 12:00:00 AM false false false 0
2 9\\/16\\/2020 12:00:00 AM false false true 0
COINFECOES SNDROMEDEDOWN PERDADEOLFATO DIABETESMELLITUS RENDAFAMILIAR
1 1 false 1 true
2 1 false 2 false
SATURAOO2 VENTILAOMECNICAINVASIVA DDIMERO
1 96 false false
2 96 false true
ANTIBITICOSQUAISETEMPODEUSO
1 Ceftriaxona 2g 24\\/24h 3d\nTazocin 4.5mg 6\\/6h 7d
2 Azitromicina 500mg 24\\/24h 5d\nCeftriaxona 1g 24\\/24h 7d
TRABALHODEPARTOPREMATURO VENTILAOMECNICAEMPOSIOPRONA OUTRASCAUSASDEADMISSOUTI
1 0 false
2 0 false
OUTRASSEQUELAS DATARESULTADOCONFIRMATRIOPARACOVID TOSSE DOENCAHEPTICACRNICA
1 8\\/1\\/2020 12:00:00 AM 2 false
2 9\\/17\\/2020 12:00:00 AM 1 false
PROTENACREATIVA1 ARTRALGIADORNASARTICULAES ENCAMINHAMETODEOUTROSERVIO ASMA
1 false false 2 false
2 false false 2 false
TRIMESTREDEGESTACAO PO21 INSUFICINCIARESPIRATRIA1 TIPODEPARTO OBESIDADE
1 false false -1 false
2 false true -1 false
FRAQUEZA OUTROS VOMITO DHLLDL1 IVERMECTINA
1 false Febre\ncoriza 1 false false
2 false Piora do quadro geral 2 false false
DIAGNSTICOCLNICOINICIAL ADMISSOUTI ALTOFLUXOMASCARA VITAMINAC FADIGA
1 Pneumonia e COVID 2 false false 2
2 Pneumonia e COVID 1 true false 2
PROTENACREATIVA VITAMINAD QUAISCOINFECES IMUNODEFICINCIA COCLHICINA
1 false false Pneumonia false false
2 false false Pneumonia bacteriana false false
ONDEFOIREALIZADOOPRIMEIROATENDIMENTODOPACIENTE
1 6
2 6
ANTICOAGULANTEQUAISETEMPODEUSO1 CONTATODE FALNCIADERGOS SEPSE PERDADEOLFATO1
1 Clexane 40mg 24\\/24h 12d 0 false 0 false
2 Clexane 40mg 24\\/24h 7d 1 false 0 false
INSUFICINCIARENAL EXPOSICAO DORABDOMINAL CHOQUE TCNAINTERNAO
1 0 -1 2 false 0
2 0 -1 2 false 2
DESCONFORTORESPIRATRIO DHLLDL ANTIVIRAISQUAISETEMPODEUSO NITAXOZANIDA
1 2 false false
2 2 false false
DATA SEPSE1 DOENANEUROLGICACRNICA ZINCO PACIENTEGESTANTE
1 8\\/27\\/2022 12:00:00 AM false false false 0
2 8\\/26\\/2022 12:00:00 AM false false false 0
OUTROSSINAISDEGRAVIDADE TIPODEEXAME DOENCACARDIOVASCULARCRNICA
1 0 true
2 0 false
PARALISIADEDOENTECRTICO DOENARENALCRNICA1 TEMPERATURA
1 false false 36\n9
2 false false 36\n5
FATORESDERISCOPARAGRAVIDADEEMGESTANTE INSUFICINCIACARDACA TRIGLICRIDES1
1 -1 false false
2 -1 false false
FALTADEAR AMNSIAESQUECIMENTO CORTICOIDESQUAISETEMPODEUSO LINFOCITOPENIA
1 false false Dexametasona 6mg 24\\/24h 10d false
2 false false Dexametasona 6mg 24\\/24h 7d false
OUTRAPNEUMOPATIACRNICA DORDEGARGANTA DESFECHOCLNICODOPACIENTE FIBROSEPULMONAR
1 false 2 1 false
2 false 2 1 false
BAIXOFLUXOCATETERNASAL RACA MIALGIADORNOCORPO DOENARENALCRNICA FERRITINA SEXO
1 true -1 false false false 0
2 true -1 false false false 0
PARADACARDIORRESPIRATRIA MIALGIA PURPERA ESPECTROCLNICOADMISSO TROMBOSE
1 false 2 false 1 false
2 false 2 false 1 false
ENDERECOTIPO
1 0
2 0
>
But there is a certain amount of mucking around to do first. It can be done in R, though it is likely easier in a text editor. With the .dcf rules in mind, we might try the following (having already copied and pasted subject1 and subject2 into one text file, held here as subject1_2):
```
# each step works on the result of the previous one
subject1_2_step1 <- gsub('\\{', '', subject1_2)
subject1_2_step2 <- gsub('\\}', '', subject1_2_step1)
subject1_2_step3 <- gsub(',', '\n', subject1_2_step2)
subject1_2_step4_dcf <- read.dcf(textConnection(subject1_2_step3), all = TRUE)
```
Error in read.dcf(textConnection(subject1_2_step3), all = TRUE) :
Invalid DCF format.
Regular lines must have a tag.
Offending lines start with:
list(c("false
9\"
"false
5\"
))
It is easier to see in a text editor that these (9 and 5) are continuations of the prior tag:value pair and should have a space before them. In the output above they sit under TEMPERATURA (36\n9 and 36\n5), so they look like the decimal part of a comma-decimal value that the comma replacement split in two. You could find them with a regex and insert the spaces, but in the end you still wouldn't have subject:1 or subject:2 as seen above, because those aren't in the records; they're the file names. The same could likely be said for jsonlite. I also replaced every '"' with '' to make the column names easier to read.
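As a rough sketch of doing that cleanup in R rather than a text editor: instead of replacing every comma, replace only the commas that are directly followed by a quoted tag and a colon, so commas inside values such as 36,9 stay put. The sample records and the helper name json_to_dcf_lines are made up for illustration, and the regex assumes no value contains an escaped quote or text shaped like ,"TAG":.

```r
# Hypothetical sample records, shaped like the question's files.
records <- c(
  subject1 = '{"DIARREIA":1,"DISPNEIA":2,"TEMPERATURA":"36,9","OUTROS":"Febre, coriza"}',
  subject2 = '{"DIARREIA":2,"DISPNEIA":1,"TEMPERATURA":"36,5","OUTROS":"Piora do quadro geral"}'
)

# Turn one record into DCF-style "TAG:value" lines.
json_to_dcf_lines <- function(x) {
  x <- gsub("^\\s*\\{|\\}\\s*$", "", x, perl = TRUE)  # drop the outer braces
  # Break only at commas followed by "TAG": so commas inside values survive.
  x <- gsub(',(?="[^"]+":)', "\n", x, perl = TRUE)
  x <- gsub('"', "", x, fixed = TRUE)                 # drop the quotes
  strsplit(x, "\n", fixed = TRUE)[[1]]
}

# DCF records are separated by blank lines, so add "" after each record.
dcf_lines <- unlist(
  lapply(records, function(r) c(json_to_dcf_lines(r), "")),
  use.names = FALSE
)

con <- textConnection(dcf_lines)
df <- read.dcf(con, all = TRUE)   # all = TRUE returns a data frame
close(con)

# The subject id comes from the file (here: vector) names, not the records.
df$subject <- names(records)
print(df)
```

Every column comes back as character, so numeric fields and comma-decimal values like 36,9 would still need converting afterwards.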
