I have raw data in this txt format:
Name|Occupation|Comment
Robert|Doctor|To process, please provide:
a. Tax Returns
b. Identification
c. Statement of Approval
Sally|Accountant|Approved
Here, | is the delimiter.
For Robert, I want "To process, please provide: a. Tax Returns b. Identification c. Statement of Approval" to be reflected as one string under Comment.
However, using read.csv to import with the following arguments:
read.csv(
"data/text_data",
fileEncoding = "UTF-8",
sep = "|",
na.strings = "",
quote = ""
)
gives me additional rows where
Name Occupation Comment
Robert Doctor To process, please provide:
a. Tax Returns NA NA
b. Identification NA NA
c. Statement of Approval NA NA
Sally Accountant Approved
Are there any R import functions or wrangling tricks to fix this? Thanks.
CodePudding user response:
An easy option splitting the data on a temporary "id" column using by, apply transformations and recombine. For convenience you may wrap this in a function.
myDataReader <- \(link) {
r <- read.csv(link, fileEncoding="UTF-8", sep="|", na.strings="", quote="")
r$id <- cumsum(!is.na(r$Occupation))
do.call(what=rbind, by(r, r$id, \(x) {
cbind(x[1, 1:2], Comment=trimws(paste(x[1, 3], toString(x[-1, 1]))))
}))
}
myDataReader('data/text_data')
# Name Occupation Comment
# 1 Robert Doctor To process, please provide: a. Tax Returns, b. Identification, c. Statement of Approval
# 2 Sally Accountant Approved
# 3 Tom Lawyer To process, please provide: a. Tax Returns, b. Identification
# 4 Sally Accountant Approved
Note: R>= 4.1 used.
Content of 'data/text_data':
Name|Occupation|Comment
Robert|Doctor|To process, please provide:
a. Tax Returns
b. Identification
c. Statement of Approval
Sally|Accountant|Approved
Tom|Lawyer|To process, please provide:
a. Tax Returns
b. Identification
Sally|Accountant|Approved
