I am working on cleaning a large collection of text. My process thus far is:
- Remove any non-ASCII characters
- Remove URLs
- Remove email addresses
- Correct kerning (e.g., "B A D" becomes "BAD")
- Correct elongated words (e.g., "baaaaaad" becomes "bad")
- Ensure there is a space after every comma
- Replace all numerals and punctuation, except apostrophes, with a space
- Remove any term 22 characters or longer (anything this size is likely garbage)
- Remove any single letters that are left over
- Remove any blank lines
My issue is in the next-to-last step. Originally, my code was:
gsub(pattern = "\\b\\S\\b", replacement = "", perl = TRUE)
but this broke the contractions that I had deliberately left in. Then I tried
gsub(pattern = "\\b(\\S^'\\s)\\b", replacement = "", perl = TRUE)
but this left a lot of single characters.
Then I realized that I needed to keep three single-letter words: "A", "I", and "O" (either case).
Any suggestions?
CodePudding user response:
You can use
gsub("(?i)\\b(?<!')(?![AOI])\\p{L}\\b", "", x, perl=TRUE)
Details:
- (?i) - case-insensitive matching on
- \b - a word boundary
- (?<!') - no ' is allowed immediately on the left
- (?![AOI]) - the next char cannot be A, I, or O (in either case, because of (?i))
- \p{L} - any Unicode letter
- \b - a word boundary
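
As a quick check, here is a minimal sketch that applies the pattern to a few made-up strings. The sample vector x and the final whitespace tidy-up are assumptions added for illustration; they are not part of the original question or answer.

```r
# Hypothetical sample input, for illustration only
x <- c("a b c don't stop", "I x it's O k", "q")

# Pattern from the answer: a single letter between word boundaries,
# not preceded by an apostrophe, and not A, I, or O (either case)
pattern <- "(?i)\\b(?<!')(?![AOI])\\p{L}\\b"

cleaned <- gsub(pattern, "", x, perl = TRUE)

# Optional tidy-up (an assumption, not part of the answer):
# collapse the runs of spaces left behind and trim the ends
cleaned <- trimws(gsub(" {2,}", " ", cleaned))

cleaned
# "a don't stop"  "I it's O"  ""
```

The "t" in "don't" and the "s" in "it's" survive because of the (?<!') lookbehind. Note that the lookbehind only protects letters that follow an apostrophe: a single letter that comes before one (for example the "y" in "y'all") would still be removed, so if such forms occur in your text it may be worth adding a (?!') lookahead after \p{L}.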
