gsub extracting string

My sample data is:

    c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
      "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
      "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
      "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")

For each line, I want to extract the variable name:

Line 1: "PEMJNUM", Line 2: "PRFAMTYP", Line 3: "PUBUS1", Line 4: "PEIO1COW"

My initial plan was to use gsub to remove the characters to the left and right of each variable name, leaving just the names. I could only match everything to the right of the variable name, though, and had trouble matching the characters to the left (as shown here: https://regexr.com/67r6j).

Not sure if there's a better way to do this!

CodePudding user response:

You can use sub in the following way:

    x <- c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
           "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
           "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
           "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")
    # Replace the whole string with Group 1, the word just before the last standalone 2
    sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl=TRUE)
    # => [1] "PEMJNUM"  "PRFAMTYP" "PUBUS1"   "PEIO1COW"
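One thing to keep in mind: when the pattern does not match, sub returns the input string unchanged, so a malformed line would silently come back whole. Below is a minimal sketch of a guard against that. The helper name extract_var and the second test string are hypothetical and not part of the original answer.

```r
# Hypothetical helper (an assumption, not from the original answer):
# return NA when the pattern does not match, instead of the unchanged
# input that sub() would give back.
extract_var <- function(x, pattern = "^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*") {
  hit <- grepl(pattern, x, perl = TRUE)      # which elements match at all
  out <- sub(pattern, "\\1", x, perl = TRUE) # keep only Group 1 on a match
  out[!hit] <- NA_character_                 # flag non-matching elements
  out
}

res <- extract_var(c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
                     "a line with no standalone two"))
res
# => [1] "PEMJNUM" NA
```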


Details:

^ - start of string
(?:.*\b)? - an optional non-capturing group that matches any zero or more chars, as many as possible, and then a word boundary position. Because perl=TRUE is used, . does not match line break chars; if you need to match line breaks too, add (?s) at the start of the pattern
(\w+) - Group 1 (\1): one or more word chars
\s* - zero or more whitespaces
\b - a word boundary
2 - the digit 2
\b - a word boundary
.* - the rest of the line/string.
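The "as many as possible" part matters for line 2 of the sample data, which contains more than one standalone 2. The sketch below contrasts the greedy pattern with a lazy variant; the lazy pattern is only an illustration and is not something the answer recommends.

```r
# Line 2 of the sample data: it has a standalone 2 after "PERSON" and
# another one after "PRFAMTYP".
line2 <- "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156"

# Greedy .* (the pattern from the answer): settles on the last standalone 2.
greedy <- sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", line2, perl = TRUE)

# Lazy .*? (illustration only): settles on the first standalone 2.
lazy <- sub("^(?:.*?\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", line2, perl = TRUE)

greedy
# => [1] "PRFAMTYP"
lazy
# => [1] "PERSON"
```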

If there is always at least one whitespace char before the 2, the regex can be written as "^(?:.*\\b)?(\\w+)\\s+2\\b.*".
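As a quick sanity check, the two patterns can be compared on the sample data. This only shows that they agree on these four lines; it is not a claim about other inputs.

```r
x <- c("2\tNO  PEMJNUM\t 2\tALTOGETHER HOW MANY JOBS\t216 - 217",
       "1\tREFERENCE PERSON 2\tSPOUSE 3\tCHILD 4\tOTHER RELATIVE (PRIMARY FAMILY & UNREL) PRFAMTYP\t2\tFAMILY TYPE RECODE\t155 - 156",
       "5\tUNABLE TO WORK  PUBUS1\t 2\tLAST WEEK DID YOU DO ANY\t184 - 185",
       "2\tNO  PEIO1COW\t 2\tINDIVIDUAL CLASS OF WORKER CODE\t432 - 433")

# Original pattern: optional whitespace, then a word boundary before the 2.
a <- sub("^(?:.*\\b)?(\\w+)\\s*\\b2\\b.*", "\\1", x, perl = TRUE)

# Variant: one or more whitespace chars directly before the 2.
b <- sub("^(?:.*\\b)?(\\w+)\\s+2\\b.*", "\\1", x, perl = TRUE)

identical(a, b)
# => [1] TRUE
```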