I have partly faulty transcriptions that I want to clean up:
df <- data.frame(
  id = 1:14,
  Utterance = c("((v: laughs))", # Double parentheses -- transcriber comment
                "(1.620)", # OK -- pause
                "doesn't like it [((silent m & t: leaning forward r hand))]", # Double parentheses -- silent gesture
                "<the glasses?> (.) °or what°", # OK -- micro-pause
                "°°phew°° ok(h)ay", # OK -- within-word laughter
                "just get over it ((name ID08.B))=", # Double parentheses -- anonymization
                "°are our phones¿ (ins-) no°", # OK -- candidate hearing
                "without parentheses", # no parentheses used!
                "like ((silent m: fist pump)", # Double parentheses -- silent gesture --> closing bracket missing
                "(1.620", # --> closing bracket is missing
                "(°SC one?° (1.231) right?", # --> closing bracket is missing
                "[((v: laughs))] (v: gasps))", # Double parentheses -- transcriber comment --> opening bracket is missing
                "°°phew°° okh)ay", # --> opening bracket is missing
                "yep v: yawns)) u:m") # Double parentheses -- transcriber comment --> opening brackets is missing
)
The focus here is on detecting Utterances in which a single round bracket is missing. The difficulty is that Utterance may also contain items wrapped in double round brackets, and those brackets may be missing as well. I try to exclude double round brackets from the detection by using negative lookarounds (items wrapped in double parentheses have name, silent, or v: right after the opening brackets), but with limited success:
library(dplyr)
library(stringr)

df %>%
  mutate(
    # extract `(` and `)`, then paste them together in a string in a new column `delims`:
    delims = lapply(str_extract_all(Utterance, "(?<!\\()\\((?!(name|silent|v:))|(?<!(name|silent|v:)[^)]{1,50}\\))\\)"), paste0, collapse = ""),
    # test whether the number of `()` combinations equals half the number of characters in `delims`:
    valid = str_count(delims, "\\(\\)") == nchar(delims)/2) %>%
  # filter on those rows where the test returns FALSE:
  filter(valid == FALSE)
  id                   Utterance delims valid
1 10                      (1.620      ( FALSE
2 11   (°SC one?° (1.231) right?    (() FALSE
3 12 [((v: laughs))] (v: gasps))    ()) FALSE
4 13             °°phew°° okh)ay      ) FALSE
5 14          yep v: yawns)) u:m      ) FALSE
The expected result is this:
  id                 Utterance delims valid
1 10                    (1.620      ( FALSE
2 11 (°SC one?° (1.231) right?    (() FALSE
3 13           °°phew°° okh)ay      ) FALSE
How can the detection be improved?
CodePudding user response:
If I understand this correctly, you want to completely exclude Utterance strings with double brackets? If so, I think this solves your problem:
df <- data.frame(
  id = 1:14,
  Utterance = c("((v: laughs))", # Double parentheses -- transcriber comment
                "(1.620)", # OK -- pause
                "doesn't like it [((silent m & t: leaning forward r hand))]", # Double parentheses -- silent gesture
                "<the glasses?> (.) °or what°", # OK -- micro-pause
                "°°phew°° ok(h)ay", # OK -- within-word laughter
                "just get over it ((name ID08.B))=", # Double parentheses -- anonymization
                "°are our phones¿ (ins-) no°", # OK -- candidate hearing
                "without parentheses", # no parentheses used!
                "like ((silent m: fist pump)", # Double parentheses -- silent gesture --> closing bracket missing
                "(1.620", # --> closing bracket is missing
                "(°SC one?° (1.231) right?", # --> closing bracket is missing
                "[((v: laughs))] (v: gasps))", # Double parentheses -- transcriber comment --> opening bracket is missing
                "°°phew°° okh)ay", # --> opening bracket is missing
                "yep v: yawns)) u:m") # Double parentheses -- transcriber comment --> opening brackets is missing
)
df
#>    id                                                  Utterance
#>  1  1                                              ((v: laughs))
#>  2  2                                                    (1.620)
#>  3  3 doesn't like it [((silent m & t: leaning forward r hand))]
#>  4  4                               <the glasses?> (.) °or what°
#>  5  5                                           °°phew°° ok(h)ay
#>  6  6                          just get over it ((name ID08.B))=
#>  7  7                                °are our phones¿ (ins-) no°
#>  8  8                                        without parentheses
#>  9  9                                like ((silent m: fist pump)
#> 10 10                                                     (1.620
#> 11 11                                  (°SC one?° (1.231) right?
#> 12 12                                [((v: laughs))] (v: gasps))
#> 13 13                                            °°phew°° okh)ay
#> 14 14                                         yep v: yawns)) u:m
df |>
  dplyr::mutate(
    opening = stringr::str_count(Utterance, "\\("),
    closing = stringr::str_count(Utterance, "\\)"),
    double_br = stringr::str_detect(Utterance, "\\({2}|\\){2}")
  ) |>
  dplyr::filter(opening != closing, double_br == FALSE)
#>   id                 Utterance opening closing double_br
#> 1 10                    (1.620       1       0     FALSE
#> 2 11 (°SC one?° (1.231) right?       2       1     FALSE
#> 3 13           °°phew°° okh)ay       0       1     FALSE
Created on 2022-02-04 by the reprex package (v2.0.1)
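One caveat with comparing counts: it cannot tell ")(" from "()", because both have one opening and one closing bracket. A minimal sketch of an order-aware check in base R follows. `parens_balanced` is a hypothetical helper written for illustration, not a function from dplyr or stringr, and it looks only at single characters, so double-bracket comments would still need to be excluded or removed first.

```r
# Hypothetical helper (an assumption, not part of the answer above):
# TRUE when every ")" closes an earlier "(" and nothing is left open.
parens_balanced <- function(x) {
  vapply(x, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    # +1 for "(", -1 for ")", 0 for every other character
    steps <- (chars == "(") - (chars == ")")
    # running nesting depth after each character
    depth <- cumsum(steps)
    # depth must never go negative and must end at zero
    all(depth >= 0) && sum(steps) == 0
  }, logical(1), USE.NAMES = FALSE)
}

parens_balanced(c("(1.620)", "(1.620", "okh)ay", ")("))
#> [1]  TRUE FALSE FALSE FALSE
```

Inside the pipeline above, `!parens_balanced(Utterance)` could stand in for `opening != closing` if bracket order matters for your data.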
CodePudding user response:
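Using base R, you could try the following. Note that the output below includes an extra row 15 that is not part of the df in the question; it was presumably added as a further test case: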
transform(df, v = grepl('^(?=.*[()])(?!([^()]*)[(](?1)[)](?1))',
                        gsub('[(]{,2}(name|silent|v:)[^()]+[)]+', '', Utterance),
                        perl = TRUE))
   id                                                  Utterance     v
 1  1                                              ((v: laughs)) FALSE
 2  2                                                    (1.620) FALSE
 3  3 doesn't like it [((silent m & t: leaning forward r hand))] FALSE
 4  4                               <the glasses?> (.) °or what° FALSE
 5  5                                           °°phew°° ok(h)ay FALSE
 6  6                          just get over it ((name ID08.B))= FALSE
 7  7                                °are our phones¿ (ins-) no° FALSE
 8  8                                        without parentheses FALSE
 9  9                                like ((silent m: fist pump) FALSE
10 10                                                     (1.620  TRUE
11 11                                  (°SC one?° (1.231) right?  TRUE
12 12                                [((v: laughs))] (v: gasps)) FALSE
13 13                                            °°phew°° okh)ay  TRUE
14 14                                         yep v: yawns)) u:m FALSE
15 15                <the glasses?> (. °or what° ((name ID07.A))  TRUE
EDIT: to keep only the flagged rows:
df |>
  transform(v = grepl('^(?=.*[()])(?!([^()]*)[(](?1)[)](?1))',
                      gsub('[(]{,2}(name|silent|v:)[^()]+[)]+', '', Utterance),
                      perl = TRUE)) |>
  subset(v)
   id                                   Utterance    v
10 10                                      (1.620 TRUE
11 11                   (°SC one?° (1.231) right? TRUE
13 13                             °°phew°° okh)ay TRUE
15 15 <the glasses?> (. °or what° ((name ID07.A)) TRUE
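To make the two steps of the one-liner easier to follow, here is a sketch that runs them separately on a few strings. The input vector `x` is an illustrative, ASCII-only subset of the question's data, and `{0,2}` is written out in place of `{,2}`; that substitution is my assumption for portability, not something the answer requires.

```r
# Illustrative input: a small subset of the question's data
x <- c("((v: laughs))", "(1.620", "like ((silent m: fist pump)",
       "phew okh)ay", "ok(h)ay")

# Step 1: remove transcriber comments, i.e. up to two "(", then
# name/silent/v:, then non-bracket text, then one or more ")"
stripped <- gsub("[(]{0,2}(name|silent|v:)[^()]+[)]+", "", x)

# Step 2: flag strings that contain a bracket but do not begin with
# "text ( text ) text"; (?1) reuses the sub-pattern of group 1
flagged <- grepl("^(?=.*[()])(?!([^()]*)[(](?1)[)](?1))",
                 stripped, perl = TRUE)

data.frame(x, stripped, flagged)
```

After step 1 the comment-only strings contain no brackets at all, so the first lookahead already rules them out. As far as I can tell, the second lookahead is not anchored at the end of the string, so an utterance such as "(a) b)" would not be flagged; whether that matters depends on the data.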
