Detecting missing single parentheses in the presence of potentially missing double parentheses

I have partly faulty transcriptions, which I want to clean up:

df <- data.frame(
  id = 1:14,
  Utterance = c("((v: laughs))",                                                # Double parentheses -- transcriber comment
                "(1.620)",                                                      # OK -- pause
                "doesn't like it [((silent m & t: leaning forward r hand))]",   # Double parentheses -- silent gesture
                "<the glasses?> (.) °or what°",                                 # OK -- micro-pause
                "°°phew°° ok(h)ay",                                             # OK -- within-word laughter
                "just get over it ((name ID08.B))=",                            # Double parentheses -- anonymization
                "°are our phones¿ (ins-) no°",                                  # OK -- candidate hearing
                "without parentheses",                                          # no parentheses used!
                "like ((silent m: fist pump)",                                  # Double parentheses -- silent gesture --> closing bracket missing
                "(1.620",                                                       # --> closing bracket is missing
                "(°SC one?° (1.231) right?",                                    # --> closing bracket is missing
                "[((v: laughs))] (v: gasps))",                                  # Double parentheses -- transcriber comment --> opening bracket is missing
                "°°phew°° okh)ay",                                              # --> opening bracket is missing
                "yep v: yawns)) u:m")                                           # Double parentheses -- transcriber comment --> opening brackets is missing
)

The goal is to detect the Utterance values in which a single round bracket is missing. The difficulty is that Utterance may also contain items wrapped in double round brackets, and those brackets may be missing as well. I tried to exclude double round brackets from the detection by using negative lookarounds (items wrapped in double parentheses have name or silent or v: right after the opening bracket), but with limited success:

library(dplyr)
library(stringr)

df %>%
  mutate(
    # extract `(` and `)`, then paste them together in a string in a new column `delims`:
    delims = lapply(str_extract_all(Utterance, "(?<!\\()\\((?!(name|silent|v:))|(?<!(name|silent|v:)[^)]{1,50}\\))\\)"), paste0, collapse = ""),
    # test whether the number of `()` combinations equals half the number of characters in `delims`:
    valid = str_count(delims, "\\(\\)") == nchar(delims)/2) %>%
  # filter on those rows where the test returns FALSE:
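  filter(valid == FALSE)

This returns two rows too many (ids 12 and 14, where the missing bracket belongs to a double-bracket item):
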
  id                   Utterance delims valid
1 10                      (1.620      ( FALSE
2 11   (°SC one?° (1.231) right?    (() FALSE
3 12 [((v: laughs))] (v: gasps))    ()) FALSE
4 13             °°phew°° okh)ay      ) FALSE
5 14          yep v: yawns)) u:m      ) FALSE

The expected result is this:

  id                   Utterance delims valid
1 10                      (1.620      ( FALSE
2 11   (°SC one?° (1.231) right?    (() FALSE
3 13             °°phew°° okh)ay      ) FALSE

How can the detection be improved?

CodePudding user response：

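If I understand correctly, you want to exclude Utterance strings that contain double brackets entirely? If so, I think this solves your problem: count the opening and closing brackets in each string, then keep the rows where the counts differ and no double bracket occurs.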

df <- data.frame(
  id = 1:14,
  Utterance = c("((v: laughs))",                                                # Double parentheses -- transcriber comment
                "(1.620)",                                                      # OK -- pause
                "doesn't like it [((silent m & t: leaning forward r hand))]",   # Double parentheses -- silent gesture
                "<the glasses?> (.) °or what°",                                 # OK -- micro-pause
                "°°phew°° ok(h)ay",                                             # OK -- within-word laughter
                "just get over it ((name ID08.B))=",                            # Double parentheses -- anonymization
                "°are our phones¿ (ins-) no°",                                  # OK -- candidate hearing
                "without parentheses",                                          # no parentheses used!
                "like ((silent m: fist pump)",                                  # Double parentheses -- silent gesture --> closing bracket missing
                "(1.620",                                                       # --> closing bracket is missing
                "(°SC one?° (1.231) right?",                                    # --> closing bracket is missing
                "[((v: laughs))] (v: gasps))",                                  # Double parentheses -- transcriber comment --> opening bracket is missing
                "°°phew°° okh)ay",                                              # --> opening bracket is missing
                "yep v: yawns)) u:m")                                           # Double parentheses -- transcriber comment --> opening brackets is missing
)

df
#>    id                                                  Utterance
#> 1   1                                              ((v: laughs))
#> 2   2                                                    (1.620)
#> 3   3 doesn't like it [((silent m & t: leaning forward r hand))]
#> 4   4                               <the glasses?> (.) °or what°
#> 5   5                                           °°phew°° ok(h)ay
#> 6   6                          just get over it ((name ID08.B))=
#> 7   7                                °are our phones¿ (ins-) no°
#> 8   8                                        without parentheses
#> 9   9                                like ((silent m: fist pump)
#> 10 10                                                     (1.620
#> 11 11                                  (°SC one?° (1.231) right?
#> 12 12                                [((v: laughs))] (v: gasps))
#> 13 13                                            °°phew°° okh)ay
#> 14 14                                         yep v: yawns)) u:m

df |> 
  dplyr::mutate(
    opening = stringr::str_count(Utterance, "\\("),
    closing = stringr::str_count(Utterance, "\\)"),
    double_br = stringr::str_detect(Utterance, "\\({2}|\\){2}")
  ) |> 
  dplyr::filter(opening != closing, double_br == FALSE)
#>   id                 Utterance opening closing double_br
#> 1 10                    (1.620       1       0     FALSE
#> 2 11 (°SC one?° (1.231) right?       2       1     FALSE
#> 3 13           °°phew°° okh)ay       0       1     FALSE

Created on 2022-02-04 by the reprex package (v2.0.1)
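
One limitation of comparing counts is that it ignores order: a string such as "ok)h(ay" has one opening and one closing bracket and would pass. Below is a minimal base R sketch of an order-aware check, offered as a possible refinement rather than part of the answer above. The helper name is_balanced and the sample strings are made up for illustration, and the sketch assumes strings with double-bracket items have already been filtered out.

```r
# Hypothetical helper: TRUE if the parentheses in `s` are balanced and
# correctly ordered (running depth never drops below zero and ends at zero).
is_balanced <- function(s) {
  chars <- strsplit(s, "")[[1]]
  step  <- (chars == "(") - (chars == ")")  # +1 for "(", -1 for ")", 0 otherwise
  depth <- cumsum(step)
  all(depth >= 0) && sum(step) == 0
}

# Made-up sample strings; the last one has equal counts but wrong order.
x <- c("(1.620)", "(1.620", "okh)ay", "ok)h(ay")
res <- vapply(x, is_balanced, logical(1), USE.NAMES = FALSE)
res
# TRUE FALSE FALSE FALSE
```

Rows where is_balanced() returns FALSE would be the ones to inspect; the count-based filter above would miss the last sample.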

CodePudding user response：

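Using base R, you could try the following. The gsub() call first removes the double-bracket items, even when some of their brackets are missing; grepl() then flags the strings that still contain a parenthesis but do not begin with a properly closed (...) pair. Row 15 in the output is an extra test case that is not in the original df.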

transform(df, v = grepl('^(?=.*[()])(?!([^()]*)[(](?1)[)](?1))', 
                      gsub('[(]{,2}(name|silent|v:)[^()]+[)]+', '', Utterance),
                      perl = TRUE))

  id                                                  Utterance     v
1   1                                              ((v: laughs)) FALSE
2   2                                                    (1.620) FALSE
3   3 doesn't like it [((silent m & t: leaning forward r hand))] FALSE
4   4                               <the glasses?> (.) °or what° FALSE
5   5                                           °°phew°° ok(h)ay FALSE
6   6                          just get over it ((name ID08.B))= FALSE
7   7                                °are our phones¿ (ins-) no° FALSE
8   8                                        without parentheses FALSE
9   9                                like ((silent m: fist pump) FALSE
10 10                                                     (1.620  TRUE
11 11                                  (°SC one?° (1.231) right?  TRUE
12 12                                [((v: laughs))] (v: gasps)) FALSE
13 13                                            °°phew°° okh)ay  TRUE
14 14                                         yep v: yawns)) u:m FALSE
15 15                <the glasses?> (. °or what° ((name ID07.A))  TRUE

EDIT: to keep only the flagged rows, pipe the result into subset():

df |>
   transform(v = grepl('^(?=.*[()])(?!([^()]*)[(](?1)[)](?1))', 
                       gsub('[(]{,2}(name|silent|v:)[^()]+[)]+', '', Utterance),
                       perl = TRUE)) |>
   subset(v)
   id                                   Utterance    v
10 10                                      (1.620 TRUE
11 11                   (°SC one?° (1.231) right? TRUE
13 13                             °°phew°° okh)ay TRUE
15 15 <the glasses?> (. °or what° ((name ID07.A)) TRUE
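
To make the two stages of this approach easier to follow, here is a small sketch that runs them separately on a few strings from the question. It assumes the quantifiers in the gsub() pattern are "+" (one or more) and writes the optional opening brackets as {0,2}; the variable names stripped and flag are only for illustration.

```r
x <- c("like ((silent m: fist pump)",
       "[((v: laughs))] (v: gasps))",
       "yep v: yawns)) u:m",
       "(1.620")

# Stage 1: delete anything that looks like a double-bracket item
# (optional "(" or "((", then name/silent/v:, the item text, then one or more ")").
stripped <- gsub("[(]{0,2}(name|silent|v:)[^()]+[)]+", "", x)
stripped
# "like ", "[] ", "yep  u:m", "(1.620"

# Stage 2: flag strings that still contain a parenthesis but do not begin
# with a properly closed "(...)" pair; (?1) reuses the sub-pattern [^()]*.
flag <- grepl("^(?=.*[()])(?!([^()]*)[(](?1)[)](?1))", stripped, perl = TRUE)
flag
# FALSE FALSE FALSE TRUE
```

After stage 1 only genuine single-bracket material is left, so the first three strings no longer contain any parenthesis and are not flagged.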