Filter rows where mirror-image delimiters are not paired

Time:01-22

I have transcriptions of speech with "mirror-image" delimiters, i.e., paired symbols marking opening and, respectively, closing, such as ( and ) or < and >. The delimiter in this data is the square bracket:

df <- data.frame(
  id = 1:9,
  Utterance = c("[but if I came !ho!me",                         # <- closing square bracket is missing
                "=[ye::a:h]",                                    # OK!
                "=[yeah] I mean [does it",                       # <- closing square bracket is missing
                "bu[t if (.) you know",                          # <- closing square bracket is missing
                "=ye::a:h]",                                     # <- opening square bracket is missing
                "[that's right] YEAH (laughs)] [ye::a:h]",        # <- opening square bracket is missing
                "cos I've [heard] very sketchy stories",         # OK!
                "[cos] I've [heard very sketchy [stories]",      # <- closing square bracket is missing 
                "oh well] that's great"                          # <- opening square bracket is missing       
))

I want to filter those rows where at least one opening or at least one closing delimiter is missing (as this represents a transcription error). I'm actually doing fine with this str_count method:

library(stringr)
library(dplyr)
df %>% 
   filter(str_count(Utterance, "\\[|\\]") %in% c(1,3,5,7,9))
  id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great

but was wondering whether regexes could be devised to detect the strings with missing elements directly. I've been trying this regex for missing opening brackets:

p_op <- "(?<!.{0,10}\\[.{0,10})\\].*$"       
df %>%
  filter(str_detect(Utterance, p_op))

which works well, and this for missing closing brackets, which does not capture all matches:

p_cl<- "\\[(?!.*\\]).*$"    
df %>%
  filter(str_detect(Utterance, p_cl))

How can the pattern or the patterns be formulated better?

CodePudding user response:

Instead of using %in% with a vector of odd numbers, we can take the count modulo 2 with %%:

library(dplyr)
library(stringr)
df %>% 
    filter(as.logical(str_count(Utterance, "\\[|\\]") %% 2))

-output

 id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great
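
One caveat worth keeping in mind: a parity check only flags rows whose total bracket count is odd. The base R sketch below shows a string with one stray closing and one stray opening bracket passing the check. The helper name count_brackets and the third test string are made up for illustration and are not part of the question's data.

```r
# Hypothetical helper: count all "[" and "]" in each string (base R only)
count_brackets <- function(x) {
  lengths(regmatches(x, gregexpr("\\[|\\]", x)))
}

x <- c("=[ye::a:h]",               # paired
       "=ye::a:h]",                # opening bracket missing
       "oh well] that's [great")   # made-up string: two unpaired brackets

n <- count_brackets(x)
is_odd <- n %% 2 == 1              # TRUE means the parity check flags the row

data.frame(x, n, is_odd)
```

If rows like the third one can occur in the data, an order-aware check is the safer choice.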

Or we may use the pattern \\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\] in str_detect:

df %>%
   filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
  id                                Utterance
1  1                    [but if I came !ho!me
2  3                  =[yeah] I mean [does it
3  4                     bu[t if (.) you know
4  5                                =ye::a:h]
5  6  [that's right] YEAH (laughs)] [ye::a:h]
6  8 [cos] I've [heard very sketchy [stories]
7  9                    oh well] that's great

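Here we check for an opening bracket [ followed by one or more characters that are not ], followed by another [ or the end of the string ($). The second alternative is the mirror image for the closing bracket: the start of the string (^) or a ], followed by one or more characters that are not [, followed by a ].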
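
To try the two alternatives without loading any packages, here is a minimal base R sketch. It assumes the pattern uses + quantifiers, as the "one or more characters" wording implies, and it uses grepl with perl = TRUE instead of str_detect. The variable names are only illustrative.

```r
# The nine utterances from the question
utterances <- c("[but if I came !ho!me",
                "=[ye::a:h]",
                "=[yeah] I mean [does it",
                "bu[t if (.) you know",
                "=ye::a:h]",
                "[that's right] YEAH (laughs)] [ye::a:h]",
                "cos I've [heard] very sketchy stories",
                "[cos] I've [heard very sketchy [stories]",
                "oh well] that's great")

# First alternative:  "[" + non-"]" characters + ("[" or end of string)
# Second alternative: (start of string or "]") + non-"[" characters + "]"
pattern <- "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"

flagged <- grepl(pattern, utterances, perl = TRUE)
which(flagged)
#> [1] 1 3 4 5 6 8 9
```

The flagged positions correspond to the ids shown in the filtered output above.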

CodePudding user response:

Another possible solution, using purrr::map_dfr.

EXPLANATION

I provide, in what follows, an explanation for my solution, as asked for by @ChrisRuehlemann:

  1. With str_extract_all(df$Utterance, "\\[|\\]"), we extract all [ and ] of each utterance as a list and according to the order they appear in the utterance.

  2. We iterate over the list elements created in the previous step, one per utterance. Each element is a character vector of square brackets, so we first collapse it into a single string (str_c(.x, collapse = "")).

  3. We compare the string of square brackets of each utterance with a string like the following [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If these two strings are not equal, then square brackets are missing!

  4. When map_dfr finishes, we end up with a column of TRUE and FALSE, which we can use to filter the original dataframe as wanted.

library(tidyverse)    

str_extract_all(df$Utterance, "\\[|\\]") %>% 
  map_dfr(~ list(OK = str_c(.x, collapse = "") != 
            str_c(rep("[]", length(.x)/2), collapse = ""))) %>% 
  filter(df,.)

#>   id                                Utterance
#> 1  1                    [but if I came !ho!me
#> 2  3                  =[yeah] I mean [does it
#> 3  4                     bu[t if (.) you know
#> 4  5                                =ye::a:h]
#> 5  6  [that's right] YEAH (laughs)] [ye::a:h]
#> 6  8 [cos] I've [heard very sketchy [stories]
#> 7  9                    oh well] that's great
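
To make the comparison in step 3 visible, the following base R sketch prints the two strings side by side for three of the utterances. It is an illustration of the same idea using regmatches and strrep rather than the tidyverse functions, and the names skeleton and expected are invented here.

```r
utterances <- c("=[ye::a:h]",
                "[that's right] YEAH (laughs)] [ye::a:h]",
                "[cos] I've [heard very sketchy [stories]")

# Step 1: extract the brackets of each utterance, in order of appearance
brackets <- regmatches(utterances, gregexpr("\\[|\\]", utterances))

# Step 2: collapse each vector of brackets into a single string
skeleton <- vapply(brackets, paste, character(1), collapse = "")

# Step 3: build the reference string "[][]..." with length %/% 2 pairs
expected <- vapply(brackets,
                   function(b) strrep("[]", length(b) %/% 2),
                   character(1))

data.frame(skeleton, expected, mismatch = skeleton != expected)
```

Note that this comparison treats properly nested brackets such as "[[a]]" as a mismatch too, which is fine as long as the transcription scheme does not nest them.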

CodePudding user response:

If you need a function to validate (possibly nested) delimiters, here is a stack-based one.

valid_delim <- function(x, delim = c(open = "[", close = "]"), max_stack_size = 10L){
  f <- function(x, delim, max_stack_size){
    if(is.null(names(delim))) {
      names(delim) <- c("open", "close")
    }
    if(nchar(x) > 0L){
      valid <- TRUE
      stack <- character(max_stack_size)
      i_stack <- 0L
      y <- unlist(strsplit(x, ""))
      for(i in seq_along(y)){
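        if(y[i] == delim["open"]){
          # push the expected closing delimiter onto the stack
          i_stack <- i_stack + 1L
          stack[i_stack] <- delim["close"]
        } else if(y[i] == delim["close"]) {
          # a closer is valid only if the stack is non-empty; test that first
          valid <- (i_stack > 0L) && (stack[i_stack] == delim["close"])
          if(valid)
            i_stack <- i_stack - 1L
          else break
        }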
      }
      valid && (i_stack == 0L)
    } else NULL
  }
  x <- as.character(x)
  y <- sapply(x, f, delim = delim, max_stack_size = max_stack_size)
  unname(y)
}

library(dplyr)

valid_delim(df$Utterance)
#[1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

df %>% filter(valid_delim(Utterance))
#  id                             Utterance
#1  2                            =[ye::a:h]
#2  7 cos I've [heard] very sketchy stories
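
When only a single pair of delimiters is involved, the stack can be reduced to a running depth counter. The sketch below is not the function above but a simplified variant of the same idea, with a hypothetical name (balanced): it adds 1 for every opening delimiter, subtracts 1 for every closing one, and requires that the depth never drops below zero and ends at zero. Unlike valid_delim, it treats an empty string as balanced.

```r
# Hypothetical helper: depth-counter check for one delimiter pair
balanced <- function(x, open = "[", close = "]") {
  vapply(x, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    # running nesting depth after each character
    depth <- cumsum((chars == open) - (chars == close))
    # empty input counts as balanced; otherwise never negative and ends at 0
    length(depth) == 0L ||
      (all(depth >= 0L) && depth[length(depth)] == 0L)
  }, logical(1), USE.NAMES = FALSE)
}

res <- balanced(c("=[ye::a:h]",   # paired
                  "=ye::a:h]",    # opening bracket missing
                  "[a [b] c]",    # nested but balanced
                  "] ["))         # even count, wrong order
res
```

Negating the result, as in filter(!balanced(Utterance)), would keep the rows with unpaired brackets.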