I have transcriptions of speech with "mirror-image" delimiters, i.e., paired symbols marking opening and, respectively, closing, such as ( and ) or < and >. The delimiter in this data is the square bracket:
df <- data.frame(
id = 1:9,
Utterance = c("[but if I came !ho!me", # <- closing square bracket is missing
"=[ye::a:h]", # OK!
"=[yeah] I mean [does it", # <- closing square bracket is missing
"bu[t if (.) you know", # <- closing square bracket is missing
"=ye::a:h]", # <- opening square bracket is missing
"[that's right] YEAH (laughs)] [ye::a:h]", # <- opening square bracket is missing
"cos I've [heard] very sketchy stories", # OK!
"[cos] I've [heard very sketchy [stories]", # <- closing square bracket is missing
"oh well] that's great" # <- opening square bracket is missing
))
I want to filter those rows where at least one opening or at least one closing delimiter is missing (as this represents a transcription error).
I'm actually doing fine with this str_count method:
library(stringr)
library(dplyr)
df %>%
filter(str_count(Utterance, "\\[|\\]") %in% c(1,3,5,7,9))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
but I was wondering whether regexes could be devised to detect the strings with missing elements directly. I've been trying this regex, for missing opening brackets:
p_op <- "(?<!.{0,10}\\[.{0,10})\\].*$"
df %>%
filter(str_detect(Utterance, p_op))
which works well, and this for missing closing brackets, which does not capture all matches:
p_cl<- "\\[(?!.*\\]).*$"
df %>%
filter(str_detect(Utterance, p_cl))
How can the pattern or the patterns be formulated better?
CodePudding user response:
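Instead of matching the count against a vector of odd numbers with %in%, we can take the count modulo 2 with %%; a remainder of 1 means the number of brackets is odd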
library(dplyr)
library(stringr)
df %>%
filter(as.logical(str_count(Utterance, "\\[|\\]")%% 2))
-output
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
Or we may use the pattern \\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\] in str_detect
df %>%
filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
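Here we check for an opening bracket [ followed by one or more characters that are not ], followed by a [ or the end of the string ($), or for the mirror-image pattern for the closing bracket: the start of the string (^) or a ], followed by one or more characters that are not [, followed by a ].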
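Since the question asks for one pattern per kind of error, the two alternatives described above can also be kept apart. What follows is only a sketch under some assumptions: the names utt, p_missing_close and p_missing_open are made up for illustration, brackets are assumed never to be nested, and the character classes exclude both brackets and use * instead of +, which is a small variation on the pattern above. It uses base R grepl() with perl = TRUE to stay free of dependencies; the same patterns should also work in str_detect().

```r
# Sample data copied from the question (positions correspond to ids 1 to 9).
utt <- c("[but if I came !ho!me",
         "=[ye::a:h]",
         "=[yeah] I mean [does it",
         "bu[t if (.) you know",
         "=ye::a:h]",
         "[that's right] YEAH (laughs)] [ye::a:h]",
         "cos I've [heard] very sketchy stories",
         "[cos] I've [heard very sketchy [stories]",
         "oh well] that's great")

# Hypothetical pattern names; both assume brackets are never nested.
# "[" followed by non-bracket characters, then another "[" or the end of the string
p_missing_close <- "\\[[^\\[\\]]*(\\[|$)"
# start of string or "]" followed by non-bracket characters, then "]"
p_missing_open <- "(^|\\])[^\\[\\]]*\\]"

missing_close <- grepl(p_missing_close, utt, perl = TRUE)
missing_open  <- grepl(p_missing_open,  utt, perl = TRUE)

which(missing_close)  # 1 3 4 8: a closing bracket is missing
which(missing_open)   # 5 6 9: an opening bracket is missing
```

Taken together, the two logical vectors flag the same seven rows as the combined pattern, while also telling which kind of bracket is missing.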
CodePudding user response:
Another possible solution, using purrr::map_dfr.
EXPLANATION
I provide, in what follows, an explanation for my solution, as asked for by @ChrisRuehlemann:
With str_extract_all(df$Utterance, "\\[|\\]"), we extract all [ and ] of each utterance as a list, in the order they appear in the utterance.
We iterate over all the lists created previously for the utterances. However, each one is a list of square brackets, so we first need to collapse it into a single string of square brackets (str_c(.x, collapse = "")).
We compare the string of square brackets of each utterance with a string like [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If these two strings are not equal, then square brackets are missing!
When map_dfr finishes, we end up with a column of TRUE and FALSE, which we can use to filter the original dataframe as wanted.
library(tidyverse)
str_extract_all(df$Utterance, "\\[|\\]") %>%
map_dfr(~ list(OK = str_c(.x, collapse = "") !=
str_c(rep("[]", length(.x)/2), collapse = ""))) %>%
filter(df,.)
#> id Utterance
#> 1 1 [but if I came !ho!me
#> 2 3 =[yeah] I mean [does it
#> 3 4 bu[t if (.) you know
#> 4 5 =ye::a:h]
#> 5 6 [that's right] YEAH (laughs)] [ye::a:h]
#> 6 8 [cos] I've [heard very sketchy [stories]
#> 7 9 oh well] that's great
CodePudding user response:
If you need a function to validate (possibly nested) delimiters, here is a stack-based one. It returns TRUE for strings whose delimiters are balanced.
valid_delim <- function(x, delim = c(open = "[", close = "]"), max_stack_size = 10L){
f <- function(x, delim, max_stack_size){
if(is.null(names(delim))) {
names(delim) <- c("open", "close")
}
if(nchar(x) > 0L){
valid <- TRUE
stack <- character(max_stack_size)
i_stack <- 0L
y <- unlist(strsplit(x, ""))
for(i in seq_along(y)){
if(y[i] == delim["open"]){
i_stack <- i_stack + 1L
stack[i_stack] <- delim["close"]
} else if(y[i] == delim["close"]) {
# check the stack depth first so that stack[0] is never compared
valid <- (i_stack > 0L) && (stack[i_stack] == delim["close"])
if(valid)
i_stack <- i_stack - 1L
else break
}
}
valid && (i_stack == 0L)
} else NULL
}
x <- as.character(x)
y <- sapply(x, f, delim = delim, max_stack_size = max_stack_size)
unname(y)
}
library(dplyr)
valid_delim(df$Utterance)
#[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
df %>% filter(valid_delim(Utterance))
# id Utterance
#1 2 =[ye::a:h]
#2 7 cos I've [heard] very sketchy stories
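When only one delimiter pair is involved, as in this data, the stack can be reduced to a running depth counter. The following is a minimal sketch of that idea rather than a replacement for the function above: balanced is a hypothetical name, and unlike valid_delim it returns TRUE (not NULL) for an empty string.

```r
# Hypothetical helper: TRUE when every closing delimiter has an earlier
# matching opening delimiter and no opening delimiter is left unclosed.
balanced <- function(x, open = "[", close = "]") {
  vapply(strsplit(as.character(x), "", fixed = TRUE), function(chars) {
    # +1 for each opening delimiter, -1 for each closing one
    depth <- cumsum((chars == open) - (chars == close))
    n <- length(depth)
    # the depth must never drop below zero and must end at zero
    n == 0L || (all(depth >= 0L) && depth[n] == 0L)
  }, logical(1))
}

balanced(c("=[ye::a:h]", "oh well] that's great", "[cos] I've [heard"))
#> [1]  TRUE FALSE FALSE
```

Negating the result, as in df %>% filter(!balanced(Utterance)), would keep the rows with transcription errors instead of the valid ones.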
