Efficient way of sequentially detecting if certain strings occur at the beginning of other strings-CodePudding

Assume the following data:

df <- data.frame(id   = c(1:8),
                 text = c("i like", "i like to", "oops", "i like to and", "i like it not", "victoria", "victoria secret", "victoria secret is"))

What I want to do is to:

figure out if any given shorter string is part (more precisely) is the beginning of another string
apply a minimum character length, e.g. the string has to have at least X chars (let's set it to 5 for the example)
add the info to the data set which strings belong together.

My thinking is that I could sort my data frame based on the text response and the length of the text response and then check if the first string is part of any subsequent string, then I continue with the second string and check if it is part of the subsequent ones and so on. This is a computational nightmare, so I'm wondering if there's a computational more efficient way. I just thought that maybe first splitting up into words could make sense and then compare based on that? (comparing on full words would be fine, don't need to compare by every single character)

In addition, the problem could be that any longer response could be part of potentially all previous responses, meaning that the information that needs to be stored could potentially require n-1 columns (or lists of that length).

Just to put it into perspective: my real-life data has ~100.000 rows.

Here's how I could envision a potential expected output:

  id                text group_1 group_2 group_3
1  1              i like       1       1       0
2  2           i like to       1       0       0
3  3                oops       0       0       0
4  4       i like to and       1       0       0
5  5       i like it not       0       1       0
6  6            victoria       0       0       1
7  7     victoria secret       0       0       1
8  8  victoria secret is       0       0       1

Note, I only need a column if a certain strings qualifies for at least two rows. So in this case I don#t want/need to add a group variable for the "oops" text.

Texts 1, 2, 4 belong together because they all start with "i like" and they build up "sequentially", i.e. the second text is also part of the fourth text.
Rows 1 and 5 also belong toegther because text 1 is part of text 5.
Rows 6-8 also belong together for the same reason as texts 1, 2, 4 belong together (they build up on each other).

Alternatively, as a first step I could also work with an output that just gives me the information if a certain text is part of another text, so in the example just assign a 1 to all texts but "oops".

CodePudding user response：

You can optimize by doing everything in C/C . This Rcpp implementation might get you most of the way there, if you really only have 1e 05 strings.

Given a sorted character vector x and a positive integer nchar_min, match_start returns a list l of length length(x) such that l[[i]] is an integer vector indexing the strings in x that start with x[i], provided that nchar(x[i]) is at least nchar_min (otherwise l[[i]] is an integer vector of length 0).

It isn't optimal due to the inefficiency of storing all of the index vectors, when all you need is a recursive list or tree. I might be worried about tree depth, if you think think that complete nesting of your strings is a real possibility. (Edit: R supports tree depths up to 5e 05 via options(expressions=), so recursion might be fine after all.)

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
#include <Rinternals.h>
// [[Rcpp::export]]
Rcpp::List match_start(Rcpp::CharacterVector x, int nchar_min = 1) {
  if (nchar_min < 1) {
    Rcpp::stop("\'nchar_min\' must be positive");
  }
  int n = x.length(), nchar_si, k;
  Rcpp::List res(n);
  for (int i = 0; i < n;   i) {
    const char* si = CHAR(x[i]);
    nchar_si = LENGTH(x[i]);
    Rcpp::IntegerVector el(0);
    if (nchar_si >= nchar_min) {
      el.push_back(i   1);
      for (int j = i   1; j < n;   j) {
        if (LENGTH(x[j]) < nchar_si) {
          break;
        }
        const char* sj = CHAR(x[j]);
        k = 0;
        while (k < nchar_si && si[k] == sj[k]) {
            k;
        }
        if (k == nchar_si) {
          el.push_back(j   1);
        }
      }
    }
    res[i] = el;
  }
  return res;
}
')

o <- order(df$text)
l <- match_start(df$text[o], nchar_min = 5L)
df$id_match[o] <- lapply(l, function(i) df$id[o][i])
df

  id               text   id_match
1  1             i like 1, 5, 2, 4
2  2          i like to       2, 4
3  3               oops           
4  4      i like to and          4
5  5      i like it not          5
6  6           victoria    6, 7, 8
7  7    victoria secret       7, 8
8  8 victoria secret is          8

You can optimize further if you have an approximation of the mean number of matches for a string in your data set. For simplicity, I have used push_back to increment the length of index vectors by one each time a match is found, but that is going to be slow if the mean number of matches is large. You can test on subsets of your data and report back...

CodePudding user response：

A couple of thoughts (have to be checked for runtime on a big data set).

gt5 are strings with the required length, should be straight forward.
occur are strings that occur more than once. You can leave it out, but I thought it may also be nice to know. Might be expensive.
gr is grouping by row number of occurrence of the largest common denominator. Only works if text is ordered by length of strings (either strictly, or, like in your case, by group-set). Uses agrep to do the fuzzy matching. (I have no idea how performant it is on large data sets). Also, you can replace agrep with grep(substr(x,1,8),substr(text,1,8)) if you really only care about a 100% match of the first, say 8 characters of a string. (Definition of beginning of a sentence)

library(dplyr)

df %>% 
  mutate(gt5=nchar(text)>5, 
         occur=rowSums(sapply(unique(unlist(strsplit(text, " "))), function(x) 
           grepl(x,text)))>1, 
         gr=sapply(text, function(x) max(grep(substr(x,1,8),substr(text,1,8)))) )
  id               text   gt5 occur gr
1  1             i like  TRUE  TRUE  4
2  2          i like to  TRUE  TRUE  4
3  3               oops FALSE FALSE  3
4  4      i like to and  TRUE  TRUE  4
5  5      not i like is  TRUE  TRUE  5
6  6           victoria  TRUE  TRUE  8
7  7    victoria secret  TRUE  TRUE  8
8  8 victoria secret is  TRUE  TRUE  8

Note: Assigning 1 - 5 or 1 - 4 is a decision problem when only allowing one group. On the other hand, allowing more than one group may explode the number of total groups.