Count words in each cell of a dataframe in R

I have a dataframe that looks like

df <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"), 
                     Variable1 = c("word1, word2", "word1", "word1"), 
                     Variable2 = c("word1", "word1, word2", "word1"), 
                     Variable3 = c("word1, word2", "word1", "word1, word2, word3")), 
                     row.names = c(NA, -3L), class = "data.frame")

and would like to create a data frame that counts the words in each cell (separated by ",") and stores that count in the cell:

df2 <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"), 
                     Variable1 = c("2", "1", "1"), 
                     Variable2 = c("1", "2", "1"), 
                     Variable3 = c("2", "1", "3")), 
                     row.names = c(NA, -3L), class = "data.frame")

Could someone show me how this can be done?

Thanks!

CodePudding user response:

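Using dplyr and stringi:

library(dplyr)

# matches() ignores case by default, so "variable" matches Variable1..Variable3
df %>% 
  mutate(across(matches("variable\\d{1,}"), stringi::stri_count_words))

Output

  Variable Variable1 Variable2 Variable3
1  Factor1         2         1         2
2  Factor2         1         2         1
3  Factor3         1         1         3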

CodePudding user response:

If you want a base R solution, you could try the following. For each cell, take the number of characters with nchar() and subtract the number of characters that remain after removing the commas. The difference is the number of commas, and adding 1 gives the number of words or phrases separated by commas. This should be fast too (also see this answer).

cbind(df[1], t(apply(df[-1], 1, \(x) {
  # number of commas in each cell, plus 1, gives the number of items
  nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1
})))

Output

  Variable Variable1 Variable2 Variable3
1  Factor1         2         1         2
2  Factor2         1         2         1
3  Factor3         1         1         3
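
The character-count arithmetic from this answer can be checked on a single string, and since nchar() and gsub() are both vectorized, it can probably also be applied column by column instead of row by row. The following is only a sketch under that assumption. The helper name count_items is made up for illustration and is not part of the original answer, and the data frame is rebuilt from the question so the snippet runs on its own.

```r
# Hypothetical helper (name chosen for illustration): count comma-separated
# items in each element of a character vector as "number of commas + 1".
count_items <- function(x) {
  nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1
}

# Worked example on one string: 19 characters with commas, 17 without.
x <- "word1, word2, word3"
nchar(x)                                # 19
nchar(gsub(",", "", x, fixed = TRUE))   # 17
count_items(x)                          # 3

# The data frame from the question, rebuilt so the sketch is self-contained.
df <- data.frame(
  Variable  = c("Factor1", "Factor2", "Factor3"),
  Variable1 = c("word1, word2", "word1", "word1"),
  Variable2 = c("word1", "word1, word2", "word1"),
  Variable3 = c("word1, word2", "word1", "word1, word2, word3"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column except the first; the columns become numeric.
df2 <- df
df2[-1] <- lapply(df[-1], count_items)
df2
```

One thing to keep in mind with this approach: an empty string contains no commas, so it is counted as 1 item rather than 0.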