I want to find the distribution of number of titles with 1 word, 2 words, 3 words, ... in my dataset "jnl.dt" in R.
one_word_title = 0
two_word_title = 0
three_word_title = 0
for (i in 1:x){
if (str_count(jnl.dt[i]$`Full Title`, '\\w ')==1){one_word_title <- one_word_title 1}
else if (str_count(jnl.dt[i]$`Full Title`, '\\w ')==2){two_word_title <- two_word_title 1}
else if (str_count(jnl.dt[i]$`Full Title`, '\\w ')==3){three_word_title <- three_word_title 1}
}
one_word_title
two_word_title
three_word_title
Is there a way to find the distribution of number of titles with different number of words without hardcoding the number of words in title?
CodePudding user response:
Here's a proposal somewhat tentative given the absence of reproducible data:
Let's assume you have this kind of data and titles:
df <- data.frame(titles = c("The Great Gatsby", "That's the Story of my Life", "Love Story", "Alice in Wonderland", "Harry Potter"))
To get the "distribution" of number of words in the titlesyou can do this:
library(dplyr)
library(stringr)
df %>%
mutate(N_w = str_count(titles, "\\S ")) %>%
group_by(N_w) %>%
summarise(Dist_N_w = n())
# A tibble: 3 x 2
N_w Dist_N_w
* <int> <int>
1 2 2
2 3 2
3 6 1
Note that using \\w and, respectively, \\S makes a difference: as the apostrophe is not contained in the \\w character class (for letter, digits, and the underscore) That's will be counted as 2 words. If you use \\S instead, which is a negative character class matching anything that is a whitespace (including actual whitespace and also new line and return characters etc.), the count for That's will be 1.
CodePudding user response:
Instead of doing this for every word separately, you can do this together.
table(stringr::str_count(jnl.dt$`Full Title`, '\\w '))
CodePudding user response:
We may use unnest_tokens
library(tidytext)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
unnest_tokens(word, titles) %>%
count(rn) %>%
count(n)
