R: How to count the total number of tokens in a corpus?

Time:02-01

I have created a quanteda corpus called readtext_corpus that contains 190 texts. I would like to count the total number of tokens (words) in the corpus. I tried the function ntoken(), but it returns the number of tokens per text, not the total across all 190 texts.

CodePudding user response:

You can wrap the call in sum(). ntoken() returns a named vector with one count per text, so summing that vector gives the total for the whole corpus. Here is an example:

test <- c("testing string number 1","testing string number 2")

sum(quanteda::ntoken(test))

Result:

> quanteda::ntoken(test)
text1 text2 
    4     4 
> sum(quanteda::ntoken(test))
[1] 8

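If you are using pipes, which is common with quanteda, the same thing works. Note that %>% comes from the magrittr package, so that package (or one that re-exports the pipe, such as dplyr) has to be loaded first: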

> quanteda::ntoken(test) %>% sum()
[1] 8
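
For the corpus object from the question, the same idea can be sketched as below. This is only a minimal sketch: example_corpus is a made-up two-document stand-in for readtext_corpus, and the variable names are my own. It tokenises explicitly with tokens() before counting, which makes the tokenisation options visible; I believe newer quanteda releases also discourage calling ntoken() directly on a corpus or character vector, so this form should be the safer one.

```r
library(quanteda)

# Hypothetical stand-in for readtext_corpus (the real one has 190 texts)
example_corpus <- corpus(c(doc1 = "testing string number 1",
                           doc2 = "testing string number 2"))

# Tokenise first, then count tokens per document
toks <- tokens(example_corpus)
per_text <- ntoken(toks)

# Total number of tokens across the whole corpus
total_tokens <- sum(per_text)

# By default tokens() keeps punctuation as tokens; drop it to count words only
word_toks <- tokens(example_corpus, remove_punct = TRUE)
total_words <- sum(ntoken(word_toks))

print(total_tokens)
print(total_words)
```

With the real data, replace example_corpus with readtext_corpus. The two totals only differ when the texts contain punctuation, which the toy example does not.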