I need to calculate lexical diversity for a text column in a pandas DataFrame. An example of the column:
Text
Happy new Year!
happy new Year! Wishing you all the best
New year is coming
[Oh Oh oh... 2022 is here] # this is a string, not a list
I have tried the following:
from lexical_diversity import lex_div as ld
tok = ld.tokenize(df['Text'])
flt = ld.flemmatize(df['Text'])
ld.mtld_ma_bid(flt)
but running ld.tokenize raises this error: TypeError: expected string or bytes-like object. The Text column has dtype object.
Is there anything I am missing? I have already dropped rows with missing data.
CodePudding user response:
ld.tokenize expects a single string, so it cannot handle a list or a whole Series. You have to apply each function to every row individually:
tok = df['Text'].apply(ld.tokenize)
flt = df['Text'].apply(ld.flemmatize)
flt.apply(ld.mtld_ma_bid)
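To see why the per-row call matters, here is a minimal standard-library sketch. It does not use lexical_diversity: simple_tokenize and type_token_ratio are hypothetical stand-ins (a regex tokenizer and a plain type-token ratio, not MTLD), chosen only so the example runs anywhere. The error text in the question is the one Python's re module raises for non-string input, so the stand-in reproduces it under that assumption. The list comprehension plays the role that Series.apply plays in pandas.

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for ld.tokenize: expects ONE string."""
    # re.findall raises TypeError if `text` is not a string (e.g. a list).
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9']+", text)]

def type_token_ratio(tokens):
    """Simple diversity measure (unique tokens / all tokens); not MTLD."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Sample rows taken from the question's Text column.
texts = [
    "Happy new Year!",
    "happy new Year! Wishing you all the best",
    "New year is coming",
    "[Oh Oh oh... 2022 is here]",
]

# Passing the whole collection at once fails, as in the question.
try:
    simple_tokenize(texts)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Calling the function once per row works; this is what
# df['Text'].apply(...) does for each element of the Series.
tokens_per_row = [simple_tokenize(text) for text in texts]
ttr_per_row = [type_token_ratio(tokens) for tokens in tokens_per_row]

print(error_message)
for text, ttr in zip(texts, ttr_per_row):
    print(f"{ttr:.2f}  {text}")
```

If the same TypeError still appears after switching to apply, it is worth checking whether any row holds a non-string value such as a number, since apply passes each cell through unchanged.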
