Lexical diversity returns TypeError using Text column

Time:01-04

I need to calculate lexical diversity. Here is an example of a text column from a pandas DataFrame:

Text
Happy new Year!
happy new Year! Wishing you all the best
New year is coming 
[Oh Oh oh... 2022 is here] # this is a string, not a list

I have tried as below:

from lexical_diversity import lex_div as ld

tok = ld.tokenize(df['Text'])
flt = ld.flemmatize(df['Text'])
ld.mtld_ma_bid(flt)

but I get TypeError: expected string or bytes-like object when I run ld.tokenize. The Text column has dtype object.
Is there anything I am missing? I have already dropped the rows with missing data.

CodePudding user response:

ld.tokenize expects a single string, not a list or a pandas Series, so passing the whole column raises the TypeError. You have to apply the functions to each row individually:

tok = df['Text'].apply(ld.tokenize)
flt = df['Text'].apply(ld.flemmatize)
flt.apply(ld.mtld_ma_bid)
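
To make the per-row idea concrete without depending on the library, here is a minimal sketch. It rests on several assumptions: simple_tokenize is a hypothetical stand-in for a string-only tokenizer (it is not how ld.tokenize is implemented), type_token_ratio is a much simpler measure than MTLD, and a plain list stands in for the Text column. It also skips non-string values, in case any missing data remains.

```python
import re


def simple_tokenize(text):
    # Hypothetical stand-in for a string-only tokenizer such as ld.tokenize.
    # re.findall raises TypeError if `text` is not a string (e.g. a list).
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9']+", text)]


def type_token_ratio(text):
    """Unique tokens / total tokens for ONE string; None if it cannot be scored."""
    if not isinstance(text, str):
        return None  # e.g. NaN or None left over from missing data
    tokens = simple_tokenize(text)
    if not tokens:
        return None  # empty or punctuation-only text
    return len(set(tokens)) / len(tokens)


# A plain list standing in for df['Text'].
texts = [
    "Happy new Year!",
    "happy new Year! Wishing you all the best",
    "New year is coming ",
    "[Oh Oh oh... 2022 is here]",
]

# Passing the whole collection to a string-only function fails,
# which mirrors the TypeError in the question.
try:
    simple_tokenize(texts)
except TypeError as exc:
    print("whole column:", type(exc).__name__)

# Calling the function once per row works. With pandas, the equivalent
# would be df['Text'].apply(type_token_ratio).
scores = [type_token_ratio(t) for t in texts]
print(scores)
```

The same pattern should carry over to the library calls: wrap them in a function that takes one string, then pass that function to apply.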