I need some help. I have a column of words, and I want to remove the duplicated words inside each cell.
What I want to get is something like this:
| words | expected |
|---|---|
| car apple car good | car apple good |
| good bad well good | good bad well |
| car apple bus food | car apple bus food |
I've tried this, but it is not working:
from collections import OrderedDict
df['expected'] = (df['words'].str.split().apply(lambda x: OrderedDict.fromkeys(x).keys()).str.join(' '))
I'd be very grateful if somebody could help me.
CodePudding user response:
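If order is important, use dict.fromkeys in a list comprehension. Dict keys are unique and keep insertion order (Python 3.7+), so only the first occurrence of each word survives: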
df['expected'] = [' '.join(dict.fromkeys(w.split())) for w in df['words']]
Output:
                words            expected
0  car apple car good      car apple good
1  good bad well good       good bad well
2  car apple bus food  car apple bus food
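The same idea can be pulled out into a small helper, which makes it easy to check on plain strings before applying it to the column. This is only a minimal sketch: the name `dedupe_words` and the `rows` list are hypothetical and stand in for the values of `df['words']`.

```python
def dedupe_words(text):
    """Drop repeated words from a whitespace-separated string,
    keeping the first occurrence of each word in its original position."""
    # dict.fromkeys builds a dict whose keys are the words; keys are unique
    # and (Python 3.7+) iterate in insertion order.
    return " ".join(dict.fromkeys(text.split()))


# Hypothetical sample data mirroring the 'words' column from the question.
rows = ["car apple car good", "good bad well good", "car apple bus food"]
deduped = [dedupe_words(r) for r in rows]
print(deduped)
```

On the DataFrame, something like `df['expected'] = df['words'].map(dedupe_words)` should give the same result as the list comprehension, assuming every cell is a string.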
CodePudding user response:
If you don't need to retain the original order of the words, you can build an intermediate set, which removes duplicates. Note that the order of the words in the result is then arbitrary.
df["expected"] = df["words"].str.split().apply(set).str.join(" ")
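Because a set has no defined order, the joined string can come out in a different order than the input. If a reproducible (but not original) order is acceptable, one option is to sort the set before joining. This is a sketch of a variation, not part of the answer above; the name `dedupe_unordered` is hypothetical.

```python
def dedupe_unordered(text):
    """Remove duplicate words without preserving the original order.
    Sorting the set makes the output deterministic (alphabetical)."""
    unique_words = set(text.split())  # duplicates collapse here
    return " ".join(sorted(unique_words))


result = dedupe_unordered("good bad well good")
print(result)
```

If the original word order matters, this approach is not suitable and the dict.fromkeys variant is the better fit.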
CodePudding user response:
If each cell is a whitespace-separated string such as "word1 word2" (this also does not preserve word order):
df['expected'] = [" ".join(set(wrds.strip().split())) for wrds in df.words]
