Is there an easy way to remove certain (stop) words from sentences stored as a list of lists in a DataFrame column, and to right-pad any sentence that ends up shorter than the longest one?
Example:
import pandas as pd
stopwords = ['the', 'a', 'an']
df = pd.DataFrame(data={'sentence': [[["the", "deer", 'was', 'a', 'tasty', 'meal'], ["the", "girl", 'walks'], ["thanks", "for", "all", "the", "gifts"]]]})
| | sentence |
|---:|:-------------------------------------------------------------------------------------------------------------------|
| 0 | [['the', 'deer', 'was', 'a', 'tasty', 'meal'], ['the', 'girl', 'walks'], ['thanks', 'for', 'all', 'the', 'gifts']] |
Expected result:
| | sentence |
|---:|:------------------------------------|
| 0 | ['deer', 'was', 'tasty', 'meal'] |
| 1 | ['girl', 'walks', '<pad>', '<pad>'] |
| 2 | ['thanks', 'for', 'all', 'gifts'] |
CodePudding user response:
Try this:
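import numpy as np
# Flatten to one word per row (index = sentence number), then drop the stop words
x = df['sentence'].explode().reset_index(drop=True).explode().pipe(lambda s: s[~s.isin(stopwords)])
# Length of the longest sentence after filtering
MAX = x.groupby(level=0).agg(len).max()
new_df = x.groupby(level=0).apply(lambda g: g.reset_index(drop=True).reindex(np.arange(MAX)).fillna('<pad>')).groupby(level=0).agg(list).to_frame()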
Output:
>>> new_df
                      sentence
0     [deer, was, tasty, meal]
1  [girl, walks, <pad>, <pad>]
2    [thanks, for, all, gifts]
It uses explode twice to flatten the nested lists down to one word per row, then filters out the stop words via pipe. Next, it finds the length of the longest group and reindexes each group to that length. Note the fill value is <pad>, but you can change it to whatever you'd like, or drop the fillna call altogether to leave the missing positions as NaN.
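If the groupby/reindex chain above is hard to follow, the same steps (flatten, filter, find the longest sentence, pad) can be sketched with plain list comprehensions. This is only an illustrative alternative, not the code from the answer: the names filtered, max_len and padded are made up for the example, and stopwords is turned into a set purely for faster lookups.

```python
import pandas as pd

stopwords = {'the', 'a', 'an'}  # assumption: a set instead of the original list
df = pd.DataFrame(data={'sentence': [[["the", "deer", 'was', 'a', 'tasty', 'meal'],
                                      ["the", "girl", 'walks'],
                                      ["thanks", "for", "all", "the", "gifts"]]]})

# Flatten one level so that each row holds a single sentence (a list of words).
sentences = df['sentence'].explode().reset_index(drop=True)

# Drop the stop words from every sentence.
filtered = [[w for w in words if w not in stopwords] for words in sentences]

# Right-pad each sentence to the length of the longest filtered sentence.
max_len = max(len(words) for words in filtered)
padded = [words + ['<pad>'] * (max_len - len(words)) for words in filtered]

new_df = pd.DataFrame({'sentence': padded})
print(new_df)
```

Because the padding is done on ordinary Python lists, changing the pad token or padding on the left instead only means editing the one line that builds padded.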
CodePudding user response:
Here is a way using reshaping:
df2 = (df.explode('sentence')
         .assign(group=lambda d: d.groupby(d.index).cumcount())
         .explode('sentence')
         .loc[lambda d: ~d['sentence'].isin(stopwords)]  # filter words
         .rename_axis('index')
         .assign(idx=lambda d: d.groupby(['index', 'group']).cumcount())
         .set_index(['group', 'idx'], append=True)
         .unstack('group')   # unstack/stack
         .fillna('<pad>')    # to pad
         .stack('group')     # missing words
         .groupby(level=[0, 'group']).agg(list)
         .reset_index('group', drop=True)
      )
Output:
                      sentence
0     [deer, was, tasty, meal]
0  [girl, walks, <pad>, <pad>]
0    [thanks, for, all, gifts]
NB. This solution should also work when the input has multiple rows:
df = pd.concat([df]*3, ignore_index=True)
# sentence
# 0 [[the, deer, was, a, tasty, meal], [the, girl,...
# 1 [[the, deer, was, a, tasty, meal], [the, girl,...
# 2 [[the, deer, was, a, tasty, meal], [the, girl,...
Output:
                          sentence
index
0         [deer, was, tasty, meal]
0      [girl, walks, <pad>, <pad>]
0        [thanks, for, all, gifts]
1         [deer, was, tasty, meal]
1      [girl, walks, <pad>, <pad>]
1        [thanks, for, all, gifts]
2         [deer, was, tasty, meal]
2      [girl, walks, <pad>, <pad>]
2        [thanks, for, all, gifts]
