I have a data frame as shown below and I have to prepare the "target column"
| row | keyphrase | tag1 | tag2 | tag3 | target column (not given, to be prepared) |
|---|---|---|---|---|---|
| 1 | word1 word2 | none | word1 | word2 | <tag2>word1<tag2> <tag3>word2<tag3> |
| 2 | word3 word4 | none | none | word3, word4 | <tag3>word3<tag3> <tag3>word4<tag3> (there is a comma so different tag) |
| 3 | word5 word6 | word6 | none | word5 | <tag3>word5<tag3> <tag1>word6<tag1> |
| 4 | word7 word8 | none | word7 word8 | none | <tag2>word7 word8<tag2> (since there is not comma, it is a same word) |
| 5 | word9 word10 word11 | word9 | none | word10 word11 | <tag1>word9<tag1> <tag3>word10 word11<tag3> |
| 6 | word12 word13 word14 | word13 | word12 | word14 | <tag2>word12<tag2> <tag1>word13<tag1> <tag3>word14<tag3> |
If there a comma in between two words under the same column, then it has to be tagged twice as shown in the row 2. If there is no comma but a white space then it has to be treated as same word i.e., it has to be tagged once. ALSO, PLEASE NOTE THAT ORDER ALSO MATTERS (row 3 and row 6). Ignore none (i.e., no words are there so no tags)
I have done using iteration (i.e., using iloc, for loop) but it takes huge amount of time as the data points are more than 200k and the number of tags are also more than 20. I would like to have an efficient code. Please let me know. Also, let me know if you want extra clarity regarding the question/examples.
CodePudding user response:
You can use DataFrame.filter for columns with tag text, reshape by DataFrame.stack for remove None, NaN, Series.str.split with Series.explode for repeat values with ,. Then join columns with < and >:
#if none is not None (NoneType) convert to NaN
df = df.replace('none',np.nan)
df1 = (df.set_index('keyphrase', append=True)
.filter(like='tag')
.stack()
.rename_axis(('idx','keyphrase','tags'))
.str.split(', ')
.explode()
.reset_index(name='word')
.rename_axis('i')
)
#for correct ordering by splitted keyphrase by reindex
df1 = df1.assign(word = df1['word'].str.split()).explode('word')
df1['keyphrase'] = df1.groupby('idx')['keyphrase'].transform(lambda x: x.iat[0].split())
idx = df1.set_index(['idx','word'], append=True).index.droplevel(0)
df1 = df1.set_index(['idx','keyphrase'], append=True).reset_index(level=0).reindex(idx)
df1 = df1.groupby(['idx','i','tags'], sort=False)['word'].agg(' '.join).reset_index()
df1['word'] = '<' df1['tags'] '>' df1['word'] '<' df1['tags'] '>'
print (df1)
idx i tags word
0 0 0 tag2 <tag2>word1<tag2>
1 0 1 tag3 <tag3>word2<tag3>
2 1 2 tag3 <tag3>word3<tag3>
3 1 3 tag3 <tag3>word4<tag3>
4 2 5 tag3 <tag3>word5<tag3>
5 2 4 tag1 <tag1>word6<tag1>
6 3 6 tag2 <tag2>word7 word8<tag2>
7 4 7 tag1 <tag1>word9<tag1>
8 4 8 tag3 <tag3>word10 word11<tag3>
9 5 10 tag2 <tag2>word12<tag2>
10 5 9 tag1 <tag1>word13<tag1>
11 5 11 tag3 <tag3>word14<tag3>
df['target'] = df1.groupby('idx')['word'].agg(''.join)
print (df)
row keyphrase tag1 tag2 tag3 \
0 1 word1 word2 NaN word1 word2
1 2 word3 word4 NaN NaN word3, word4
2 3 word5 word6 word6 NaN word5
3 4 word7 word8 NaN word7 word8 None
4 5 word9 word10 word11 word9 NaN word10 word11
5 6 word12 word13 word14 word13 word12 word14
target
0 <tag2>word1<tag2><tag3>word2<tag3>
1 <tag3>word3<tag3><tag3>word4<tag3>
2 <tag3>word5<tag3><tag1>word6<tag1>
3 <tag2>word7 word8<tag2>
4 <tag1>word9<tag1><tag3>word10 word11<tag3>
5 <tag2>word12<tag2><tag1>word13<tag1><tag3>word...
