Attaching pandas dataframe's column based on multiple conditions in an efficient way (python)-CodePudding

I have a data frame as shown below and I have to prepare the "target column"

row	keyphrase	tag1	tag2	tag3	target column (not given, to be prepared)
1	word1 word2	none	word1	word2	<tag2>word1<tag2> <tag3>word2<tag3>
2	word3 word4	none	none	word3, word4	<tag3>word3<tag3> <tag3>word4<tag3> (there is a comma so different tag)
3	word5 word6	word6	none	word5	<tag3>word5<tag3> <tag1>word6<tag1>
4	word7 word8	none	word7 word8	none	<tag2>word7 word8<tag2> (since there is not comma, it is a same word)
5	word9 word10 word11	word9	none	word10 word11	<tag1>word9<tag1> <tag3>word10 word11<tag3>
6	word12 word13 word14	word13	word12	word14	<tag2>word12<tag2> <tag1>word13<tag1> <tag3>word14<tag3>

If there a comma in between two words under the same column, then it has to be tagged twice as shown in the row 2. If there is no comma but a white space then it has to be treated as same word i.e., it has to be tagged once. ALSO, PLEASE NOTE THAT ORDER ALSO MATTERS (row 3 and row 6). Ignore none (i.e., no words are there so no tags)

I have done using iteration (i.e., using iloc, for loop) but it takes huge amount of time as the data points are more than 200k and the number of tags are also more than 20. I would like to have an efficient code. Please let me know. Also, let me know if you want extra clarity regarding the question/examples.

CodePudding user response：

You can use DataFrame.filter for columns with tag text, reshape by DataFrame.stack for remove None, NaN, Series.str.split with Series.explode for repeat values with ,. Then join columns with < and >:

#if none is not None (NoneType) convert to NaN

df = df.replace('none',np.nan)


df1 = (df.set_index('keyphrase', append=True)
         .filter(like='tag')
         .stack()
         .rename_axis(('idx','keyphrase','tags'))
         .str.split(', ')
         .explode()
         .reset_index(name='word')
         .rename_axis('i')
         )

#for correct ordering by splitted keyphrase by reindex
df1 = df1.assign(word = df1['word'].str.split()).explode('word')
df1['keyphrase'] = df1.groupby('idx')['keyphrase'].transform(lambda x: x.iat[0].split())
idx = df1.set_index(['idx','word'], append=True).index.droplevel(0)
df1 = df1.set_index(['idx','keyphrase'], append=True).reset_index(level=0).reindex(idx)

df1 = df1.groupby(['idx','i','tags'], sort=False)['word'].agg(' '.join).reset_index()
df1['word'] = '<'   df1['tags']   '>'   df1['word']  '<'   df1['tags']   '>'
print (df1)
    idx   i  tags                       word
0     0   0  tag2          <tag2>word1<tag2>
1     0   1  tag3          <tag3>word2<tag3>
2     1   2  tag3          <tag3>word3<tag3>
3     1   3  tag3          <tag3>word4<tag3>
4     2   5  tag3          <tag3>word5<tag3>
5     2   4  tag1          <tag1>word6<tag1>
6     3   6  tag2    <tag2>word7 word8<tag2>
7     4   7  tag1          <tag1>word9<tag1>
8     4   8  tag3  <tag3>word10 word11<tag3>
9     5  10  tag2         <tag2>word12<tag2>
10    5   9  tag1         <tag1>word13<tag1>
11    5  11  tag3         <tag3>word14<tag3>

df['target'] = df1.groupby('idx')['word'].agg(''.join)
print (df)
   row             keyphrase    tag1         tag2           tag3  \
0    1           word1 word2     NaN        word1          word2   
1    2           word3 word4     NaN          NaN   word3, word4   
2    3           word5 word6   word6          NaN          word5   
3    4           word7 word8     NaN  word7 word8           None   
4    5   word9 word10 word11   word9          NaN  word10 word11   
5    6  word12 word13 word14  word13       word12         word14   

                                              target  
0                 <tag2>word1<tag2><tag3>word2<tag3>  
1                 <tag3>word3<tag3><tag3>word4<tag3>  
2                 <tag3>word5<tag3><tag1>word6<tag1>  
3                            <tag2>word7 word8<tag2>  
4         <tag1>word9<tag1><tag3>word10 word11<tag3>  
5  <tag2>word12<tag2><tag1>word13<tag1><tag3>word...