Extract pattern from a column based on another column's value

Time:02-01

Given two columns of a pandas DataFrame:

import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
      'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])

I'd like to extract the substring of column word that runs from the start of the string through the end of the corresponding string in column root, or NaN if the string in root does not occur in word. That is, the resulting dataframe would look as follows:

word       root    match
replay     play    replay
replayed   play    replay
playable   play    play
thinker    think   think
think      think   think
thoughtful think   NaN
ex)mple    ex)mple ex)mple

My dataframe has several thousand rows, so I'd like to avoid for-loops if possible.

CodePudding user response:

You can use a regex with str.extract in a groupby apply:

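import re
# group_keys=False keeps the original row index so the result aligns with df;
# expand=False makes str.extract return a Series instead of a one-column DataFrame.
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
               )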
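
The re.escape call is what lets a root such as 'ex)mple' work. A minimal sketch of the difference, using only the standard library (the variable names compiled and escaped_match are just for illustration):

```python
import re

root = 'ex)mple'
word = 'ex)mple'

# Unescaped, the ")" inside the root closes the capture group early and
# leaves a stray ")" behind, so the pattern fails to compile.
try:
    re.match(f'^(.*{root})', word)
    compiled = True
except re.error:
    compiled = False

# Escaped, every character of the root is matched literally.
escaped_match = re.match(f'^(.*{re.escape(root)})', word).group(1)
```

Any root containing regex metacharacters (parentheses, dots, brackets and so on) needs the same treatment.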

Or, if you expect few repeated "root" values:

import re
# The walrus operator (:=) requires Python 3.8+.
# Rows without a match get None rather than NaN.
df['match'] = df.apply(lambda r: m.group()
                       if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)

Output:

         word   root   match
0      replay   play  replay
1    replayed   play  replay
2    playable   play    play
3     thinker  think   think
4       think  think   think
5  thoughtful  think     NaN
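
If you'd rather skip regex entirely, here is a sketch of an alternative using plain string methods. It is not part of the approaches above, and the helper name prefix_through_root is made up for this example. It uses str.rfind so that, like the greedy .* in the regex versions, it cuts at the last occurrence of root:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'word': ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple'],
    'root': ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple'],
})

def prefix_through_root(word, root):
    """Return word up to the end of the last occurrence of root, or NaN if absent."""
    pos = word.rfind(root)  # -1 when root does not occur in word
    return word[:pos + len(root)] if pos != -1 else np.nan

# Still a Python-level loop, but it avoids the per-row overhead of
# DataFrame.apply(axis=1) and needs no escaping of special characters.
df['match'] = [prefix_through_root(w, r) for w, r in zip(df['word'], df['root'])]
```

For a few thousand rows this should be more than fast enough. If you want the first occurrence of root instead of the last, swap rfind for find.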