Given two columns of a pandas DataFrame:
import pandas as pd
df = {'word': ['replay','replayed','playable','thinker','think','thoughtful', 'ex)mple'],
'root': ['play','play','play','think','think','think', 'ex)mple']}
df = pd.DataFrame(df, columns= ['word','root'])
For each row, I'd like to extract the part of word that runs from its start through the end of the corresponding root string, or NaN if root does not occur in word. That is, the resulting dataframe would look as follows:
word root match
replay play replay
replayed play replay
playable play play
thinker think think
think think think
thoughtful think NaN
ex)mple ex)mple ex)mple
My dataframe has several thousand rows, so I'd like to avoid for-loops if possible.
CodePudding user response:
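You can use a regex with str.extract in a groupby apply, which builds one pattern per distinct root:
import re
# re.escape protects roots that contain regex metacharacters, e.g. 'ex)mple'.
# group_keys=False keeps the original row index and expand=False returns a
# Series, so the result aligns with df on assignment.
df['match'] = (df.groupby('root', group_keys=False)['word']
                 .apply(lambda g: g.str.extract(f'^(.*{re.escape(g.name)})', expand=False))
              )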
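To see why re.escape matters for a root such as 'ex)mple', here is a minimal sketch that uses only the standard re module. The helper name prefix_through_root is hypothetical and not part of pandas; it applies the same greedy pattern to a single word/root pair.

```python
import re

def prefix_through_root(word, root):
    """Return word up to the end of the last occurrence of root, or None."""
    # Greedy '.*' followed by the escaped root, anchored at the start by re.match.
    m = re.match(f'.*{re.escape(root)}', word)
    return m.group() if m else None

# Without escaping, the unbalanced ')' makes the pattern invalid.
try:
    re.compile('.*ex)mple')
    unescaped_ok = True
except re.error:
    unescaped_ok = False

print(unescaped_ok)
print(prefix_through_root('ex)mple', 'ex)mple'))
print(prefix_through_root('replayed', 'play'))
print(prefix_through_root('thoughtful', 'think'))
```

Because '.*' is greedy, the match extends to the last occurrence of root when it appears more than once in word.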
Or, if you expect few repeated "root" values, match row by row:
import re
# Requires Python 3.8+ for the := operator; rows without a match get None.
df['match'] = df.apply(lambda r: m.group()
                       if (m:=re.match(f'.*{re.escape(r["root"])}', r['word']))
                       else None, axis=1)
output:
word root match
0 replay play replay
1 replayed play replay
2 playable play play
3 thinker think think
4 think think think
5 thoughtful think NaN
6 ex)mple ex)mple ex)mple
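As a cross-check that avoids regular expressions entirely, the same result can be sketched with str.rfind in a list comprehension. This assumes both columns hold plain strings; the name match_prefix is hypothetical, and plain lists stand in for the two columns so the snippet runs without pandas.

```python
def match_prefix(word, root):
    """Return word up to the end of the last occurrence of root, else None."""
    i = word.rfind(root)  # -1 when root does not occur in word
    return word[:i + len(root)] if i != -1 else None

# Same values as the 'word' and 'root' columns in the question.
words = ['replay', 'replayed', 'playable', 'thinker', 'think', 'thoughtful', 'ex)mple']
roots = ['play', 'play', 'play', 'think', 'think', 'think', 'ex)mple']

matches = [match_prefix(w, r) for w, r in zip(words, roots)]
print(matches)
```

With the dataframe from the question, the equivalent would be to build the list from zip(df['word'], df['root']) and assign it to df['match']. The missing entry is then None rather than NaN, as in the df.apply variant. Using rfind picks the last occurrence of root, which mirrors the greedy .* in the regex patterns.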
