I am trying to write a function that builds an output string based on which search words appear in an existing DataFrame column. I get output, but every result after the first also contains the results from the previous rows (earlier outputs repeat, followed by the new one). How can I correct this? I plan to expand the function with more for loops, so there may also be a more efficient way to do this.
import pandas as pd

# Declarations
search_words = ['one', 'two', 'three']
l1 = []

# Function
def concat(text):
    for i in search_words[0:1]:
        if i in text:
            a = 'four'
            l1.append(a)
    for i in search_words[1:3]:
        if i in text:
            b = 'five'
            l1.append(b)
    listToStr = ' '.join(map(str, l1))
    return listToStr

# Test DataFrame
dftest = pd.DataFrame(data=['one filler two', 'two', 'filler', 'three one'],
                      columns=['col1'])

# Test output
dftest['col2'] = dftest['col1'].apply(lambda x: concat(x))
dftest
Wrong output given:
             col1                      col2
0  one filler two                 four five
1             two            four five five
2          filler            four five five
3       three one  four five five four five
Desired output:
             col1       col2
0  one filler two  four five
1             two       five
2          filler
3       three one  five four
CodePudding user response:
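Because l1 is defined once outside the function, every call to concat appends to the same list, so results from earlier rows carry over. Define a new l1 inside concat so that each call starts with an empty list: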
def concat(text):
    l1 = []
    for i in search_words[0:1]:
        if i in text:
            l1.append('four')
    for i in search_words[1:3]:
        if i in text:
            l1.append('five')
    listToStr = ' '.join(l1)
    return listToStr
Also when you apply concat, you don't need lambda:
dftest['col2'] = dftest['col1'].apply(concat)
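Output (row 3 reads "four five" rather than the desired "five four" because the function always checks 'one' before 'two' and 'three', regardless of the word order in the text):
             col1       col2
0  one filler two  four five
1             two       five
2          filler
3       three one  four five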
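To make the cause visible without pandas, here is a minimal sketch that runs both variants over the same four strings. The names concat_shared, concat_local, shared_results and local_results are made up for this illustration; the logic is otherwise the question's.

```python
search_words = ['one', 'two', 'three']
l1 = []  # defined once at module level, as in the question


def concat_shared(text):
    # Appends to the module-level list, so earlier results are never cleared.
    for w in search_words[0:1]:
        if w in text:
            l1.append('four')
    for w in search_words[1:3]:
        if w in text:
            l1.append('five')
    return ' '.join(l1)


def concat_local(text):
    # A fresh list per call, so each result depends only on `text`.
    out = []
    for w in search_words[0:1]:
        if w in text:
            out.append('four')
    for w in search_words[1:3]:
        if w in text:
            out.append('five')
    return ' '.join(out)


rows = ['one filler two', 'two', 'filler', 'three one']
shared_results = [concat_shared(r) for r in rows]
local_results = [concat_local(r) for r in rows]

print(shared_results)
# ['four five', 'four five five', 'four five five', 'four five five four five']
print(local_results)
# ['four five', 'five', '', 'four five']
```

The first list reproduces the wrong output from the question, and the second matches the output of the corrected function.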
CodePudding user response:
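One simpler way might be to map each search word to its output with a dictionary, split the text on whitespace, and look up each word in order: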
dict_assignment = {
    "one": "four",
    "two": "five",
    "three": "five",
}
dftest["col2"] = dftest.col1.apply(
    lambda p: ' '.join(dict_assignment[w] for w in p.split() if w in search_words)
)
print(dftest)
#              col1       col2
# 0  one filler two  four five
# 1             two       five
# 2          filler
# 3       three one  five four
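One difference worth noting: splitting on whitespace matches whole words only, while the question's `i in text` test matches substrings. The sketch below is a self-contained variant that uses the dictionary keys directly instead of search_words. The names word_map and map_words and the extra 'someone' row are assumptions added for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical name; same mapping as dict_assignment above.
word_map = {"one": "four", "two": "five", "three": "five"}


def map_words(text):
    # Keep only whole words that are keys of word_map, in the order they appear.
    return ' '.join(word_map[w] for w in text.split() if w in word_map)


# The last row is an added example: 'someone' contains 'one' as a substring,
# but it is not a whole-word match.
dftest = pd.DataFrame({'col1': ['one filler two', 'two', 'filler', 'three one', 'someone']})
dftest['col2'] = dftest['col1'].apply(map_words)
print(dftest)
```

With the substring test from the question, 'someone' would produce 'four'. With the split-based lookup it produces an empty string. Which behaviour is wanted depends on the data.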
CodePudding user response:
Using apply to call a function that contains a loop can be very inefficient on large DataFrames.
Use vectorized code instead, with a dictionary that maps each search word to its output, then join the result back to the original DataFrame.
d = {'one': 'four', 'two': 'five', 'three': 'five'}
df2 = dftest.join(
    dftest['col1']
    .str.extractall(f"({'|'.join(d)})")[0]
    .map(d)
    .groupby(level=0).agg(' '.join)
    .rename('col2')
)
NB. I used a simple regex here. If your search words contain special regex characters, you might need to escape them using re.escape. Please update the example if this is the case.
Output:
             col1       col2
0  one filler two  four five
1             two       five
2          filler        NaN
3       three one  five four
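If the search words may contain regex metacharacters, or if an empty string is preferred over NaN for rows without a match (as in the desired output), a sketch along these lines might work. The names pattern and col2 are introduced here for illustration, and the fillna step is an addition that is not in the answer above.

```python
import re

import pandas as pd

d = {'one': 'four', 'two': 'five', 'three': 'five'}
dftest = pd.DataFrame({'col1': ['one filler two', 'two', 'filler', 'three one']})

# Escape each key so characters such as '.' or '+' are matched literally.
pattern = f"({'|'.join(map(re.escape, d))})"

col2 = (
    dftest['col1']
    .str.extractall(pattern)[0]      # one row per match, indexed by (row, match)
    .map(d)                          # translate each matched word
    .groupby(level=0).agg(' '.join)  # rejoin the matches per original row
    .rename('col2')
)

df2 = dftest.join(col2)
# Rows with no match are absent from col2 and become NaN after the join.
df2['col2'] = df2['col2'].fillna('')
print(df2)
```

Like the question's `in` test, this pattern matches substrings. Wrapping the group in `\b` word boundaries would restrict it to whole words, if that is what the data needs.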
