How do I create a function that creates a new dataframe column based on search words that exist in a

I am trying to write a function flexible enough to output a string based on search words found in an existing DataFrame column. I get an output, but every row after the first is chained to the results of the previous rows: the earlier outputs repeat, followed by the new one. How can I correct this? I plan to expand the function with more for loops, so there may also be a more efficient way to do this.

# Declarations
import pandas as pd

search_words = ['one', 'two', 'three']
l1 = []

# Function
def concat(text):
    for i in search_words[0:1]:
        if i in text:
            a = 'four'
            l1.append(a)
    for i in search_words[1:3]:
        if i in text:
            b = 'five'
            l1.append(b)
    listToStr = ' '.join(map(str, l1))
    return listToStr

# Test Dataframe
dftest = pd.DataFrame(data =['one filler two','two','filler','three one'], 
                      columns = ['col1'])

# Test output
dftest['col2'] = dftest['col1'].apply(lambda x: concat(x))
dftest

Actual (wrong) output:

    col1               col2
0   one filler two     four five
1   two                four five five
2   filler             four five five
3   three one          four five five four five

Desired output:

    col1               col2
0   one filler two     four five
1   two                five
2   filler             
3   three one          five four

CodePudding user response：

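l1 is defined once, outside the function, so every call to concat appends to the same list. Define a new l1 inside concat so that each call starts with an empty list: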

def concat(text):
    l1 = []
    for i in search_words[0:1]:
        if i in text:
            l1.append('four')
    for i in search_words[1:3]:
        if i in text:
            l1.append('five')
    listToStr = ' '.join(l1)
    return listToStr

Also, you don't need a lambda when you apply concat; pass the function directly:

dftest['col2'] = dftest['col1'].apply(concat)

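Output (row 3 reads "four five" rather than the desired "five four", because concat checks the search words in a fixed order, not in the order they appear in the text):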

             col1       col2
0  one filler two  four five
1             two       five
2          filler           
3       three one  four five
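
Since the question mentions adding more for loops, here is a minimal sketch of one way to avoid writing a new loop for every group of words. The `rules` list and the name `label_text` are hypothetical, not part of the original code; the matching is the same substring test as in concat, so the output order still follows the rules, not the text.

```python
import pandas as pd

# Hypothetical rule table: each entry pairs a group of search words with the
# label to emit for every word of that group found in the text.
rules = [
    (['one'], 'four'),
    (['two', 'three'], 'five'),
]

def label_text(text, rules=rules):
    labels = []  # fresh list on every call, so nothing leaks between rows
    for words, label in rules:
        for word in words:
            if word in text:  # substring test, as in the original concat
                labels.append(label)
    return ' '.join(labels)

dftest = pd.DataFrame({'col1': ['one filler two', 'two', 'filler', 'three one']})
dftest['col2'] = dftest['col1'].apply(label_text)
print(dftest)
```

Adding a new group then only means appending another (words, label) pair to `rules`.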

CodePudding user response：

One simpler way might be:

dict_assignment = {
    "one": "four",
    "two": "five",
    "three": "five",
}

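# Look words up in dict_assignment itself, so the filter and the lookup
# can never get out of sync.
dftest["col2"] = dftest.col1.apply(
    lambda p: ' '.join(dict_assignment[w] for w in p.split() if w in dict_assignment)
)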

print(dftest)

#              col1       col2
# 0  one filler two  four five
# 1             two       five
# 2          filler           
# 3       three one  five four
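
Note that splitting the text matches whole words and keeps the order in which they appear, whereas the `in` test from the question matches substrings. A small sketch of the difference, using a made-up sample string and two hypothetical helper names:

```python
dict_assignment = {"one": "four", "two": "five", "three": "five"}

def by_substring(text):
    # Matches a key anywhere in the string; output follows dictionary order.
    return ' '.join(v for k, v in dict_assignment.items() if k in text)

def by_token(text):
    # Matches whole whitespace-separated words; output follows text order.
    return ' '.join(dict_assignment[w] for w in text.split() if w in dict_assignment)

sample = "someone took three"  # "someone" contains "one" as a substring
substring_result = by_substring(sample)
token_result = by_token(sample)
print(repr(substring_result), repr(token_result))
```

Which behavior is right depends on whether partial matches inside longer words should count.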

CodePudding user response：

Using apply to call a function that contains Python loops processes the column row by row and can be slow on large DataFrames.

Use vectorized string methods instead, with a dictionary that maps each search word to its output, then join the result to the original DataFrame.

d = {'one':'four', 'two':'five', 'three':'five'}

df2 = dftest.join(
 dftest['col1']
 .str.extractall(f"({'|'.join(d)})")[0]
 .map(d)
 .groupby(level=0).agg(' '.join)
 .rename('col2')
 )

NB. I used a simple regex here. If your search words contain special regex characters, you might need to escape them using re.escape. Please update the example if this is the case.

Output (df2; rows without any match get NaN):

             col1       col2
0  one filler two  four five
1             two       five
2          filler        NaN
3       three one  five four
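
The note about special characters could be handled roughly as follows. This is a sketch, not a definitive version: the 'c++' entry and the extra test row are made up for illustration, and the final fillna is an optional addition that turns NaN into an empty string to match the desired output in the question.

```python
import re
import pandas as pd

# 'c++' is a hypothetical search word containing regex metacharacters.
d = {'one': 'four', 'two': 'five', 'three': 'five', 'c++': 'six'}

dftest = pd.DataFrame(
    {'col1': ['one filler two', 'two', 'filler', 'three one', 'c++ one']}
)

# Escape each key so characters like '+' are matched literally.
pattern = f"({'|'.join(re.escape(k) for k in d)})"

labels = (
    dftest['col1']
    .str.extractall(pattern)[0]      # one row per match, in text order
    .map(d)                          # search word -> output word
    .groupby(level=0).agg(' '.join)  # back to one string per original row
    .rename('col2')
)

# Rows without a match come back as NaN after the join; replace with ''.
df2 = dftest.join(labels).fillna({'col2': ''})
print(df2)
```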