I have a problem: I want to check whether a certain regex occurs in a text (the regex will become more complex later). My code snippet works, but it takes a long time to run. How can I rewrite it to make it faster and more efficient?
If an element is present in the text, the code of that element should be written into a new column. If no element is present, 999 should be written.
Dataframe
   customerId                text  element  code
0           1  Something with Cat      cat     0
1           3  That is a huge dog      dog     1
2           3         Hello agian    mouse     2
Code snippet
import pandas as pd
import copy
import re
d = {
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ['cat', 'dog', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
print(df)
def f(x):
    match = 999
    for element in df['element'].unique():
        check = bool(re.search(element, x['text'], re.IGNORECASE))
        if(check):
            #print(forwarder)
            match = df['code'].loc[df['element']== element].iloc[0]
            break
    x['test'] = match
    return x
    #print(match)
df['test'] = None
df = df.apply(lambda x: f(x), axis = 1)
Intended output
   customerId                text  element  code  test
0           1  Something with Cat      cat     0     0
1           3  That is a huge dog      dog     1     1
2           3         Hello agian    mouse     2   999
CodePudding user response:
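You can use pandas' Series.str.contains to build a boolean mask, then numpy.where to fill the new column with df['code'] where the mask is True and 999 elsewhere. Note that the mask is True when any element occurs in the text, and the code that gets assigned is the one in the same row.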
import numpy as np
mask = df['text'].str.contains('|'.join(df['element']), case=False)
df['test'] = np.where(mask, df['code'], 999)
print(df)
If the text can instead contain an element from a different row, for example "text": ["Something with Dog", "That is a huge Cat", "Hello agian"] should give [1, 0, 999], you can create a dict that maps each element to its code. Search each text with the combined regex: if there is a match, use its code from the dict, otherwise use 999.
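import re
dct = dict(zip(df['element'].str.lower(), df['code']))
pattern = re.compile("|".join(dct.keys()), re.IGNORECASE)
def lookup(text):
    # Search once and reuse the match object
    m = pattern.search(text)
    # The matched text is the dict key, which holds while the elements are literal strings
    return dct[m.group(0).lower()] if m else 999
df['test'] = df['text'].apply(lookup)
print(df)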
Output:
   customerId                text  element  code  test
0           1  Something with Cat      cat     0     0
1           3  That is a huge Dog      dog     1     1
2           3         Hello agian    mouse     2   999
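Since the question mentions that the regex will become more complex later, the dict lookup above may stop working, because the matched text would no longer be equal to a dict key. Here is a minimal sketch of one way around that, assuming each element is wrapped in its own named group and the code is looked up through the match object's lastgroup. The group names g0, g1, ... and the element d[o0]g are made up for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "text": ["Something with Dog", "That is a huge Cat", "Hello agian"],
    # Hypothetical data: the second element is a real regex, not a literal.
    "element": ["cat", "d[o0]g", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# One named group per unique element; the names g0, g1, ... are arbitrary.
elements = df.drop_duplicates("element")
group_to_code = {f"g{i}": code for i, code in enumerate(elements["code"])}
pattern = re.compile(
    "|".join(f"(?P<g{i}>{el})" for i, el in enumerate(elements["element"])),
    re.IGNORECASE,
)


def lookup(text, default=999):
    m = pattern.search(text)
    # lastgroup names the alternative that matched, whatever text it matched.
    return group_to_code[m.lastgroup] if m else default


df["test"] = df["text"].map(lookup)
print(df)
```

This sketch assumes the element patterns do not use numbered backreferences, because wrapping them in extra groups shifts the group numbers.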
CodePudding user response:
You can use pandas apply to iterate over the rows and check whether each row's element occurs in its text. Here is one solution using numpy:
import numpy as np
import pandas as pd
d = {
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ['cat', 'dog', 'mouse']
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
# Case-insensitive literal substring check of each row's element in its own text
df['test'] = np.where(df.apply(lambda x: x.element.lower() in x.text.lower(), axis=1), df['code'], 999)
Output:
df
   customerId                text  element  code  test
0           1  Something with Cat      cat     0     0
1           3  That is a huge dog      dog     1     1
2           3         Hello agian    mouse     2   999
You can also do the same thing with a lambda function in df.apply:
df['test'] = df.apply(lambda x: x.code if x.element.lower() in x.text.lower() else 999, axis=1)
This gives us the same thing
df
   customerId                text  element  code  test
0           1  Something with Cat      cat     0     0
1           3  That is a huge dog      dog     1     1
2           3         Hello agian    mouse     2   999
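Because df.apply with axis=1 builds a Series object for every row, it is usually slow on large frames. As a rough sketch of the same row-wise rule without that overhead, assuming a plain literal substring check is all that is needed, a list comprehension over the zipped columns could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

# Same rule as the apply versions: keep the row's code when its own element
# occurs in its own text (case-insensitive, literal substring), else 999.
df["test"] = [
    code if element.lower() in text.lower() else 999
    for text, element, code in zip(df["text"], df["element"], df["code"])
]
print(df)
```

Like the apply versions, this only compares each text with the element in the same row and does not treat the element as a regex.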
CodePudding user response:
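Use pandas' built-in vectorized string methods instead of a row-wise apply; this is usually a lot faster.
# ...
for element in df['element'].unique():
    # str.contains searches anywhere in the text, like re.search;
    # str.match would only match at the start of the string
    matching = df[df['text'].str.contains(element, case=False)]
# ...
The matching variable contains all rows whose text matches the given regex (element).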
You can also read more about pandas and regex on this excellent site: https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/.
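To turn the loop over elements into the test column from the question, one option is to fill in codes element by element. The following is only a sketch under a few assumptions: it uses str.contains (which searches anywhere in the string, like re.search in the question, whereas str.match only matches at the start), it lets the first matching element win as in the question's loop, and the helper names test, unmatched and pairs are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian"],
    "element": ["cat", "dog", "mouse"],
})
df["code"] = df["element"].astype("category").cat.codes

test = pd.Series(999, index=df.index)        # default when nothing matches
unmatched = pd.Series(True, index=df.index)  # rows that have no code yet

# One (element, code) pair per unique element, in order of first appearance.
pairs = df.drop_duplicates("element")[["element", "code"]]
for element, code in pairs.itertuples(index=False):
    # Vectorized, case-insensitive regex search over the whole column.
    hit = df["text"].str.contains(element, case=False, regex=True) & unmatched
    test[hit] = code
    unmatched &= ~hit  # earlier elements keep priority over later ones

df["test"] = test
print(df)
```

This runs one vectorized pass per unique element, so it tends to pay off when there are many rows and comparatively few distinct elements.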
