I want to train a custom NER BERT model. Therefore I need to process my input data in a certain way.
My df_input looks like this:
df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
'KeyWord1' : ['doc', 'document'],
'KeyWord2' : ['12', '13'],
'KeyWord3' : ['ab', 'xx']
})
DocumentText          KeyWord1   KeyWord2   KeyWord3
This is a doc 12 ab   doc        12         ab
document 13 a xx      document   13         xx
....
All the text in the DocumentText column should be tokenized. Then all the tokens should receive the tag O and each token that matches with a KeyWord column should receive the tag corresponding to the column name.
What it should look like:
Word DocNr Tag
This 1 O
is 1 O
a 1 O
doc 1 KeyWord1
12 1 KeyWord2
ab 1 KeyWord3
document 2 KeyWord1
13 2 KeyWord2
a 2 O
xx 2 KeyWord3
I have working code, but it is very slow: it takes many hours. After the for-loop I tried the apply method with a lambda function, but I'm stuck there because it returns a Series object that holds one DataFrame per document in each row.
def preprocess(doctext, docnr, keyword1, keyword2, keyword3):
    # Tokenize one document; every token starts with the tag 'O'
    df1 = pd.DataFrame(columns = ['Word'])
    df1['Word'] = nltk.word_tokenize(str(doctext))
    df1['DocNR'] = docnr
    df1['Tag'] = 'O'
    # .loc instead of chained indexing, which may write to a copy
    df1.loc[df1['Word'] == keyword1, 'Tag'] = 'KeyWord1'
    df1.loc[df1['Word'] == keyword2, 'Tag'] = 'KeyWord2'
    df1.loc[df1['Word'] == keyword3, 'Tag'] = 'KeyWord3'
    return df1
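for i in range(0, 50000):
    try:
        df = df.append(preprocess(df_input['DocumentText'][i],
                                  i + 1,
                                  df_input['KeyWord1'][i],
                                  df_input['KeyWord2'][i],
                                  df_input['KeyWord3'][i]),
                       ignore_index=True)
    except Exception:  # except clause is missing from the post; skipping the row is an assumption
        continue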
# apply attempt: gives back a Series with one DataFrame per document
pd.DataFrame(df_input.apply(lambda row: preprocess(row['DocumentText'],
                                                   row.name,
                                                   row['KeyWord1'],
                                                   row['KeyWord2'],
                                                   row['KeyWord3']),
                            axis=1))[0]
Are there any other ways to achieve this result quickly?
CodePudding user response:
You could try first tokenizing each entry with df.apply and then matching the words to keywords:
import pandas as pd
import nltk
nltk.download('punkt')
df1 = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
'KeyWord1' : ['doc', 'document'],
'KeyWord2' : ['12', '13'],
'KeyWord3' : ['ab', 'xx']
})
df = pd.DataFrame(data={'word': df1.DocumentText.apply(nltk.word_tokenize)})
df.index += 1
df = df.explode('word')
df = df.rename_axis('DocNr')
df['Tag'] = 0
df.loc[df['word'].isin(df1['KeyWord1'].to_numpy()), 'Tag'] = 'KeyWord1'
df.loc[df['word'].isin(df1['KeyWord2'].to_numpy()), 'Tag'] = 'KeyWord2'
df.loc[df['word'].isin(df1['KeyWord3'].to_numpy()), 'Tag'] = 'KeyWord3'
           word       Tag
DocNr
1          This         0
1            is         0
1             a         0
1           doc  KeyWord1
1            12  KeyWord2
1            ab  KeyWord3
2      document  KeyWord1
2            13  KeyWord2
2             a         0
2            xx  KeyWord3
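One thing to keep in mind: isin compares every token with the keywords of all documents, so a token in one document that happens to be a keyword of another document would get tagged as well. If that is not wanted, the comparison can be done per row. The following is only a sketch under a few assumptions: it uses plain whitespace splitting instead of nltk so that it runs without downloads, it uses 'O' as the default tag as in the question, and the names tokens, keyword_cols and result are made up for illustration.

```python
import pandas as pd

df_input = pd.DataFrame({
    'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
    'KeyWord1': ['doc', 'document'],
    'KeyWord2': ['12', '13'],
    'KeyWord3': ['ab', 'xx'],
})

keyword_cols = ['KeyWord1', 'KeyWord2', 'KeyWord3']  # illustrative name

# One row per token; the keyword columns are repeated for every token
# of a document, and the old row index becomes the DocNr column.
tokens = (df_input
          .assign(Word=df_input['DocumentText'].str.split())  # assumption: whitespace tokens
          .explode('Word')
          .rename_axis('DocNr')
          .reset_index())
tokens['DocNr'] += 1

tokens['Tag'] = 'O'
for col in keyword_cols:
    # Compare each token only with the keyword of its own document.
    tokens.loc[tokens['Word'] == tokens[col], 'Tag'] = col

result = tokens[['Word', 'DocNr', 'Tag']]
print(result)
```

The loop runs once per keyword column, not once per document, so the work per iteration is still vectorized.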
CodePudding user response:
Use df.explode with Series.map for better performance:
In [736]: df_input.DocumentText = df_input.DocumentText.str.split()
In [713]: x = df_input.explode('DocumentText')
In [715]: y = x.iloc[:, 1:].drop_duplicates()
In [716]: d = {i:y[i].values.tolist() for i in y.columns}
In [733]: d = {i:k for k,v in d.items() for i in v}
In [724]: x['tags'] = x.DocumentText.map(d).fillna(0)
In [726]: x['DocNr'] = x.index + 1
In [730]: res = x[['DocumentText', 'DocNr', 'tags']].reset_index(drop=True)
In [731]: res
Out[731]:
  DocumentText  DocNr      tags
0         This      1         0
1           is      1         0
2            a      1         0
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         0
9           xx      2  KeyWord3
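The two dictionary steps are the core of this approach, so here is a small sketch of what they build, written out on the keyword columns of the sample data. The names by_column, lookup and words are made up for illustration, and 'O' is used as the default tag as in the question (the session above fills with 0).

```python
import pandas as pd

keywords = pd.DataFrame({
    'KeyWord1': ['doc', 'document'],
    'KeyWord2': ['12', '13'],
    'KeyWord3': ['ab', 'xx'],
})

# Step 1: column name -> list of the keywords in that column.
by_column = {col: keywords[col].tolist() for col in keywords.columns}

# Step 2: invert it, so that every keyword points at its column name.
lookup = {word: col for col, words in by_column.items() for word in words}

words = pd.Series(['This', 'is', 'a', 'doc', '12', 'ab'])
# Words that are not in the lookup map to NaN and then get the default tag.
tags = words.map(lookup).fillna('O')
print(tags.tolist())
```

Because the lookup is one dictionary for all documents, a word is tagged whenever it is a keyword of any document, and a keyword that occurs in two columns keeps only the last column name.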
CodePudding user response:
This should be pretty fast:
import numpy as np
# df is the input frame (df_input in the question)
e = df.assign(DocumentText=df['DocumentText'].str.split(r'\s+')).explode('DocumentText')
keywords = e.filter(like='KeyWord')
# Compare each token with the keywords of its own row; column j contributes j + 1, no match sums to 0
col_idxes = np.sum((e['DocumentText'].to_numpy()[:, None] == keywords.to_numpy()) * np.arange(1, keywords.shape[1] + 1), axis=1)
tags = np.array(['O', *keywords.columns])[col_idxes]
out = e[['DocumentText']].assign(DocNr=e.index + 1, Tag=tags).reset_index(drop=True)
Output:
>>> out
  DocumentText  DocNr       Tag
0         This      1         O
1           is      1         O
2            a      1         O
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         O
9           xx      2  KeyWord3
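The col_idxes line is dense, so here is the same broadcasting step on three tokens of a single document. This is only an illustration with hand-written arrays (words, keywords and matches are made-up names), not part of the solution itself.

```python
import numpy as np

words = np.array(['a', 'doc', '12'])             # three tokens of one document
keywords = np.array([['doc', '12', 'ab']] * 3)   # that document's keywords, one row per token

# Shape (3, 1) against shape (3, 3) broadcasts to a (3, 3) boolean matrix:
# entry [i, j] is True when token i equals the keyword in column j.
matches = words[:, None] == keywords

# Weight column j with j + 1, so that a row without any match sums to 0.
col_idxes = np.sum(matches * np.arange(1, keywords.shape[1] + 1), axis=1)

# Position 0 is the default tag, positions 1..3 are the keyword columns.
tags = np.array(['O', 'KeyWord1', 'KeyWord2', 'KeyWord3'])[col_idxes]
print(col_idxes, tags)
```

Note that the sum assumes a token matches at most one keyword column. If the same keyword appeared in two columns of a row, the weights would add up and point at the wrong tag or past the end of the tag array.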
