Process input data to a correct format for a custom NER BERT model

I want to train a custom NER BERT model. Therefore I need to process my input data in a certain way.

My df_input looks like this:

df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1' : ['doc', 'document'],
                         'KeyWord2' : ['12', '13'],
                         'KeyWord3' : ['ab', 'xx']
                        })

DocumentText                KeyWord1       KeyWord2      KeyWord3
This is a doc 12 ab         doc            12            ab
document 13 a xx            document       13            xx
....

All the text in the DocumentText column should be tokenized. Every token should receive the tag O by default, and each token that matches the value in one of the KeyWord columns should receive that column's name as its tag.

What it should look like:

Word       DocNr      Tag
This       1          O
is         1          O
a          1          O
doc        1          KeyWord1
12         1          KeyWord2
ab         1          KeyWord3
document   2          KeyWord1
13         2          KeyWord2
a          2          O
xx         2          KeyWord3
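
The tagging rule described above can be sketched in plain Python for a single document. This is only an illustration under some assumptions: it splits on whitespace instead of calling nltk.word_tokenize, and the helper name tag_document is hypothetical.

```python
def tag_document(text, doc_nr, keywords):
    """Return (word, doc_nr, tag) rows for one document.

    keywords maps a tag name (the column name) to this document's keyword.
    """
    # Invert the mapping so each token can look up its tag directly.
    tag_of = {word: tag for tag, word in keywords.items()}
    # Tokens that are not a keyword fall back to the default tag 'O'.
    return [(token, doc_nr, tag_of.get(token, 'O')) for token in text.split()]


rows = tag_document('This is a doc 12 ab', 1,
                    {'KeyWord1': 'doc', 'KeyWord2': '12', 'KeyWord3': 'ab'})
for row in rows:
    print(row)
```

Collecting plain tuples like this and building one DataFrame at the end avoids creating a small DataFrame per document.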

I have working code, but it is very slow: it takes many hours. After the for-loop I tried the apply method with a lambda function, but I'm stuck there because it returns a Series object that holds one DataFrame per document in each row.

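import nltk
import pandas as pd

def preprocess(doctext, docnr, keyword1, keyword2, keyword3):
    # One row per token; every token starts with the default tag 'O'
    df1 = pd.DataFrame(columns = ['Word'])
    df1['Word'] = nltk.word_tokenize(str(doctext))
    df1['DocNR'] = docnr
    df1['Tag'] = 'O'
    # .loc avoids chained assignment, which may fail to update df1
    df1.loc[df1['Word'] == keyword1, 'Tag'] = 'KeyWord1'
    df1.loc[df1['Word'] == keyword2, 'Tag'] = 'KeyWord2'
    df1.loc[df1['Word'] == keyword3, 'Tag'] = 'KeyWord3'
    return df1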

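df = pd.DataFrame()
for i in range(0, 50000):
    # Growing the frame copies it on every iteration, which is why this is slow.
    # pd.concat stands in for DataFrame.append, which was removed in pandas 2.0.
    df = pd.concat([df, preprocess(df_input['DocumentText'][i],
                                   i + 1,
                                   df_input['KeyWord1'][i],
                                   df_input['KeyWord2'][i],
                                   df_input['KeyWord3'][i])],
                   ignore_index=True)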

pd.DataFrame(df_input.apply(lambda row: preprocess(row['DocumentText'], 
                                        row.name,
                                        row['KeyWord1'],
                                        row['KeyWord2'],
                                        row['KeyWord3']),
                                        axis=1))[0]
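
For reference, a collection of per-document DataFrames can be stacked with a single pd.concat call instead of being appended one by one. The sketch below is a simplified stand-in, not the code above: preprocess_simple is a hypothetical helper that splits on whitespace rather than using nltk, and the frames are collected in a list. The same pd.concat call should also accept the values of a Series of DataFrames (for example via .tolist()).

```python
import pandas as pd

df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1': ['doc', 'document'],
                         'KeyWord2': ['12', '13'],
                         'KeyWord3': ['ab', 'xx']})


def preprocess_simple(row):
    # Hypothetical stand-in for preprocess(): whitespace split, no nltk.
    frame = pd.DataFrame({'Word': row['DocumentText'].split()})
    frame['DocNr'] = row.name + 1  # row.name is the 0-based index label
    frame['Tag'] = 'O'
    for col in ['KeyWord1', 'KeyWord2', 'KeyWord3']:
        frame.loc[frame['Word'] == row[col], 'Tag'] = col
    return frame


# Build one small DataFrame per document ...
frames = [preprocess_simple(row) for _, row in df_input.iterrows()]
# ... and stack them all in one call instead of growing a frame in a loop.
result = pd.concat(frames, ignore_index=True)
print(result)
```

This removes the repeated copying, although creating a DataFrame per document still carries overhead for 50000 rows.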

Are there any other ways to achieve this result quickly?

CodePudding user response：

You could try first tokenizing each entry with df.apply and then matching the words to keywords:

import pandas as pd
import nltk
nltk.download('punkt')
df1 = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1' : ['doc', 'document'],
                         'KeyWord2' : ['12', '13'],
                         'KeyWord3' : ['ab', 'xx']
                        })

df = pd.DataFrame(data={'word': df1.DocumentText.apply(nltk.word_tokenize)})
df.index += 1
df = df.explode('word')
df = df.rename_axis('DocNr')
df['Tag'] = 'O'
df.loc[df['word'].isin(df1['KeyWord1'].to_numpy()), 'Tag'] = 'KeyWord1'
df.loc[df['word'].isin(df1['KeyWord2'].to_numpy()), 'Tag'] = 'KeyWord2'
df.loc[df['word'].isin(df1['KeyWord3'].to_numpy()), 'Tag'] = 'KeyWord3'

           word       Tag
DocNr                    
1          This         O
1            is         O
1             a         O
1           doc  KeyWord1
1            12  KeyWord2
1            ab  KeyWord3
2      document  KeyWord1
2            13  KeyWord2
2             a         O
2            xx  KeyWord3
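
Note that isin tests every word against the keywords of all documents, so a word that happens to be another document's keyword would be tagged as well. If each token should only be compared with its own document's keywords, something like the following sketch might work. It assumes a whitespace split instead of nltk, and the variable names are illustrative.

```python
import pandas as pd

df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1': ['doc', 'document'],
                         'KeyWord2': ['12', '13'],
                         'KeyWord3': ['ab', 'xx']})
keyword_cols = ['KeyWord1', 'KeyWord2', 'KeyWord3']

# One entry per token; the index still holds the label of the source row.
tokens = df_input['DocumentText'].str.split().explode()

out = pd.DataFrame({'Word': tokens.to_numpy(),
                    'DocNr': tokens.index.to_numpy() + 1})
out['Tag'] = 'O'

for col in keyword_cols:
    # Repeat each document's keyword once per token of that document.
    own_keyword = df_input[col].loc[tokens.index].to_numpy()
    # Element-wise comparison: a token only matches its own document's keyword.
    out.loc[tokens.to_numpy() == own_keyword, 'Tag'] = col

print(out)
```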

CodePudding user response：

Use df.explode with Series.map for better performance:

In [736]: df_input.DocumentText = df_input.DocumentText.str.split()

In [713]: x = df_input.explode('DocumentText')
In [715]: y = x.iloc[:, 1:].drop_duplicates()

In [716]: d = {i:y[i].values.tolist() for i in y.columns}

In [733]: d = {i:k for k,v in d.items() for i in v}

In [724]: x['tags'] = x.DocumentText.map(d).fillna(0)
In [726]: x['DocNr'] = x.index + 1
In [730]: res = x[['DocumentText', 'DocNr', 'tags']].reset_index(drop=True)

In [731]: res
Out[731]: 
  DocumentText  DocNr      tags
0         This      1         0
1           is      1         0
2            a      1         0
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         0
9           xx      2  KeyWord3

CodePudding user response：

This should be pretty fast:

import numpy as np

# df is assumed to be the question's df_input
e = df.assign(DocumentText=df['DocumentText'].str.split(r'\s+')).explode('DocumentText')
keywords = e.filter(like='KeyWord')
# Compare each token with its row's keywords; a match in column k contributes k, no match gives 0
col_idxes = np.sum((e['DocumentText'].to_numpy()[:, None] == keywords.to_numpy()) * np.arange(1, keywords.shape[1] + 1), axis=1)
tags = np.array(['O', *keywords.columns])[col_idxes]
out = e[['DocumentText']].assign(DocNr=e.index + 1, Tag=tags).reset_index(drop=True)

Output:

>>> out
  DocumentText  DocNr       Tag
0         This      1         O
1           is      1         O
2            a      1         O
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         O
9           xx      2  KeyWord3