Map column lists to dictionary and create new column with padded strings

Given this dataframe and word_index dictionary:

import pandas as pd

df = pd.DataFrame(data={'text_ids': [
                                     [1, 2, 3, 2, 7, 2, 8, 2, 0],
                                     [1, 2, 4, 2, 7, 2, 8, 2, 0],
                                     [1, 2, 5, 2, 6, 2, 8, 2, 0],
                                     [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]
                                    ]})

word_index = {0: '<eos>', 1: '<sos>', 2: '/s', 3: 'he', 4: 'she', 5:'they', 6:'love', 7:'loves', 8: 'cats', 9: 'we', 10: 'talking', 11: 'about', 12: '<pad>'}

How can I map each sequence in text_ids to its corresponding value(s) in word_index, while making sure that each /s token becomes an actual space in the resulting string? I also need to append <pad> tokens to every string whose id sequence is shorter than the longest sequence.

Expected output:

                                 text_ids                                       text
0             [1, 2, 3, 2, 7, 2, 8, 2, 0]   <sos> he loves cats <eos><pad><pad><pad>
1             [1, 2, 4, 2, 7, 2, 8, 2, 0]  <sos> she loves cats <eos><pad><pad><pad>
2             [1, 2, 5, 2, 6, 2, 8, 2, 0]  <sos> they love cats <eos><pad><pad><pad>
3  [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]     <sos> we love talking about cats <eos>

CodePudding user response：

You could use map to assign the values from your dictionary. Make sure to first replace the '/s' entry in the dictionary with ' '.

Then reshape your dataframe to wide format with pivot to ensure the same number of items and fillna the missing spots with "<pad>".

Finally aggregate to a string with apply and join to the original dataframe:

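# make token 2 render as a real space
word_index[2] = ' '

# one row per token; the original row label ends up in the 'index' column
df2 = df['text_ids'].explode().map(word_index).reset_index()

df.join(
 df2.assign(col=df2.groupby('index').cumcount())
    # keyword arguments: pandas 2.0+ rejects positional arguments to pivot
    .pivot(index='col', columns='index', values='text_ids')
    .fillna('<pad>')
    .apply(''.join)
    .rename('text')
)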

output:

                                 text_ids                                       text
0             [1, 2, 3, 2, 7, 2, 8, 2, 0]   <sos> he loves cats <eos><pad><pad><pad>
1             [1, 2, 4, 2, 7, 2, 8, 2, 0]  <sos> she loves cats <eos><pad><pad><pad>
2             [1, 2, 5, 2, 6, 2, 8, 2, 0]  <sos> they love cats <eos><pad><pad><pad>
3  [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0]      <sos> we love talking about cats<eos>
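
The pivot and fillna steps above amount to padding every token row to the length of the longest one. Here is a minimal plain-Python sketch of the same idea, assuming the data from the question and using itertools.zip_longest in place of the pandas reshape. The names text_ids, tokens, and padded are only illustrative and are not part of the answer.

```python
from itertools import zip_longest

# word_index from the question, with key 2 already remapped to a space
word_index = {0: '<eos>', 1: '<sos>', 2: ' ', 3: 'he', 4: 'she', 5: 'they',
              6: 'love', 7: 'loves', 8: 'cats', 9: 'we', 10: 'talking',
              11: 'about', 12: '<pad>'}

text_ids = [
    [1, 2, 3, 2, 7, 2, 8, 2, 0],
    [1, 2, 4, 2, 7, 2, 8, 2, 0],
    [1, 2, 5, 2, 6, 2, 8, 2, 0],
    [1, 2, 9, 2, 6, 2, 10, 2, 11, 2, 8, 0],
]

# Map every id to its token, row by row.
tokens = [[word_index[i] for i in row] for row in text_ids]

# zip_longest transposes the rows and fills the short ones with '<pad>'
# (the analogue of pivot + fillna); the outer zip(*...) transposes back.
padded = [''.join(row) for row in zip(*zip_longest(*tokens, fillvalue='<pad>'))]

for line in padded:
    print(line)
```

The last row ends in cats<eos> without a space because its id list has no 2 between 8 and 0, which matches the pandas output above.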

Another option using apply:

word_index[2] = ' '

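# padding: one '<pad>' per position missing relative to the longest sequence
l = df['text_ids'].str.len()
pad = (l.max() - l).map(lambda n: '<pad>' * n)

# join the mapped tokens of each row, then append that row's padding
df['text'] = df['text_ids'].apply(lambda s: ''.join(word_index[e] for e in s)) + pad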
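
The per-row logic of this option can also be checked without pandas. The following is a small sketch rather than the answer's own code: decode_and_pad is a hypothetical helper, the two rows are made-up sample data, and key 2 is assumed to be remapped to a space already.

```python
def decode_and_pad(ids, word_index, max_len, pad_token='<pad>'):
    """Join the tokens for `ids`, then append one pad token per missing position."""
    text = ''.join(word_index[i] for i in ids)
    return text + pad_token * (max_len - len(ids))


# Reduced dictionary for illustration; key 2 already maps to a space.
word_index = {0: '<eos>', 1: '<sos>', 2: ' ', 3: 'he', 7: 'loves', 8: 'cats'}

# Made-up sample rows of different lengths (9 and 5 ids).
rows = [[1, 2, 3, 2, 7, 2, 8, 2, 0], [1, 2, 3, 2, 7]]

max_len = max(len(r) for r in rows)
result = [decode_and_pad(r, word_index, max_len) for r in rows]
print(result)
```

With the dataframe from the question, the helper could presumably be applied as df['text_ids'].apply(decode_and_pad, args=(word_index, df['text_ids'].str.len().max())).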

CodePudding user response：

Another option:

(df["text_ids"]
    .explode()
    .map(word_index)
    .groupby(level=0)
    .apply(lambda q: " ".join(q)))