How to move text from an old column to newly created columns using str.contains in pandas

I want to move the text of a description column into newly created columns, based on keywords, in Python.

For example, if keywords are 'Table', 'Fan', 'Chair'

Description(Given)       Keyword Table        Keyword Fan        Keyword Chair

The table is long        The table is long
The fan is nice                               The fan is nice
The fan is cheap                              The fan is cheap
The chair is brown                                               The chair is brown

I tried both str.contains() and str.findall(), but they return either a True/False boolean or just the keyword itself (e.g. 'chair'):

df['Keyword Table'] = df['Description'].str.contains('Table')

AND

keywords=['Table']
df['Keyword Table'] = df['Description'].str.findall((keywords)).apply(set)
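To make the symptom concrete, here is a minimal sketch of what the two calls return on the sample data. The variable names are made up for illustration, and a plain string pattern is assumed instead of the list used above.

```python
import pandas as pd

df = pd.DataFrame({'Description': ['The table is long', 'The fan is nice',
                                   'The fan is cheap', 'The chair is brown']})

# str.contains returns one boolean per row
as_bool = df['Description'].str.contains('table')

# str.findall returns a list of the matched substrings per row
as_list = df['Description'].str.findall('table')

# Matching is case-sensitive by default, so 'Table' finds nothing here
capitalised = df['Description'].str.contains('Table')
```

Neither result contains the full description text, which is why an extra step (masking or extracting) is needed.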

CodePudding user response：

Here is a simple way using a regex with named capturing groups:

import pandas as pd

df = pd.DataFrame({'Desc': ['The table is long', 'The fan is nice', 'The fan is cheap', 'The chair is brown']})
words = ['table', 'fan', 'chair']

# one named group per keyword; each group captures the whole description
regex = '|'.join(f'(?P<{w}>.*{w}.*)' for w in words)
df.join(df['Desc'].str.extract(regex, expand=False).add_prefix('keyword_'))

NB. Named capturing groups cannot contain special characters or spaces. If your keywords do, let me know; the capturing groups can be given different names. Output:

                 Desc      keyword_table       keyword_fan       keyword_chair
0   The table is long  The table is long               NaN                 NaN
1     The fan is nice                NaN   The fan is nice                 NaN
2    The fan is cheap                NaN  The fan is cheap                 NaN
3  The chair is brown                NaN               NaN  The chair is brown
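On the caveat about group names: a minimal sketch of one way around it, assuming keywords that contain spaces or hyphens. The sample rows, the keywords and the 'Keyword ' column prefix are made up for illustration. It uses unnamed groups and sets the column names afterwards.

```python
import re
import pandas as pd

df = pd.DataFrame({'Desc': ['The coffee table is long',
                            'The fan is nice',
                            'The arm-chair is brown']})
words = ['coffee table', 'fan', 'arm-chair']

# Unnamed groups have no naming restrictions; re.escape protects any
# regex metacharacters inside the keywords.
regex = '|'.join(f'(.*{re.escape(w)}.*)' for w in words)

# extract() yields one column per group, numbered 0, 1, 2, ...
out = df['Desc'].str.extract(regex, flags=re.IGNORECASE)
out.columns = [f'Keyword {w}' for w in words]

result = df.join(out)
```

Because the groups are joined with `|`, only the first alternative that matches captures anything, so a description mentioning two keywords fills just one column.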

Another option uses `get_dummies`:

df = pd.DataFrame({'Desc': ['The table is long', 'The fan is nice', 'The fan is cheap', 'The chair is brown']})
words = ['table', 'fan', 'chair']

regex = '(%s)' % '|'.join(words)
df.join(pd.get_dummies(df['Desc'].str.extract(regex, expand=False))
          .mul(df['Desc'], axis=0)
          .add_prefix('keyword_')
        )

CodePudding user response：

Your boolean series can be used as an index to slice your dataframe, like this (case=False is needed so that 'Table' also matches 'table'):

df['Keyword Table'] = df[df['Description'].str.contains('Table', case=False, na=False)]['Description']

For a list of keywords, you can use apply:

keywords = ['Table', 'Fan', 'Chair']

df['Keywords'] = df[df['Description'].apply(lambda x: any(k.lower() in x.lower() for k in keywords))]['Description']
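To get one column per keyword, as in the question, the same mask idea can be applied in a loop. This is a minimal sketch, assuming the 'Keyword <word>' column naming from the question; the row with a missing description is added only to show how `na=False` behaves.

```python
import pandas as pd

df = pd.DataFrame({'Description': ['The table is long', 'The fan is nice',
                                   'The fan is cheap', 'The chair is brown',
                                   None]})
keywords = ['Table', 'Fan', 'Chair']

for k in keywords:
    # Boolean mask: True where the description mentions the keyword.
    # case=False ignores capitalisation; na=False treats missing text as no match.
    mask = df['Description'].str.contains(k, case=False, na=False)
    # where() keeps the text where the mask is True and leaves a missing value elsewhere
    df[f'Keyword {k}'] = df['Description'].where(mask)
```

With this approach a description that mentions several keywords appears in each of the matching columns.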

CodePudding user response：

Does this piece of code help?

df = pd.DataFrame({'Desc':['cat is black','dog is white']})
kw = ['cat','dog']
for k in kw:
   df[k + ' col'] = df.Desc.map(lambda s: s if k in s else '')

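The output has one column per keyword ('cat col' and 'dog col'), holding the description where the keyword occurs and an empty string otherwise.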
