Home > Enterprise >  How to create a new column based on string match of existing column in pandas
How to create a new column based on string match of existing column in pandas

Time:01-05

I am using my bank transactions data to bucket them into different categories.

This is the data frame df I have:

Date         Transaction             Debit    Credit    Balance
13/12/21     ECOM PUR/ZOMATO/xxx      200      -         30000
13/12/21     UPI/P2M/xxx/SWIGGY/      500      -         29500
14/12/21     NEFT/xxxxx/              -        30000     59500

I need to search for string matches in the transaction column and bucket them into categories in a new column.

E.g. SWIGGY and ZOMATO gets bucketed as food and NEFT is bucketed as salary

Desired output:

Date         Transaction             Debit    Credit    Balance        Categories
13/12/21     ECOM PUR/ZOMATO/xxx      200      -         30000          Food
13/12/21     UPI/P2M/xxx/SWIGGY/      500      -         29500          Food
14/12/21     NEFT/xxxxx/              -        30000     59500          Salary

Eg. in R, for string matching I would use something like

df[grepl('SWIGGY|ZOMATO', transactions)]$Categories <- 'Food'

How do I go about this in python? Really new to python, would appreciate the help. TIA :)

CodePudding user response:

Create a dict mapping of your categories then split your Transaction column and find match:

# Your categories here
categories = {'Food': ['SWIGGY', 'ZOMATO'],
              'Salary': ['NEFT']}

# create a reverse dict of categories
mappings = {v: k for k, l in categories.items() for v in l}

df['Categories'] = df['Transaction'].str.split('/').explode().map(mappings).dropna()

Output:

>>> df
       Date          Transaction Debit Credit  Balance Categories
0  13/12/21  ECOM PUR/ZOMATO/xxx   200      -    30000       Food
1  13/12/21  UPI/P2M/xxx/SWIGGY/   500      -    29500       Food
2  14/12/21          NEFT/xxxxx/     -  30000    59500     Salary

>>> categories
{'Food': ['SWIGGY', 'ZOMATO'], 'Salary': ['NEFT']}

>>> mappings
{'SWIGGY': 'Food',
 'ZOMATO': 'Food',
 'NEFT': 'Salary'}

Update: In the case of you have multiple match categories on one row and/or no categories, you can use:

categories = {'Food': ['SWIGGY', 'ZOMATO'],
              'Cream': ['ECOM PUR']}
mappings = {v: k for k, l in categories.items() for v in l}

df['Categories'] = \
    df['Transaction'].str.split('/').explode().map(mappings).dropna() \
                     .groupby(level=0).apply(lambda x: ','.join(set(x))) \
                     .reindex(df.index).fillna('Other')

Output:

>>> df
       Date          Transaction Debit Credit  Balance  Categories
0  13/12/21  ECOM PUR/ZOMATO/xxx   200      -    30000  Cream,Food
1  13/12/21  UPI/P2M/xxx/SWIGGY/   500      -    29500        Food
2  14/12/21          NEFT/xxxxx/     -  30000    59500       Other
  •  Tags:  
  • Related