I have a dataframe df1:
| ID | item |
|---|---|
| 11111 | chair |
| 11112 | desk¥blue chair |
| 11113 | bed¥pen¥cable |
| 11114 | mug¥old sofa¥toy |
| 11115 | old mug¥wine glass |
| 11116 | blue chair¥old mug |
etc.
and a large dictionary item_dict (from csv) which categorize some but not all items into categories:
| item | category |
|---|---|
| chair | furniture |
| blue chair | furniture |
| mug | tableware |
| old mug | tableware |
| wine glass | tableware |
etc.
The goal is to make a new column in df1 of all category of interest for items, if any, in each row:
| ID | item | category |
|---|---|---|
| 11111 | chair | furniture |
| 11112 | desk¥blue chair | furniture |
| 11113 | bed¥pen¥cable | |
| 11114 | mug¥old sofa¥toy | mug |
| 11115 | old mug¥wine glass | mug |
| 11116 | blue chair¥old mug | furniture, mug |
I’ve tried
df1[‘category’] = df1[‘item’].replace(item_dict)
which will only replace values that only have a single item;
adding regex=True would allow multi-item entries to be replaced but would turn desk¥blue chair into desk¥blue furniture bc a shorter one existed.
I know replace isn’t the best option since I want the new column to only contain categories of interest without other item names, but am unsure of what to use.
CodePudding user response:
You can try this:
# !pip install -U swifter #### only in case of very large dataframe
import swifter
delimiter = '¥'
df1['category'] = df1['item'].swifter.apply(
lambda _: delimiter.join([item_dict.get(el, '') for el in _.split(delimiter)])
)
