Updated. Instead of
dictdata, I change for adataframeas input
I'm analyzing a DataFrame with approximately 10,000 rows and 2 columns.
The criteria of my analysis is based on whether certain words appear in a certain cell.
I believe I will be more successful if I know which words are most relevant in terms of values...
Foo data to be used as an example:
data = { 'product': ['Dell Notebook I7', 'Dell Notebook I3', 'Logitech mx keys', 'Logitech mx 2'],
'cost': [1000,1200,300,100]}
df_data = pd.DataFrame(data)
| product | cost | |
|---|---|---|
| 0 | Dell Notebook I7 | 1000 |
| 1 | Dell Notebook I3 | 1200 |
| 2 | Logitech mx keys | 300 |
| 3 | Logitech mx 2 | 100 |
Basically, the column product shoes the product an description.
In the column cost shows the product cost.
What I want:
I would like to create another dataframe like this:
Desired Output:
| unique_words | total_cost_for_unique_word | |
|---|---|---|
| 1 | Dell | 2200 |
| 4 | Logitech | 2200 |
| 5 | Notebook | 2200 |
| 2 | I3 | 1200 |
| 3 | I7 | 1000 |
| 7 | mx | 400 |
| 6 | keys | 300 |
| 0 | 2 | 100 |
- Column
unique_wordswith the list of each word that appears in the columnproduct. - Column
total_cost_for_unique_wordwith the sum of the values of products that contain that word.
I've tried searching for posts here from StackOverflow... Also, I've done google research, but I haven't found a solution. Maybe I still don't have the knowledge to find the answer.
If by any chance it has already been answered, please let me know and I will delete the post.
Thank you all.
CodePudding user response:
You can split, explode, groupby.agg:
df = pd.DataFrame(data)
(df
.assign(unique_words=df['product'].str.split())
.explode('unique_words')
.groupby('unique_words', as_index=False)
.agg(**{'total cost': ('cost' ,'sum')})
.sort_values('total cost', ascending=False, ignore_index=True)
)
Output:
unique_words total cost
0 Dell 2200
1 Notebook 2200
2 I3 1200
3 I7 1000
4 Logitech 400
5 mx 400
6 keys 300
7 2 100
CodePudding user response:
If you first split the product into a list of all words (default is " "):
df["product"] = df["product"].str.split()
You can then explode this (for each item in the list as a new line), group all these together and sum the costs, then sorting and renaming columns to suit your outcome:
df.explode("product").groupby("product",as_index=False).agg("sum").sort_values("cost", ascending=False).rename(columns={"product": "unique_words", "cost", "total_cost_for_unique_word"})
