Pandas - Sum for each unique word-CodePudding

Updated. Instead of dict data, I change for a dataframe as input

I'm analyzing a DataFrame with approximately 10,000 rows and 2 columns.

The criteria of my analysis is based on whether certain words appear in a certain cell.

I believe I will be more successful if I know which words are most relevant in terms of values...

Foo data to be used as an example:

data = { 'product': ['Dell Notebook I7', 'Dell Notebook I3', 'Logitech mx keys', 'Logitech mx 2'],
         'cost': [1000,1200,300,100]}

df_data = pd.DataFrame(data)

	product	cost
0	Dell Notebook I7	1000
1	Dell Notebook I3	1200
2	Logitech mx keys	300
3	Logitech mx 2	100

Basically, the column product shoes the product an description. In the column cost shows the product cost.

What I want:

I would like to create another dataframe like this:

Desired Output:

	unique_words	total_cost_for_unique_word
1	Dell	2200
4	Logitech	2200
5	Notebook	2200
2	I3	1200
3	I7	1000
7	mx	400
6	keys	300
0	2	100

Column unique_words with the list of each word that appears in the column product.
Column total_cost_for_unique_word with the sum of the values of products that contain that word.

I've tried searching for posts here from StackOverflow... Also, I've done google research, but I haven't found a solution. Maybe I still don't have the knowledge to find the answer.

If by any chance it has already been answered, please let me know and I will delete the post.

Thank you all.

CodePudding user response：

You can split, explode, groupby.agg:

df = pd.DataFrame(data)
(df
 .assign(unique_words=df['product'].str.split())
 .explode('unique_words')
 .groupby('unique_words', as_index=False)
 .agg(**{'total cost': ('cost' ,'sum')})
 .sort_values('total cost', ascending=False, ignore_index=True)
 )

Output:

  unique_words  total cost
0         Dell        2200
1     Notebook        2200
2           I3        1200
3           I7        1000
4     Logitech         400
5           mx         400
6         keys         300
7            2         100

CodePudding user response：

If you first split the product into a list of all words (default is " "):

df["product"] = df["product"].str.split()

You can then explode this (for each item in the list as a new line), group all these together and sum the costs, then sorting and renaming columns to suit your outcome:

df.explode("product").groupby("product",as_index=False).agg("sum").sort_values("cost", ascending=False).rename(columns={"product": "unique_words", "cost", "total_cost_for_unique_word"})