Calculating percentages for multiple columns-CodePudding

I have a dataset with the following structure

index	candidato	Page Name	Post Created Date	Total Interactions	Likes	Shares	Comments	Love	Angry
0	António Costa	Observador	2022-01-03	4500	340	400	433	545	565

There are 9 different candidato (candidates) and 27 different Page Name

Full dataset can be found

What is the best way to achieve this with Pandas? Thank you in advance for your help.

Disclosure This question is to help in a non-for-profit project that analyzes media behaviour, and bias, towards Portuguese candidates to the 2022 general elections. The prior report was made using Google Sheets but analyzing the datasets with Python is the best way, since I plan on doing this every 3 months.

The GitHub repo can be found here, where you can access all datasets and code used.

CodePudding user response：

You can groupby 'Page Name' and 'candidato' then find the sum of each of 'Total Interactions','Likes','Comments','Shares','Love','Angry' for each page name and each candidate: totals.

Then use groupby on totals by the first index level (which is "page name") and transform sum function so that you get the sum for each page name transformed for totals and divide totals by it to find the percentages.

Finally join the two DataFrames for the final outcome.

totals = df.groupby(['Page Name','candidato'])[['Total Interactions','Likes','Comments','Shares','Love','Angry']].sum()
percentages = totals.groupby(level=0).transform('sum').rdiv(totals).mul(100).round(2)

out = totals.join(percentages, lsuffix='', rsuffix='_Percentages').reset_index()

This produces a DataFrame that can produce the plot in the question.

CodePudding user response：

After getting the data via:

!wget https://raw.githubusercontent.com/JorgeMiguelGomes/LEG2022_MediaMonitor/main/legislativas_2022_media_monitor_29jan2022/data_products/legislativas_2022_all_candidates_filtered.csv

and reading in csv to dataframe:

df = pd.read_csv("legislativas_2022_all_candidates_filtered.csv")
df = df.drop(columns=["Unnamed: 0"])

You can groupby Page Name and sum the values in the columns you mentioned:

col_list = ["Total Interactions","Likes","Shares","Comments","Love","Angry"]
totals = df.groupby("Page Name")[col_list].sum()

The first 3 lines will look like:

              Total Interactions  Likes  Shares  Comments  Love  Angry
Page Name                                                             
Agência Lusa                4048   1464     285      1011    79    127
CM TV                      21813   7934    1050      7824   401    627
CNN Portugal              216437  59776    8755    107879  3640   5198

To get percentages, you can do:

percentages = totals/df[col_list].sum()*100

Sample from result:

              Total Interactions      Likes  ...       Love      Angry
Page Name                                    ...                      
Agência Lusa            0.324301   0.403282  ...   0.408945   0.371085
CM TV                   1.747521   2.185548  ...   2.075784   1.832048
CNN Portugal           17.339582  16.466265  ...  18.842530  15.188172