I have a dataset with the following structure
| index | candidato | Page Name | Post Created Date | Total Interactions | Likes | Shares | Comments | Love | Angry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | António Costa | Observador | 2022-01-03 | 4500 | 340 | 400 | 433 | 545 | 565 |
There are 9 different candidato (candidates) and 27 different Page Name
What is the best way to achieve this with Pandas? Thank you in advance for your help.
Disclosure This question is to help in a non-for-profit project that analyzes media behaviour, and bias, towards Portuguese candidates to the 2022 general elections. The prior report was made using Google Sheets but analyzing the datasets with Python is the best way, since I plan on doing this every 3 months.
The GitHub repo can be found here, where you can access all datasets and code used.
CodePudding user response:
You can groupby 'Page Name' and 'candidato' then find the sum of each of 'Total Interactions','Likes','Comments','Shares','Love','Angry' for each page name and each candidate: totals.
Then use groupby on totals by the first index level (which is "page name") and transform sum function so that you get the sum for each page name transformed for totals and divide totals by it to find the percentages.
Finally join the two DataFrames for the final outcome.
totals = df.groupby(['Page Name','candidato'])[['Total Interactions','Likes','Comments','Shares','Love','Angry']].sum()
percentages = totals.groupby(level=0).transform('sum').rdiv(totals).mul(100).round(2)
out = totals.join(percentages, lsuffix='', rsuffix='_Percentages').reset_index()
This produces a DataFrame that can produce the plot in the question.
CodePudding user response:
After getting the data via:
!wget https://raw.githubusercontent.com/JorgeMiguelGomes/LEG2022_MediaMonitor/main/legislativas_2022_media_monitor_29jan2022/data_products/legislativas_2022_all_candidates_filtered.csv
and reading in csv to dataframe:
df = pd.read_csv("legislativas_2022_all_candidates_filtered.csv")
df = df.drop(columns=["Unnamed: 0"])
You can groupby Page Name and sum the values in the columns you mentioned:
col_list = ["Total Interactions","Likes","Shares","Comments","Love","Angry"]
totals = df.groupby("Page Name")[col_list].sum()
The first 3 lines will look like:
Total Interactions Likes Shares Comments Love Angry
Page Name
Agência Lusa 4048 1464 285 1011 79 127
CM TV 21813 7934 1050 7824 401 627
CNN Portugal 216437 59776 8755 107879 3640 5198
To get percentages, you can do:
percentages = totals/df[col_list].sum()*100
Sample from result:
Total Interactions Likes ... Love Angry
Page Name ...
Agência Lusa 0.324301 0.403282 ... 0.408945 0.371085
CM TV 1.747521 2.185548 ... 2.075784 1.832048
CNN Portugal 17.339582 16.466265 ... 18.842530 15.188172

