I have some data in my df df that shows the 2 categories a user belongs to. For which I want to see the number of users for each category pair expressed as a %total of the row.
Original dataframe df:
------ ------ --------
| cat1 | cat2 | user |
------ ------ --------
| A | X | 687568 |
| A | Y | 68575 |
| B | Y | 56478 |
| A | X | 6587 |
| A | Y | 45678 |
| B | X | 5678 |
| B | X | 967 |
| A | X | 345 |
------ ------ --------
I convert this to a groupby df using:
df2 = df.groupby(['cat1', 'cat2']).agg({'user': 'nunique'}).reset_index().pivot(index='cat1', columns='cat2',values='user')to get the pairwise calculation of number of users per combination of categories (numbers here are made up):
------ ---- ----
| cat2 | X | Y |
------ ---- ----
| cat1 | | |
------ ---- ----
| A | 5 | 5 |
| B | 10 | 40 |
------ ---- ----
And I would like to convert the numbers to percent totals of the rows (Cat1), e.g. for the first row, 5/(5 5) = 0.5 and so on to give:
------ ----- -----
| cat2 | X | Y |
------ ----- -----
| cat1 | | |
| A | 0.5 | 0.5 |
| B | 0.2 | 0.8 |
------ ----- -----
Would I have to create a new column in my grouped df that contains the row-wise sum, and then iterate through each value in a row and divide it by that total?
CodePudding user response:
You can simplify your expression:
piv = df.pivot_table('user', 'cat1', 'cat2', aggfunc='nunique')
pct = piv.div(piv.sum(axis=1), axis=0)
Output:
>>> piv
cat2 X Y
cat1
A 3 2
B 2 1
>>> pct
cat2 X Y
cat1
A 0.600000 0.400000
B 0.666667 0.333333
