Summarize multiple survey questions responses in a single table with percentages-CodePudding

I am a pandas novice working with survey data and would like to summarize % of positive responses to multiple questions in a single table. I have looked at groupby and pivot but still stuck.

Answers to survey questions are either 1 (yes) or 0 (no) I would like to:

group responses by age group
have the cells in the table represent the % of respondents for an age group that answer 1 (yes)

Sample data:

import pandas as pd

data = {'AgeBand':['25 to 34', '25 to 34', '35 to 44', '65 to 74'],'LikesJam':[1,0,0,1],'LikesPeanutButter':[0,0,0,1]}

df = pd.DataFrame(data)

x=df.groupby(['AgeBand'])[['LikesJam','LikesPeanutButter']].sum()
print(x)
     
          LikesJam  LikesPeanutButter
AgeBand                              
25 to 34         1                  0
35 to 44         0                  0
65 to 74         1                  1

Desired Output:

          LikesJam  LikesPeanutButter
AgeBand                              
25 to 34        50%                 0%
35 to 44         0%                 0%
65 to 74       100%               100%

My goal is to easily compare age groups across several questions - if there is a better way to approach the problem I welcome suggestions. Thanks!

CodePudding user response：

You're close with the groupby! There's no percentage function though, so you have to specify the transformation you want to apply to each group using agg:

>>> df.groupby("AgeBand").agg(lambda x: f"{int(x.sum() / x.count() * 100)}%")

         LikesJam LikesPeanutButter
AgeBand
25 to 34      50%                0%
35 to 44       0%                0%
65 to 74     100%              100%

Alternatively, you can create a function so you don't have to inline your transformation using lambda:

>>> def group_pct(g):
...     v = g.sum() / g.count()
...     return str(int(v * 100))   "%"
... 
>>> df.groupby("AgeBand").agg(group_pct)
         LikesJam LikesPeanutButter
AgeBand
25 to 34      50%                0%
35 to 44       0%                0%
65 to 74     100%              100%

CodePudding user response：

The % of respondents is the mean:

x = df.groupby('AgeBand')[['LikesJam', 'LikesPeanutButter']].mean()

          LikesJam  LikesPeanutButter
AgeBand                              
25 to 34       0.5                0.0
35 to 44       0.0                0.0
65 to 74       1.0                1.0

To format the values applymap can be used to apply a format string to each value in the resulting DataFrame:

x = (
    df.groupby('AgeBand')[['LikesJam', 'LikesPeanutButter']].mean()
        .applymap('{:.0%}'.format)
)

         LikesJam LikesPeanutButter
AgeBand                            
25 to 34      50%                0%
35 to 44       0%                0%
65 to 74     100%              100%

*Note: the data type has been modified from number (float) to string (object) and can no longer be used for mathematical operations.

Alternatively, an option context can be used to change the way that the values are displayed without changing the actual numeric values in the DataFrame:

x = df.groupby('AgeBand')[['LikesJam', 'LikesPeanutButter']].mean()

with pd.option_context('display.float_format', '{:.0%}'.format):
    print(x)

          LikesJam  LikesPeanutButter
AgeBand                              
25 to 34       50%                 0%
35 to 44        0%                 0%
65 to 74      100%               100%

*Note: only the display has been changed, the underlying values are still float and can still be used in any mathematical computations. Printing (or otherwise displaying) the DataFrame outside the option_context will still show the numeric values.