I have the following dataframe:
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
I am trying to do a statistic that contains the number of questions answered by each group (it doesn't matter how many times), the total number of questions, the number of questions answered by both groups and the ratio between mutual questions and total questions:
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions Ratio
g1 g2 3 1 4 1 0.25
What I did was actually counting unique values "manually" with groupby, nunique etc, because it's simple when the group column has only two distinct values. However, is it possible to compare pairs of two groups and build dataframes with one row like above if my dataframe has many groups? Example:
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
f g3 True abc
f g3 True www
g g3 False xyz
h g4 True www
h g4 True qqq
i g4 False xyz
So I would want to compare g1 with g2, g1 with g3, g1 with g4, g2 with g3 and g3 with g4.
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions Ratio
g1 g2 3 1 4 1 0.25
g1 g3 3 3 4 2 0.50
g1 g4 3 3 4 1 0.25
g2 g3 1 3 4 1 0.25
g2 g4 1 3 4 1 0.25
g3 g4 3 3 4 2 0.50
CodePudding user response:
Here's an attempt. First, find the set of questions answered for each group:
>>> g = df.groupby('group')['question'].apply(set)
>>> g
group
g1 {xyz, abc, def}
g2 {xyz}
g3 {xyz, abc, www}
g4 {qqq, xyz, www}
Name: question, dtype: object
Then, use itertools to get the combinations you want, computing the indices for each desired column:
import itertools
g = df.groupby('group')['question'].apply(set) # as above
rows = []
total = len(df.question.unique())
for a, b in itertools.combinations(g.index, 2):
rows.append({'Team1':a,
'Team2':b,
'NumberQuestionsT1':len(g[a]),
'NumberQuestionsT2':len(g[b]),
'AllQuestions': total,
'CommonQuestions': len(g[a] & g[b]),
'Ratio': len(g[a] & g[b]) / total})
output = pd.DataFrame(rows)
Giving:
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions \
0 g1 g2 3 1 5
1 g1 g3 3 3 5
2 g1 g4 3 3 5
3 g2 g3 1 3 5
4 g2 g4 1 3 5
5 g3 g4 3 3 5
CommonQuestions Ratio
0 1 0.2
1 2 0.4
2 1 0.2
3 1 0.2
4 1 0.2
5 2 0.4
I've replace the AllQuestions calculation with the one you commented. Now, it seems like the difference is that your example input has 5 question types (abc, def, qqq, www, xyz)?
I couldn't think of a way to do this without itertools, and I am also using a for loop for the initial determination of every set of questions. So I reckon there could be a significantly faster method, but this may be sufficient given the size of your data.
