I have seen a lot of similar questions, but none seem to work for my case. I'm pretty sure this is just a groupby transform, but I keep getting a KeyError along with axis issues. I am trying to group by filename and count the rows where pred != gt.
For example, index 2 is the only mismatch for f1.wav, so its count is 1, and indices (13, 14, 18) mismatch for f2.wav, so its count is 3.
df = pd.DataFrame([{'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f2.wav'}])
pred gt filename
0 0 0 f1.wav
1 0 0 f1.wav
2 2 0 f1.wav
3 0 0 f1.wav
4 0 0 f1.wav
5 0 0 f1.wav
6 0 0 f1.wav
7 0 0 f1.wav
8 0 0 f1.wav
9 0 0 f1.wav
10 0 0 f2.wav
Expected output
pred gt filename counts
0 0 0 f1.wav 1
1 0 0 f1.wav 1
2 2 0 f1.wav 1
3 0 0 f1.wav 1
4 0 0 f1.wav 1
5 0 0 f1.wav 1
6 0 0 f1.wav 1
7 0 0 f1.wav 1
8 0 0 f1.wav 1
9 0 0 f1.wav 1
10 0 0 f2.wav 3
11 0 0 f2.wav 3
12 2 2 f2.wav 3
13 0 2 f2.wav 3
14 0 2 f2.wav 3
15 0 0 f2.wav 3
16 0 0 f2.wav 3
17 2 2 f2.wav 3
18 0 2 f2.wav 3
19 2 0 f2.wav 3
I was thinking
df.groupby('filename').transform(lambda x: x['pred'].ne(x['gt']).sum(), axis=1)
but I get TypeError: Transform function invalid for data types
CodePudding user response:
.transform operates on each column individually, so you won't be able to access both 'pred' and 'gt' in a transform operation.
This leaves you with 2 options:
- aggregate and reindex or join back to the original shape
- pre-compute the boolean array and .transform on that
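For comparison, the first option (aggregate, then map back to the original shape) might look roughly like the sketch below. The small frame and the name per_file are stand-ins made up for illustration, not the question's full data.

```python
import pandas as pd

# Small stand-in for the question's frame (same columns, fewer rows).
df = pd.DataFrame({
    'pred':     [0, 2, 0, 0, 2, 2],
    'gt':       [0, 0, 0, 2, 2, 0],
    'filename': ['f1.wav', 'f1.wav', 'f1.wav', 'f2.wav', 'f2.wav', 'f2.wav'],
})

# Step 1: aggregate to one mismatch count per filename.
per_file = (df['pred'] != df['gt']).groupby(df['filename']).sum()

# Step 2: broadcast the per-file count back onto every row.
df['counts'] = df['filename'].map(per_file)

print(df)
```

This takes two steps where the transform version takes one, but the intermediate per_file Series can be handy if the per-file totals are needed on their own.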
Approach 2 will probably be the fastest here:
df['counts'] = (
(df['pred'] != df['gt'])
.groupby(df['filename']).transform('sum')
)
print(df)
pred gt filename counts
0 0 0 f1.wav 1
1 0 0 f1.wav 1
2 2 0 f1.wav 1
3 0 0 f1.wav 1
4 0 0 f1.wav 1
5 0 0 f1.wav 1
6 0 0 f1.wav 1
7 0 0 f1.wav 1
8 0 0 f1.wav 1
9 0 0 f1.wav 1
10 0 0 f2.wav 4
11 0 0 f2.wav 4
12 2 2 f2.wav 4
13 0 2 f2.wav 4
14 0 2 f2.wav 4
15 0 0 f2.wav 4
16 0 0 f2.wav 4
17 2 2 f2.wav 4
18 0 2 f2.wav 4
19 2 0 f2.wav 4
Note that f2.wav has 4 instances where 'pred' != 'gt' (index 13, 14, 18, 19), not 3 as in the expected output.
CodePudding user response:
Assuming df is the dataframe OP shares in the question, one can group by filename and keep only the rows where pred != gt by using pandas.DataFrame.groupby and the resulting groupby object's apply method as follows
df2 = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']])
[Out]:
pred gt filename
filename
f1.wav 2 2 0 f1.wav
f2.wav 13 0 2 f2.wav
14 0 2 f2.wav
18 0 2 f2.wav
19 2 0 f2.wav
Suppose one wants to number the occurrences within each filename and store the result in a column named count. After the previous operation, filename is both an index level and a column label, so grouping by the name 'filename' is ambiguous. Grouping by index level instead (level is one of the parameters groupby accepts) avoids this, and pandas.core.groupby.GroupBy.cumcount then numbers the rows within each group. (Note: unlike the accepted answer, this approach counts sequentially within each group instead of repeating the group total on every row.)
df2['count'] = df2.groupby(level=0).cumcount() + 1 # The + 1 makes the count start at 1 instead of 0.
[Out]:
pred gt filename count
filename
f1.wav 2 2 0 f1.wav 1
f2.wav 13 0 2 f2.wav 1
14 0 2 f2.wav 2
18 0 2 f2.wav 3
19 2 0 f2.wav 4
A one-liner would look like the following
df2['count'] = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']]).groupby(level=0).cumcount() + 1 # df2 must already exist from the first step
[Out]:
pred gt filename count
filename
f1.wav 2 2 0 f1.wav 1
f2.wav 13 0 2 f2.wav 1
14 0 2 f2.wav 2
18 0 2 f2.wav 3
19 2 0 f2.wav 4
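To make the difference between sequential numbering and a per-group total concrete, here is a minimal sketch. The frame mismatches and the column names running and total are hypothetical, standing in for rows that have already been filtered to pred != gt.

```python
import pandas as pd

# Hypothetical stand-in for the mismatched rows (pred != gt) only.
mismatches = pd.DataFrame({
    'pred':     [2, 0, 0, 2],
    'gt':       [0, 2, 2, 0],
    'filename': ['f1.wav', 'f2.wav', 'f2.wav', 'f2.wav'],
})

# Running number of each mismatch within its file: 1, 2, 3, ...
mismatches['running'] = mismatches.groupby('filename').cumcount() + 1

# Total number of mismatches in the file, repeated on every row.
mismatches['total'] = mismatches.groupby('filename')['pred'].transform('size')

print(mismatches)
```

The running column is what cumcount produces, while the total column corresponds to the kind of output the question asks for.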
If having the count in a separate column is not a requirement, then, taking df2 as the dataframe created in the first operation of this answer, one can simply use the following, which gives a per-file overview
df3 = df2.groupby(level=0).count().iloc[:, 0]
[Out]:
filename
f1.wav 1
f2.wav 4
Name: pred, dtype: int64
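One thing to keep in mind with any filter-then-count approach is that a file with no mismatches drops out of the result entirely. The sketch below, on a made-up frame with an extra hypothetical file f3.wav that has no mismatches, contrasts that with summing the boolean mask per file, which keeps such files with a count of 0.

```python
import pandas as pd

# Made-up frame; f3.wav is a hypothetical file with no mismatches.
df = pd.DataFrame({
    'pred':     [0, 2, 0, 0, 2, 2, 0],
    'gt':       [0, 0, 0, 2, 2, 0, 0],
    'filename': ['f1.wav', 'f1.wav', 'f1.wav',
                 'f2.wav', 'f2.wav', 'f2.wav', 'f3.wav'],
})

mismatch = df['pred'] != df['gt']

# Filter first, then count: files without mismatches disappear.
filtered_totals = df[mismatch].groupby('filename').size()

# Sum the boolean mask per file: files without mismatches show up as 0.
all_totals = mismatch.groupby(df['filename']).sum()

print(filtered_totals)
print(all_totals)
```

Which behaviour is preferable depends on whether files with zero mismatches should appear in the summary.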
