Conditional count of number of rows with value greater than the current one-CodePudding

Maybe a bit of beginner's question, but my mind is really stuck.

I have a dataframe with certain values in a column called x, split into two groups.

   x     group
1  1.7   a
2  0     b
3  2.3   b
4  2.7   b
5  8.6   a
6  5.4   b
7  4.2   a
8  5.7   b

My purpose is for each row, to count how many rows of the other group have a value greater than the current one. So to make it more clear, for the first row (group a) I am looking to find how many rows of group b are greater than 1.7 (the answer is 4). The end result should look like :

   x     group   result
1  1.7   a       4
2  0     b       3
3  2.3   b       2
4  2.7   b       2
5  8.6   a       0
6  5.4   b       1
7  4.2   a       2
8  5.7   b       1

I have several rows in the dataframe, so ideally I would also like a relatively fast solution.

CodePudding user response：

Use np.searchsorted:

df['result'] = 0

a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']

df.loc[a.index, 'result'] = len(b) - np.searchsorted(np.sort(b), a)
df.loc[b.index, 'result'] = len(a) - np.searchsorted(np.sort(a), b)

Output:

>>> df
     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1

Performance for 130K records

>>> %%timeit
    a = df.loc[df['group'] == 'a', 'x']
    b = df.loc[df['group'] == 'b', 'x']
    len(b) - np.searchsorted(np.sort(b), a)
    len(a) - np.searchsorted(np.sort(a), b)

31.8 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Setup:

N = 130000
df = pd.DataFrame({'x': np.random.randint(1, 1000, N),
                   'group': np.random.choice(['a', 'b'], N, p=(0.7, 0.3))})

CodePudding user response：

Here is one way. Based on ranking in descending order the x values per group, and merge_asof df with itself, after exchanging the group name to merge a with the ranked values in b, vice versa.

# needed for the merge_asof
df = df.sort_values('x')

res = (
    pd.merge_asof(
        df.reset_index(), # to keep original index order
        df.assign(
            # to compare a with b in the merge
            group = df['group'].map({'a':'b', 'b':'a'}), 
            # rank descending to get the number of number above current number
            result = df.groupby('group')['x'].rank(ascending=False)),
        by='group', # same group first, knowing you exchange groups in second df
        on='x', direction='forward') # look forward on x to get the rank
      # complete the result column
      .fillna({'result':0})
      .astype({'result':int})
      # for cosmetic
      .set_index('index')
      .rename_axis(None)
      .sort_index()
)
print(res)
#      x group  result
# 1  1.7     a       4
# 2  0.0     b       3
# 3  2.3     b       2
# 4  2.7     b       2
# 5  8.6     a       0
# 6  5.4     b       1
# 7  4.2     a       2
# 8  5.7     b       1

CodePudding user response：

You can sort the values and use masks to cumsum per the other group:

df2 = df.sort_values(by='x', ascending=False)
m = df2['group'].eq('a')
df['result'] = m.cumsum().mask(m).fillna((~m).cumsum().where(m)).astype(int)

Output:

     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1

CodePudding user response：

A quick solution is to use pandas' DataFrame.apply method.

df['result'] = df.apply(lambda row: df[(df['group'] != row['group']) & (df['x'] > row['x'])].x.count(), axis=1)

CodePudding user response：

This should be pretty efficient, just one sort of all x and then just calculating cumsums

df2 = df.sort_values('x', ascending=False).reset_index()
df2['acount'] = (df['group'] == 'a').cumsum()
df2['bcount'] = (df['group'] == 'b').cumsum()
df2 = df2.fillna(0)
df2

at this point df2 looks like this:

    index   x   group   acount  bcount
0   5       8.6 a       0.0     0.0
1   8       5.7 b       1.0     0.0
2   6       5.4 b       1.0     1.0
3   7       4.2 a       1.0     2.0
4   4       2.7 b       1.0     3.0
5   3       2.3 b       2.0     3.0
6   1       1.7 a       2.0     4.0
7   2       0.0 b       3.0     4.0

now restore the index and choose acount or bcount depending on the group:

df2 = df2.set_index('index').sort_index()
df2['result'] = np.where(df['group']=='a', df2['bcount'],df2['acount']).astype(int)
df2[['x','result']]

final result


    x   group   result
index           
1   1.7 a       4
2   0.0 b       3
3   2.3 b       2
4   2.7 b       1
5   8.6 a       0
6   5.4 b       1
7   4.2 a       2
8   5.7 b       1

performance (on the same 130000 row test as @Corralien, not the same hardware obv)

65.4 ms ± 957 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response：

Not too different from Corralien's solution, but you can use broadcasting to do a check of all elements in group 'a' against all elements in group 'b' and count how many satisfy the condition. Then join the result back.

import pandas as pd
import numpy as np

a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']

result = pd.concat([
            pd.Series(np.sum(a.to_numpy() < b.to_numpy()[:, None], axis=0), index=a.index),
            pd.Series(np.sum(b.to_numpy() < a.to_numpy()[:, None], axis=0), index=b.index)])

df['result'] = result

     x group  result
1  1.7     a       4
2  0.0     b       3
3  2.3     b       2
4  2.7     b       2
5  8.6     a       0
6  5.4     b       1
7  4.2     a       2
8  5.7     b       1