Maybe a bit of beginner's question, but my mind is really stuck.
I have a dataframe with certain values in a column called x, split into two groups.
x group
1 1.7 a
2 0 b
3 2.3 b
4 2.7 b
5 8.6 a
6 5.4 b
7 4.2 a
8 5.7 b
My purpose is for each row, to count how many rows of the other group have a value greater than the current one. So to make it more clear, for the first row (group a) I am looking to find how many rows of group b are greater than 1.7 (the answer is 4). The end result should look like :
x group result
1 1.7 a 4
2 0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
I have several rows in the dataframe, so ideally I would also like a relatively fast solution.
CodePudding user response:
Use np.searchsorted:
df['result'] = 0
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
df.loc[a.index, 'result'] = len(b) - np.searchsorted(np.sort(b), a)
df.loc[b.index, 'result'] = len(a) - np.searchsorted(np.sort(a), b)
Output:
>>> df
x group result
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
Performance for 130K records
>>> %%timeit
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
len(b) - np.searchsorted(np.sort(b), a)
len(a) - np.searchsorted(np.sort(a), b)
31.8 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Setup:
N = 130000
df = pd.DataFrame({'x': np.random.randint(1, 1000, N),
'group': np.random.choice(['a', 'b'], N, p=(0.7, 0.3))})
CodePudding user response:
Here is one way. Based on ranking in descending order the x values per group, and merge_asof df with itself, after exchanging the group name to merge a with the ranked values in b, vice versa.
# needed for the merge_asof
df = df.sort_values('x')
res = (
pd.merge_asof(
df.reset_index(), # to keep original index order
df.assign(
# to compare a with b in the merge
group = df['group'].map({'a':'b', 'b':'a'}),
# rank descending to get the number of number above current number
result = df.groupby('group')['x'].rank(ascending=False)),
by='group', # same group first, knowing you exchange groups in second df
on='x', direction='forward') # look forward on x to get the rank
# complete the result column
.fillna({'result':0})
.astype({'result':int})
# for cosmetic
.set_index('index')
.rename_axis(None)
.sort_index()
)
print(res)
# x group result
# 1 1.7 a 4
# 2 0.0 b 3
# 3 2.3 b 2
# 4 2.7 b 2
# 5 8.6 a 0
# 6 5.4 b 1
# 7 4.2 a 2
# 8 5.7 b 1
CodePudding user response:
You can sort the values and use masks to cumsum per the other group:
df2 = df.sort_values(by='x', ascending=False)
m = df2['group'].eq('a')
df['result'] = m.cumsum().mask(m).fillna((~m).cumsum().where(m)).astype(int)
Output:
x group result
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
CodePudding user response:
A quick solution is to use pandas' DataFrame.apply method.
df['result'] = df.apply(lambda row: df[(df['group'] != row['group']) & (df['x'] > row['x'])].x.count(), axis=1)
CodePudding user response:
This should be pretty efficient, just one sort of all x and then just calculating cumsums
df2 = df.sort_values('x', ascending=False).reset_index()
df2['acount'] = (df['group'] == 'a').cumsum()
df2['bcount'] = (df['group'] == 'b').cumsum()
df2 = df2.fillna(0)
df2
at this point df2 looks like this:
index x group acount bcount
0 5 8.6 a 0.0 0.0
1 8 5.7 b 1.0 0.0
2 6 5.4 b 1.0 1.0
3 7 4.2 a 1.0 2.0
4 4 2.7 b 1.0 3.0
5 3 2.3 b 2.0 3.0
6 1 1.7 a 2.0 4.0
7 2 0.0 b 3.0 4.0
now restore the index and choose acount or bcount depending on the group:
df2 = df2.set_index('index').sort_index()
df2['result'] = np.where(df['group']=='a', df2['bcount'],df2['acount']).astype(int)
df2[['x','result']]
final result
x group result
index
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 1
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
performance (on the same 130000 row test as @Corralien, not the same hardware obv)
65.4 ms ± 957 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
Not too different from Corralien's solution, but you can use broadcasting to do a check of all elements in group 'a' against all elements in group 'b' and count how many satisfy the condition. Then join the result back.
import pandas as pd
import numpy as np
a = df.loc[df['group'] == 'a', 'x']
b = df.loc[df['group'] == 'b', 'x']
result = pd.concat([
pd.Series(np.sum(a.to_numpy() < b.to_numpy()[:, None], axis=0), index=a.index),
pd.Series(np.sum(b.to_numpy() < a.to_numpy()[:, None], axis=0), index=b.index)])
df['result'] = result
x group result
1 1.7 a 4
2 0.0 b 3
3 2.3 b 2
4 2.7 b 2
5 8.6 a 0
6 5.4 b 1
7 4.2 a 2
8 5.7 b 1
