I have a very large pandas dataframe with two columns, A and B. For each row containing values a and b in columns A and B respectively, I'd like to find another row with values a' and b' so that the absolute difference between a and a' is as small as possible. I would like to create two new columns: a column containing the "distance" between the two rows (i.e., abs(a - a')), and a column containing b'.
Here are a couple of exmaples. Let's say we have the following dataframe:
df = pd.DataFrame({'A' : [1, 5, 7, 2, 3, 4], 'B' : [5, 2, 7, 5, 1, 9]})
The first row has (a, b) = (1, 5). The two new columns for
this row would contain the values 1 and 5. Why? Because the closest value to a = 1 is a' = 2, which occurs in the fourth row. The value of b' in that row is 5.
The second row has (a, b) = (5, 2). The two new columns for this row would contain the values 1 and 9. The closest value to a = 5 is a' = 4, which occurs in the last row. The corresponding value of b' in that row is 9.
If the value of a' that minimizes (a - a') isn't unique, ties can be broken arbitrarily (or you can keep all entries).
I believe I need to use the pandas.merge_asof function, which allows for approximate joining. I also think that I need to set merge_asof function's direction keyword argument to nearest, which will allow selecting the closest (in absolute distance) to the left dataframe's key.
I've read the entire documentation (with examples) for pandas.merge_asof, but forming the correct query is a little bit tricky for me.
CodePudding user response:
Use merge_asof with allow_exact_matches=False and direction='nearest' parameters, last for A1 subtract A column with absolute values:
df1 = df.sort_values('A')
df = pd.merge_asof(df1,
df1.rename(columns={'A':'A1', 'B':'B1'}),
left_on='A',
right_on='A1',
allow_exact_matches=False,
direction='nearest')
df['A1'] = df['A1'].sub(df['A']).abs()
print (df)
A B A1 B1
0 1 5 1 5
1 2 5 1 5
2 3 1 1 5
3 4 9 1 1
4 5 2 1 9
5 7 7 2 2
