I have a couple of decent-sized dataframes that look like this:
df_B
id start_time end_time side cost
1234 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 BUY 100
1564 2021-01-01 16:05:00.100000 2021-01-01 16:10:00.100000 BUY 111
7535 2021-01-01 16:40:00.100000 2021-01-01 16:55:00.100000 BUY 124
9999 2021-01-01 16:44:00.100000 2021-01-01 16:45:00.100000 BUY 128
df_S
id start_time end_time side cost
5366 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 SELL 100
4533 2021-01-01 16:05:00.100000 2021-01-01 16:08:00.100000 SELL 105
4532 2021-01-01 16:20:00.100000 2021-01-01 16:50:00.100000 SELL 122
5827 2021-01-01 16:30:00.100000 2021-01-01 16:35:00.100000 SELL 123
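I would like to create a new dataframe such that, for each id in df_B, the matching df_S row is appended if df_S.cost <= df_B.cost & df_S.start_time <= df_B.end_time. Rows of df_B without a match should be kept, with the df_S columns left empty.
E.g., desired output: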
id start_time end_time side cost id_S start_time_S end_time_S side_S cost_S
1234 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 BUY 100 5366 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 SELL 100
1564 2021-01-01 16:05:00.100000 2021-01-01 16:10:00.100000 BUY 111 4533 2021-01-01 16:05:00.100000 2021-01-01 16:08:00.100000 SELL 105
7535 2021-01-01 16:40:00.100000 2021-01-01 16:55:00.100000 BUY 124
9999 2021-01-01 16:44:00.100000 2021-01-01 16:45:00.100000 BUY 128
Could you please advise how I can write this efficiently for a large dataframe?
CodePudding user response:
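You can create a boolean mask and then merge the two frames on their index.
The comparison is made row by row, so both frames need the same default integer index. If you don't have a meaningful index, or you are using id as the index, call reset_index() on each frame first.
Create the mask. Each comparison needs its own parentheses, because & binds more tightly than <=:
mask = (df_S.cost <= df_B.cost) & (df_S.start_time <= df_B.end_time)
Then merge the masked frames on the index:
pd.merge(df_B[mask], df_S[mask], suffixes=('_B', '_S'), how='left', left_index=True, right_index=True)
Unfortunately, this method drops every row for which the conditions are not met, so unmatched rows from either dataframe do not appear in the output.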
The result should be:
id start_time end_time side cost id_S start_time_S end_time_S side_S cost_S
1234 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 BUY 100 5366 2021-01-01 16:00:00.100000 2021-01-01 16:02:00.100000 SELL 100
1564 2021-01-01 16:05:00.100000 2021-01-01 16:10:00.100000 BUY 111 4533 2021-01-01 16:05:00.100000 2021-01-01 16:08:00.100000 SELL 105
CodePudding user response:
Judging by the match for the record with id 1564, I believe you want a 1:1 matching, since other records satisfy the matching conditions for it as well.
The closest existing function to what you are looking for is, I believe, pd.merge_asof.
However, it only allows merging based on one condition. It is not implemented for merging based on two conditions, and I believe this is for good reason.
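To illustrate the single-condition limit, here is a minimal sketch that uses pd.merge_asof for the time condition only and then checks the cost condition afterwards. The sample rows are copied from the question; the cost_ok column is a name made up for this example. Note the caveat: merge_asof picks one candidate per BUY row (the latest SELL start_time), so a row that fails the cost check may still have had an earlier candidate that was never considered.

```python
import pandas as pd

# Sample rows copied from the question.
df_B = pd.DataFrame({
    "id": [1234, 1564, 7535, 9999],
    "start_time": pd.to_datetime(["2021-01-01 16:00:00.100", "2021-01-01 16:05:00.100",
                                  "2021-01-01 16:40:00.100", "2021-01-01 16:44:00.100"]),
    "end_time": pd.to_datetime(["2021-01-01 16:02:00.100", "2021-01-01 16:10:00.100",
                                "2021-01-01 16:55:00.100", "2021-01-01 16:45:00.100"]),
    "side": "BUY",
    "cost": [100, 111, 124, 128],
})
df_S = pd.DataFrame({
    "id": [5366, 4533, 4532, 5827],
    "start_time": pd.to_datetime(["2021-01-01 16:00:00.100", "2021-01-01 16:05:00.100",
                                  "2021-01-01 16:20:00.100", "2021-01-01 16:30:00.100"]),
    "end_time": pd.to_datetime(["2021-01-01 16:02:00.100", "2021-01-01 16:08:00.100",
                                "2021-01-01 16:50:00.100", "2021-01-01 16:35:00.100"]),
    "side": "SELL",
    "cost": [100, 105, 122, 123],
})

# merge_asof requires both frames to be sorted by their key column.
left = df_B.sort_values("end_time")
right = df_S.add_suffix("_S").sort_values("start_time_S")

# direction="backward": for each BUY row, take the last SELL row whose
# start_time_S is <= the BUY row's end_time (condition 2 only).
matched = pd.merge_asof(
    left, right,
    left_on="end_time", right_on="start_time_S",
    direction="backward",
)

# Condition 1 (cost) is not part of the merge; it can only be checked afterwards.
# cost_ok is a hypothetical column name used for this sketch.
matched["cost_ok"] = matched["cost_S"] <= matched["cost"]
print(matched[["id", "cost", "id_S", "cost_S", "cost_ok"]])
```

The result is ordered by end_time, because that is the order merge_asof received the left frame in.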
For example, why does your expected result match 4533 to 1564, but not 5366 instead, given that both satisfy the merge condition? In this case, you may argue that 1564's values for start_time and cost are both closer to the respective values of 4533 compared to 5366, i.e., 1564 is a "better match".
But, in general, what rule would you want to impose in case the matching condition is satisfied for two candidate records, say A and B, where, for example, record A is a "better match" with respect to condition 1 and record B with respect to 2? You would have to provide more information/rules to write an algorithm.
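As a sketch of what such a rule could look like, the example below builds every BUY/SELL pair, keeps the pairs that meet both conditions, and then applies one assumed tie-break: prefer the highest SELL cost, then the latest SELL start_time. That rule is only an illustration, not something the question specifies, and it does not stop one SELL row from being assigned to several BUY rows. The cross join is also quadratic in size, so it suits moderate data only.

```python
import pandas as pd

# Sample rows copied from the question.
df_B = pd.DataFrame({
    "id": [1234, 1564, 7535, 9999],
    "start_time": pd.to_datetime(["2021-01-01 16:00:00.100", "2021-01-01 16:05:00.100",
                                  "2021-01-01 16:40:00.100", "2021-01-01 16:44:00.100"]),
    "end_time": pd.to_datetime(["2021-01-01 16:02:00.100", "2021-01-01 16:10:00.100",
                                "2021-01-01 16:55:00.100", "2021-01-01 16:45:00.100"]),
    "side": "BUY",
    "cost": [100, 111, 124, 128],
})
df_S = pd.DataFrame({
    "id": [5366, 4533, 4532, 5827],
    "start_time": pd.to_datetime(["2021-01-01 16:00:00.100", "2021-01-01 16:05:00.100",
                                  "2021-01-01 16:20:00.100", "2021-01-01 16:30:00.100"]),
    "end_time": pd.to_datetime(["2021-01-01 16:02:00.100", "2021-01-01 16:08:00.100",
                                "2021-01-01 16:50:00.100", "2021-01-01 16:35:00.100"]),
    "side": "SELL",
    "cost": [100, 105, 122, 123],
})

# 1. Every BUY x SELL pair (how="cross" needs pandas >= 1.2),
#    filtered down to the pairs that satisfy both conditions.
pairs = df_B.merge(df_S.add_suffix("_S"), how="cross")
pairs = pairs[(pairs["cost_S"] <= pairs["cost"])
              & (pairs["start_time_S"] <= pairs["end_time"])]

# 2. Assumed tie-break: highest SELL cost first, then latest SELL start_time.
#    After sorting, the first row per BUY id is the chosen candidate.
best = (pairs.sort_values(["cost_S", "start_time_S"], ascending=False)
             .drop_duplicates("id"))

# 3. Left merge back onto df_B so BUY rows without any candidate are kept.
s_cols = ["id"] + [c for c in best.columns if c.endswith("_S")]
result = df_B.merge(best[s_cols], on="id", how="left")
print(result[["id", "cost", "id_S", "cost_S"]])
```

With the conditions exactly as stated in the question, every BUY row in the sample has at least one candidate, which is another sign that the expected output relies on a rule that has not been spelled out.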
If you want 1:m matching and you have a decent data size, consider SQLite; I find it to be more appropriate for such problems.
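A minimal sketch of the SQLite route, using Python's built-in sqlite3 module with an in-memory database. The table names buys and sells are made up for this example, and the timestamps are kept as text in a fixed format so that SQLite's string comparison orders them correctly. The LEFT JOIN keeps BUY rows that have no matching SELL row.

```python
import sqlite3
import pandas as pd

# Sample rows copied from the question; times stay as fixed-format strings,
# which compare correctly as text inside SQLite.
df_B = pd.DataFrame({
    "id": [1234, 1564, 7535, 9999],
    "start_time": ["2021-01-01 16:00:00.100000", "2021-01-01 16:05:00.100000",
                   "2021-01-01 16:40:00.100000", "2021-01-01 16:44:00.100000"],
    "end_time": ["2021-01-01 16:02:00.100000", "2021-01-01 16:10:00.100000",
                 "2021-01-01 16:55:00.100000", "2021-01-01 16:45:00.100000"],
    "side": "BUY",
    "cost": [100, 111, 124, 128],
})
df_S = pd.DataFrame({
    "id": [5366, 4533, 4532, 5827],
    "start_time": ["2021-01-01 16:00:00.100000", "2021-01-01 16:05:00.100000",
                   "2021-01-01 16:20:00.100000", "2021-01-01 16:30:00.100000"],
    "end_time": ["2021-01-01 16:02:00.100000", "2021-01-01 16:08:00.100000",
                 "2021-01-01 16:50:00.100000", "2021-01-01 16:35:00.100000"],
    "side": "SELL",
    "cost": [100, 105, 122, 123],
})

con = sqlite3.connect(":memory:")
# "buys" and "sells" are hypothetical table names for this sketch.
df_B.to_sql("buys", con, index=False)
df_S.to_sql("sells", con, index=False)

# 1:m matching: one output row per (BUY, SELL) pair that meets both conditions.
query = """
SELECT b.*,
       s.id         AS id_S,
       s.start_time AS start_time_S,
       s.end_time   AS end_time_S,
       s.side       AS side_S,
       s.cost       AS cost_S
FROM buys AS b
LEFT JOIN sells AS s
  ON s.cost <= b.cost
 AND s.start_time <= b.end_time
ORDER BY b.id, s.id
"""
result = pd.read_sql_query(query, con)
con.close()
print(result[["id", "cost", "id_S", "cost_S"]])
```

For data that does not fit in memory, the same query can be run against a file-backed database instead of ":memory:".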
