pandas groupby a given plus/minus date spread/range on matching absolute values of a column-CodePudding

I've tried to select an example slice of the type of data I am working with:

df = pd.DataFrame({'date' : pd.to_datetime(['2014-10-02', '2014-10-02', '2014-10-02', '2014-10-02', '2014-11-30', '2014-11-30', '2015-04-02', '2015-04-02', '2015-04-03', '2015-04-03', '2015-04-20', '2015-04-20', '2015-07-05', '2015-07-05', '2021-06-12', '2021-06-14', '2021-06-15', '2021-06-18', '2021-07-06', '2021-07-06', '2021-07-20', '2021-07-20', '2021-07-26', '2021-08-19', '2021-08-20', '2021-12-14', '2021-12-15']),
                   'amount' : [-1111.11, -1000.0, 1111.11, 1000.0, -1000.0, 1000.0, -193.36, 193.36, -813.02, 813.02, -250.0, 250.0, -100.0, 100.0, 20.0, -4643.53, 4643.53, 20.0, -4762.65, 4762.65, -3729.61, 3729.61, 32.0, -1075.99, 1075.99, -1033.94, 1033.94]})

      date   amount
2014-10-02 -1111.11
2014-10-02 -1000.00
2014-10-02  1111.11
2014-10-02  1000.00
2014-11-30 -1000.00
2014-11-30  1000.00
2015-04-02  -193.36
2015-04-02   193.36
2015-04-03  -813.02
2015-04-03   813.02
2015-04-20  -250.00
2015-04-20   250.00
2015-07-05  -100.00
2015-07-05   100.00
2021-06-12    20.00
2021-06-14 -4643.53
2021-06-15  4643.53
2021-06-18    20.00
2021-07-06 -4762.65
2021-07-06  4762.65
2021-07-20 -3729.61
2021-07-20  3729.61
2021-07-26    32.00
2021-08-19 -1075.99
2021-08-20  1075.99
2021-12-14 -1033.94
2021-12-15  1033.94

There are many positive and negative pairs of values in the above dataframe that are within 1-2 days of each other. I would like to ignore any of the pairs that within 1-2 days of each other and isolate the non-paired date amounts.

I've had a few goes at it and I can't seem to figure out how best to define a 1-2 day range to group the dates on, but have had some success with:

g = df.groupby([pd.Grouper(key='date', freq='5D'), df.amount.abs()])
g.size()

date        amount 
2014-10-02  1000.00    2
            1111.11    2
2014-11-26  1000.00    2
2015-03-31  193.36     2
            813.02     2
2015-04-20  250.00     2
2015-07-04  100.00     2
2021-06-12  20.00      1
            4643.53    2
2021-06-17  20.00      1
2021-07-02  4762.65    2
2021-07-17  3729.61    2
2021-07-22  32.00      1
2021-08-16  1075.99    2
2021-12-14  1033.94    2

Although I'm unclear why so many days are needed, but it seems like without a specific offset it is grouping them together from whenever the first date starts.

The best solution I have so far is:

g = df.groupby([df.date.dt.floor('3D'), df.amount.abs()])
g.size()

date        amount 
2014-10-01  1000.00    2
            1111.11    2
2014-11-30  1000.00    2
2015-04-02  193.36     2
            813.02     2
2015-04-20  250.00     2
2015-07-04  100.00     2
2021-06-11  20.00      1
2021-06-14  4643.53    2
2021-06-17  20.00      1
2021-07-05  4762.65    2
2021-07-20  3729.61    2
2021-07-26  32.00      1
2021-08-19  1075.99    2
2021-12-14  1033.94    2

but I am not sure it is a robust enough solution and I would like to better understand how to control the width of the range if that makes sense. For all I know these solutions could be poor hacks that work with my current slice.

thanks for your time.

CodePudding user response：

I hope I understood your question correctly.

You groupby absolute value and the diff() in the transform. The diff() is the difference between current and previous:

df["time_diff"] = df.groupby([df.amount.abs()])["date"].transform(lambda x: x.diff())

This assumes the rows are ordered by date, which they appear to be.

df["time_diff"] = df["time_diff"].dt.days

df["abs_amount"] = df["amount"].abs()
df = df.sort_values(by= ["abs_amount", "date"])
df["drop"] = df["time_diff"].lt(3) | df["time_diff"].shift(-1).lt(3)

The drop column is now set to True for all entries that have less than 3 days difference from the entry with the same value.

We needed to sort by abs_value first so there is no confussion. And we used shift(-1) because the diff() didn't to pair differences.