I've tried to select an example slice of the type of data I am working with:
df = pd.DataFrame({'date' : pd.to_datetime(['2014-10-02', '2014-10-02', '2014-10-02', '2014-10-02', '2014-11-30', '2014-11-30', '2015-04-02', '2015-04-02', '2015-04-03', '2015-04-03', '2015-04-20', '2015-04-20', '2015-07-05', '2015-07-05', '2021-06-12', '2021-06-14', '2021-06-15', '2021-06-18', '2021-07-06', '2021-07-06', '2021-07-20', '2021-07-20', '2021-07-26', '2021-08-19', '2021-08-20', '2021-12-14', '2021-12-15']),
'amount' : [-1111.11, -1000.0, 1111.11, 1000.0, -1000.0, 1000.0, -193.36, 193.36, -813.02, 813.02, -250.0, 250.0, -100.0, 100.0, 20.0, -4643.53, 4643.53, 20.0, -4762.65, 4762.65, -3729.61, 3729.61, 32.0, -1075.99, 1075.99, -1033.94, 1033.94]})
date amount
2014-10-02 -1111.11
2014-10-02 -1000.00
2014-10-02 1111.11
2014-10-02 1000.00
2014-11-30 -1000.00
2014-11-30 1000.00
2015-04-02 -193.36
2015-04-02 193.36
2015-04-03 -813.02
2015-04-03 813.02
2015-04-20 -250.00
2015-04-20 250.00
2015-07-05 -100.00
2015-07-05 100.00
2021-06-12 20.00
2021-06-14 -4643.53
2021-06-15 4643.53
2021-06-18 20.00
2021-07-06 -4762.65
2021-07-06 4762.65
2021-07-20 -3729.61
2021-07-20 3729.61
2021-07-26 32.00
2021-08-19 -1075.99
2021-08-20 1075.99
2021-12-14 -1033.94
2021-12-15 1033.94
There are many positive and negative pairs of values in the above dataframe that are within 1-2 days of each other. I would like to ignore any of the pairs that within 1-2 days of each other and isolate the non-paired date amounts.
I've had a few goes at it and I can't seem to figure out how best to define a 1-2 day range to group the dates on, but have had some success with:
g = df.groupby([pd.Grouper(key='date', freq='5D'), df.amount.abs()])
g.size()
date amount
2014-10-02 1000.00 2
1111.11 2
2014-11-26 1000.00 2
2015-03-31 193.36 2
813.02 2
2015-04-20 250.00 2
2015-07-04 100.00 2
2021-06-12 20.00 1
4643.53 2
2021-06-17 20.00 1
2021-07-02 4762.65 2
2021-07-17 3729.61 2
2021-07-22 32.00 1
2021-08-16 1075.99 2
2021-12-14 1033.94 2
Although I'm unclear why so many days are needed, but it seems like without a specific offset it is grouping them together from whenever the first date starts.
The best solution I have so far is:
g = df.groupby([df.date.dt.floor('3D'), df.amount.abs()])
g.size()
date amount
2014-10-01 1000.00 2
1111.11 2
2014-11-30 1000.00 2
2015-04-02 193.36 2
813.02 2
2015-04-20 250.00 2
2015-07-04 100.00 2
2021-06-11 20.00 1
2021-06-14 4643.53 2
2021-06-17 20.00 1
2021-07-05 4762.65 2
2021-07-20 3729.61 2
2021-07-26 32.00 1
2021-08-19 1075.99 2
2021-12-14 1033.94 2
but I am not sure it is a robust enough solution and I would like to better understand how to control the width of the range if that makes sense. For all I know these solutions could be poor hacks that work with my current slice.
thanks for your time.
CodePudding user response:
I hope I understood your question correctly.
You groupby absolute value and the diff() in the transform. The diff() is the difference between current and previous:
df["time_diff"] = df.groupby([df.amount.abs()])["date"].transform(lambda x: x.diff())
This assumes the rows are ordered by date, which they appear to be.
df["time_diff"] = df["time_diff"].dt.days
df["abs_amount"] = df["amount"].abs()
df = df.sort_values(by= ["abs_amount", "date"])
df["drop"] = df["time_diff"].lt(3) | df["time_diff"].shift(-1).lt(3)
The drop column is now set to True for all entries that have less than 3 days difference from the entry with the same value.
We needed to sort by abs_value first so there is no confussion. And we used shift(-1) because the diff() didn't to pair differences.
