I have the following dataframe:
df dataframe:
item date_buy date_sell profit window
1 shoes 2009-12-04 2021-08-14 0.22 10
2 shoes 2009-12-05 2010-09-19 1.5 10
3 shoes 2015-05-05 2020-15-15 7.3 10
4 shoes 2009-12-09 2021-08-14 0.82 4
5 shoes 2009-12-10 2010-09-20 4.5 4
6 shoes 2015-05-11 2020-15-16 1.8 4
7 hat 2009-12-04 2021-08-14 1.2 10
8 hat 2009-12-05 2010-09-19 2.25 10
9 hat 2015-05-05 2020-15-15 4.3 10
10 hat 2009-12-09 2021-08-14 3.2 4
11 hat 2009-12-10 2010-09-20 9.4 4
12 hat 2015-05-11 2020-15-16 1.8 4
What I need to do is to resample the data until today using data_buy as a key and separating the data by item and window. What I did is grouping my data by item and window, for each group I add the an extra row exactly as the last of the group changing only data_buy field with today date and then resample but the execution is extremely slow since I have several thousands of data.
this is my code:
data = data.set_index(pd.to_datetime(data ['date_buy']))
resampled_data = data.groupby(['item', 'window']).apply(lambda x: resample(x, now())
def resample(df, today):
df = pd.concat([df, df[df.index==df.index.max()].rename(index={df.index.max(): pd.to_datetime(today)})])
df = df.asfreq('B', method='ffill')
return df
the result is correct and is the following (it's similar for the item hat):
df dataframe:
item date_buy date_sell profit window
1 shoes 2009-12-04 2021-08-14 0.22 10
2 shoes 2009-12-05 2010-09-19 1.5 10
.
2 shoes 2015-05-04 2010-09-19 1.5 10
3 shoes 2015-05-05 2020-15-15 7.3 10
.
.
3 shoes 2022-09-15 2020-15-15 7.3 10
4 shoes 2009-12-09 2021-08-14 0.82 4
5 shoes 2009-12-10 2010-09-20 4.5 4
.
5 shoes 2015-05-10 2010-09-20 4.5 4
6 shoes 2015-05-11 2020-15-16 1.8 4
.
.
6 shoes 2022-09-15 2020-15-16 1.8 4
This snippet takes about 30s to execute and I wanted to make it faster. Am I missing some pandas best practice to make it faster?
CodePudding user response:
This will work I think, although it will still not be super performant. Adding today's date each time seems to me like something that can't be vectorised (I may be wrong though). But using concat is expensive, replacing it with reindex is definitely going to gain you some performance.
data = data.set_index(pd.to_datetime(data['date_buy']))
def my_resample(grp):
return (grp
.reindex(grp.index.tolist() [pd.Timestamp.today().floor('D')])
.fillna(method='ffill')
.asfreq('B', method='ffill'))
(data
.groupby(['item', 'window'])
.apply(my_resample)
.reset_index(level=2)
.reset_index(drop=True)
)
CodePudding user response:
I might have a solution without .apply:
First step - Create a dataframe end_data that contains the closing date_buy entries for each item-window group:
today = pd.Timestamp.today().floor('D')
end_data = (
data
.groupby(['item', 'window'], as_index=False)
.agg({'date_buy': lambda c: today})
)
For your sample that looks like:
item window date_buy
0 hat 4 2022-09-15
1 hat 10 2022-09-15
2 shoes 4 2022-09-15
3 shoes 10 2022-09-15
Second step:
data['date_buy'] = pd.to_datetime(data['date_buy']) # Just in case
data = (
pd.concat([data, end_data])
.set_index('date_buy', drop=True).sort_index()
.groupby(['item', 'window'], as_index=False).resample('B').ffill()
.fillna(method='ffill')
.droplevel(0).reset_index()
)
- Cast column
date_buyintodatetimes (might already be the case). - Append
end_dataat the end ofdata. - Use column
date_buyas index (drop the column), and then sort the index. The sorting is only necessary if theitem-windowblocks ofdate_buyaren't already in ascending order. - Now group the result by
item-window,.resample('B')on the groups to upsample according to your requirement, and use.ffillon the results. - Then fill the remaining
NaN/NaTs via forward fill. - Finally drop the first index level, and reset the upsampled
date_buy-index as a column.
The result for your sample looks like:
date_buy item date_sell profit window
0 2009-12-09 hat 2021-08-14 3.2 4
1 2009-12-10 hat 2010-09-20 9.4 4
2 2009-12-11 hat 2010-09-20 9.4 4
3 2009-12-14 hat 2010-09-20 9.4 4
4 2009-12-15 hat 2010-09-20 9.4 4
... ... ... ... ... ...
13329 2022-09-09 shoes 2020-15-15 7.3 10
13330 2022-09-12 shoes 2020-15-15 7.3 10
13331 2022-09-13 shoes 2020-15-15 7.3 10
13332 2022-09-14 shoes 2020-15-15 7.3 10
13333 2022-09-15 shoes 2020-15-15 7.3 10
[13334 rows x 5 columns]
(Column date_sell contains invalid dates.)
