I have two pandas DataFrames one with floats and one with corresponding dates. I am trying to 'pivot' the one containing floats df1 using the dates in df2 and sum along the rows if more than one date is occurring.
I have tried methods using Pandas and NumPy both together and separately, the results are correct, however the code can be very slow 1.2ms per loop over the DataFrame. Does anyone know how this same task can be achieved with more speed efficient code so that a 1,000,000 row dataset could be passed through this?
I have got close with this code:
l = []
for i in range(len(df1)):
df_g = pd.DataFrame([df1.loc[i].values], columns=df2.loc[i].values)
k = df_g.groupby(df_g.columns, axis=1).sum()
l.append(k)
df_concatenated = pd.concat(l)
Example Data:
df1 floats
0 1 2 3
0 170.590870 55.003686 18.500252 34.174595
1 170.590870 66.790190 12.377866 34.174595
2 140.922893 49.581099 18.500252 34.174595
3 170.590870 49.581099 18.500252 34.174595
df2
0 1 2 3
0 30/06/2023 31/03/2023 25/10/2022 31/12/2023
1 30/06/2024 31/12/2022 25/10/2022 25/10/2022
2 31/12/2022 31/03/2023 25/10/2022 31/12/2023
3 31/12/2022 31/03/2023 25/10/2022 31/12/2023
Expected Output
25/10/2022 31/12/2022 31/03/2023 30/06/2023 31/12/2023 30/06/2024
0 18.500252 NaN 55.003686 170.590870 34.174595 NaN
1 46.552461 66.79019 NaN NaN NaN 170.59087
2 18.500252 140.922893 49.581099 NaN 34.174595 NaN
3 18.500252 170.59087 49.581099 NaN 34.174595 NaN
CodePudding user response:
use concat
pd.concat([df1, df2], axis=1, keys=['value', 'date'])
output:
value date
0 1 2 3 0 1 2 3
0 170.590870 55.003686 18.500252 34.174595 30/06/2023 31/03/2023 25/10/2022 31/12/2023
1 170.590870 66.790190 12.377866 34.174595 30/06/2024 31/12/2022 25/10/2022 25/10/2022
2 140.922893 49.581099 18.500252 34.174595 31/12/2022 31/03/2023 25/10/2022 31/12/2023
3 170.590870 49.581099 18.500252 34.174595 31/12/2022 31/03/2023 25/10/2022 31/12/2023
unpivot above output and retry pivot
df = pd.concat([df1, df2], axis=1, keys=['value', 'date']).stack().reset_index()
out = (df.pivot_table('value', index='level_0', columns='date', aggfunc='sum')
.rename_axis('').sort_index(axis=1, key=lambda x: pd.to_datetime(x)))
out
date 25/10/2022 31/12/2022 31/03/2023 30/06/2023 31/12/2023 30/06/2024
0 18.500252 NaN 55.003686 170.59087 34.174595 NaN
1 46.552461 66.790190 NaN NaN NaN 170.59087
2 18.500252 140.922893 49.581099 NaN 34.174595 NaN
3 18.500252 170.590870 49.581099 NaN 34.174595 NaN
if dtype of df2 is datetime, don need sort_index
