I have two dataframes loaded from a .csv file. One contains numeric values, the other dates (month-year) for when these numeric values occured. The dates and values are basically mapped to each other. I would like to combine/merge these dataframes to have the dates as the column, and values as the rows. However, as you can see, the dates, though ordered from left to right, they don't all start on the same month.
import pandas as pd
df1 = pd.DataFrame(
[
[1, 2, pd.NA, pd.NA, pd.NA],
[2, 3, 4, pd.NA, pd.NA],
[4, 5, 6, pd.NA, pd.NA],
[5, 6, 12, 14, 15]
]
)
df2 = pd.DataFrame(
[
["2021-01", "2021-02", pd.NA, pd.NA, pd.NA],
["2021-02", "2021-03", "2021-04", pd.NA, pd.NA],
["2022-03", "2022-04", "2022-05", pd.NA, pd.NA],
["2021-04", "2021-05", "2021-06", "2021-07", "2021-08"]
]
)
df1
df2
Although I managed to create the combined dataframe, the dataframes, df1 and df2 contain ~300k rows, and the approach I thought of is rather slow. Is there a more efficient way of achieving the same result?
q = {z: {x: y for x, y in zip(df2.values[z], df1.values[z]) if not pd.isna(y)} for z in range(len(df2))}
df = pd.DataFrame.from_dict(q, orient='index')
idx = pd.to_datetime(df.columns, errors='coerce', format='%Y-%m').argsort()
df.iloc[:, idx]
df3 (result)
CodePudding user response:
You can stack, concat and pivot:
(pd.concat([df1.stack(), df2.stack()], axis=1)
.reset_index(level=0)
.pivot(index='level_0', columns=1, values=0)
.rename_axis(index=None, columns=None)
)
Alternative with unstack:
(pd.concat([df1.stack(), df2.stack()], axis=1)
.droplevel(1).set_index(1, append=True)
[0].unstack(1)
.rename_axis(columns=None)
)
output:
2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 2022-03 2022-04 2022-05
0 1 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN 2 3 4 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
3 NaN NaN NaN 5 6 12 14 15 NaN NaN NaN
CodePudding user response:
Use concat with keys parameters, so possible after DataFrame.stack and convert MutiIndex to column use DataFrame.pivot:
df = (pd.concat([df1, df2], axis=1, keys=['a','b'])
.stack()
.reset_index()
.pivot('level_0','b','a'))
print (df)
b 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 \
level_0
0 1 2 NaN NaN NaN NaN NaN NaN
1 NaN 2 3 4 NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN 5 6 12 14 15
b 2022-03 2022-04 2022-05
level_0
0 NaN NaN NaN
1 NaN NaN NaN
2 4 5 6
3 NaN NaN NaN



