In short, I want each unique value of the "_ts"-suffixed columns converted into a row index. I intend to use the 'ts' and 'id' columns as a multi-index.
import pandas as pd

rows = [{'id':1, 'a_ts':'2020-10-02','a_energy':6,'a_money':2,'b_ts':'2020-10-02', 'b_color':'blue'},
        {'id':2, 'a_ts':'2020-02-02','a_energy':2,'a_money':5, 'a_color':'orange', 'b_ts':'2012-08-11', 'b_money':10, 'b_color':'blue'},
        {'id':3,'a_ts':'2011-02-02', 'a_energy':4}]
df = pd.DataFrame(rows)
   id        a_ts  a_energy  a_money        b_ts b_color a_color  b_money
0   1  2020-10-02         6      2.0  2020-10-02    blue     NaN      NaN
1   2  2020-02-02         2      5.0  2012-08-11    blue  orange     10.0
2   3  2011-02-02         4      NaN         NaN     NaN     NaN      NaN
I want my output to look something like this.
               energy  money   color
id ts
1  2020-10-02     6.0    2.0    blue
2  2020-02-02     2.0    5.0  orange
   2012-08-11     NaN   10.0    blue
3  2011-02-02     4.0    NaN     NaN
The best I could come up with was splitting the column names on the underscore and resetting the index, but that creates rows where the id and timestamp are NaN.
I cannot simply create rows with NaNs and then drop all of them, because I would lose information about which ids had no timestamp and which timestamps had no matching id (the dataframes are the result of a join).
df.columns = df.columns.str.split("_", expand=True)
df = df.stack().reset_index(drop=True)
Answer:
Use:
df = df.set_index(['id'])
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(0).reset_index(level=-1,drop=True).reset_index()
print(df)
   id   color  energy  money          ts
0   1     NaN     6.0    2.0  2020-10-02
1   1    blue     NaN    NaN  2020-10-02
2   2  orange     2.0    5.0  2020-02-02
3   2    blue     NaN   10.0  2012-08-11
4   3     NaN     4.0    NaN  2011-02-02
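To see why the split step matters, here is a minimal sketch of what `str.split("_", expand=True)` does to flat column names. The four column names are just a subset of the question's columns, picked for illustration; the point is that the result is a two-level MultiIndex whose first level (`a`/`b`) can then be stacked into rows.

```python
import pandas as pd

# A subset of the question's column names, chosen for illustration.
cols = pd.Index(['a_ts', 'a_energy', 'b_ts', 'b_color'])

# On an Index, expand=True returns a MultiIndex: one level per split part.
multi = cols.str.split('_', expand=True)

print(multi.nlevels)
print(list(multi))
```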
Then, within each (id, ts) group, drop the NaNs from every column so the remaining values shift up into a single row, using a custom lambda function:
f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
df = df.set_index(['id','ts']).groupby(['id','ts']).apply(f).droplevel(-1)
print(df)
                color  energy  money
id ts
1  2020-10-02    blue     6.0    2.0
2  2012-08-11    blue     NaN   10.0
   2020-02-02  orange     2.0    5.0
3  2011-02-02     NaN     4.0    NaN
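For comparison, here is a self-contained sketch of a different route to the same table. It is not the method from the answer above: it swaps `stack` and the lambda for a loop over the column prefixes plus `groupby(...).first()`, which takes the first non-null value per column in each group. The helper name `to_long` and its parameters are hypothetical, and it assumes every column other than `id` has the form `<prefix>_<name>`.

```python
import pandas as pd

rows = [{'id': 1, 'a_ts': '2020-10-02', 'a_energy': 6, 'a_money': 2,
         'b_ts': '2020-10-02', 'b_color': 'blue'},
        {'id': 2, 'a_ts': '2020-02-02', 'a_energy': 2, 'a_money': 5,
         'a_color': 'orange', 'b_ts': '2012-08-11', 'b_money': 10,
         'b_color': 'blue'},
        {'id': 3, 'a_ts': '2011-02-02', 'a_energy': 4}]
df = pd.DataFrame(rows)


def to_long(frame, id_col='id', sep='_'):
    """Hypothetical helper: one row per (id, ts) pair."""
    wide = frame.set_index(id_col)
    # Assumption: every remaining column looks like '<prefix><sep><name>'.
    prefixes = sorted({c.split(sep, 1)[0] for c in wide.columns})
    parts = []
    for p in prefixes:
        cols = [c for c in wide.columns if c.startswith(p + sep)]
        # Strip the prefix so 'a_ts' and 'b_ts' both become 'ts'.
        parts.append(wide[cols].rename(columns=lambda c: c[len(p + sep):]))
    # Put the per-prefix blocks under each other.
    long = pd.concat(parts).reset_index()
    # groupby skips rows whose key (here ts) is NaN by default;
    # first() keeps the first non-null value of each column per group.
    return long.groupby([id_col, 'ts']).first()


result = to_long(df)
print(result)
```

One difference to keep in mind: if a group holds several non-null values in the same column, `first()` keeps only the first one, whereas the lambda approach keeps them all as extra rows. If rows with a missing timestamp must be kept rather than skipped, passing `dropna=False` to `groupby` (available since pandas 1.1) may be worth trying.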
