collapse similarly prefixed columns in pandas dataframe, convert into row_index

Time:02-02

In short, I want each unique value of the "_ts"-suffixed columns (a_ts, b_ts) converted into a row index. I intend to use the 'ts' and 'id' columns as a MultiIndex.

import pandas as pd

rows = [{'id':1, 'a_ts':'2020-10-02', 'a_energy':6, 'a_money':2, 'b_ts':'2020-10-02', 'b_color':'blue'},
        {'id':2, 'a_ts':'2020-02-02', 'a_energy':2, 'a_money':5, 'a_color':'orange', 'b_ts':'2012-08-11', 'b_money':10, 'b_color':'blue'},
        {'id':3, 'a_ts':'2011-02-02', 'a_energy':4}]
df = pd.DataFrame(rows)
   id        a_ts  a_energy  a_money        b_ts b_color a_color  b_money
0   1  2020-10-02         6      2.0  2020-10-02    blue     NaN      NaN
1   2  2020-02-02         2      5.0  2012-08-11    blue  orange     10.0
2   3  2011-02-02         4      NaN         NaN     NaN     NaN      NaN

I want my output to look something like this.

               energy  money   color
id ts                               
1  2020-10-02     6.0    2.0    blue
2  2020-02-02     2.0    5.0  orange
   2012-08-11     NaN   10.0    blue
3  2011-02-02     4.0    NaN     NaN

The best I could come up with was splitting the column names on an underscore and resetting the index, but that creates rows where the id and timestamp are NaN.

I cannot simply create the rows with NaNs and then drop them, because I would lose track of which ids had no timestamp and which timestamps had no matching id (the dataframes are the result of a join).

df.columns = df.columns.str.split("ts_", expand=True)
df = df.stack().reset_index(drop=True)

Answer:

Use:

df = df.set_index(['id'])

df.columns = df.columns.str.split("_", expand=True)
df = df.stack(0).reset_index(level=-1,drop=True).reset_index()
print (df)
   id   color  energy  money          ts
0   1     NaN     6.0    2.0  2020-10-02
1   1    blue     NaN    NaN  2020-10-02
2   2  orange     2.0    5.0  2020-02-02
3   2    blue     NaN   10.0  2012-08-11
4   3     NaN     4.0    NaN  2011-02-02
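
To see what the column split in this step produces, here is a minimal sketch on a few of the column names. The name 'a_unit_cost' is hypothetical (it is not in the question's data) and is only there to show that passing n=1 keeps everything after the first underscore together, which a plain split on "_" would not.

```python
import pandas as pd

# 'a_unit_cost' is a made-up column name used only for illustration.
cols = pd.Index(['a_ts', 'a_energy', 'b_ts', 'b_color', 'a_unit_cost'])

# expand=True turns each split name into one tuple of a MultiIndex:
# level 0 holds the prefix (a / b), level 1 holds the field name.
multi = cols.str.split('_', n=1, expand=True)

print(multi.tolist())
# [('a', 'ts'), ('a', 'energy'), ('b', 'ts'), ('b', 'color'), ('a', 'unit_cost')]
```

Stacking level 0 then moves the prefix into the row index, so every field name (ts, energy, money, color) ends up as a single column.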

Then, within each (id, ts) group, drop the NaNs from every column and shift the remaining values up, so that rows sharing a timestamp collapse into one. A custom lambda does this:

f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
df = df.set_index(['id','ts']).groupby(['id','ts']).apply(f).droplevel(-1)
print (df)
                color  energy  money
id ts                               
1  2020-10-02    blue     6.0    2.0
2  2012-08-11    blue     NaN   10.0
   2020-02-02  orange     2.0    5.0
3  2011-02-02     NaN     4.0    NaN
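
As a possible alternative to the custom lambda (this is a swapped-in technique, not the method from the answer above), groupby(...).first() takes the first non-null value of each column per group. The sketch below is self-contained; the names wide, long and result are only illustrative. It assumes each (id, ts) group has at most one non-null value per column, as in the sample data; if a group had several, only the first would be kept, whereas the lambda would keep extra rows.

```python
import pandas as pd

rows = [
    {'id': 1, 'a_ts': '2020-10-02', 'a_energy': 6, 'a_money': 2,
     'b_ts': '2020-10-02', 'b_color': 'blue'},
    {'id': 2, 'a_ts': '2020-02-02', 'a_energy': 2, 'a_money': 5, 'a_color': 'orange',
     'b_ts': '2012-08-11', 'b_money': 10, 'b_color': 'blue'},
    {'id': 3, 'a_ts': '2011-02-02', 'a_energy': 4},
]

# Keep 'id' out of the columns so that only prefixed names get split.
wide = pd.DataFrame(rows).set_index('id')
wide.columns = wide.columns.str.split('_', n=1, expand=True)

# Move the a/b prefix level into the rows, then discard it.
# Note: recent pandas 2.x releases may emit a FutureWarning about stack here.
long = wide.stack(0).reset_index(level=-1, drop=True).reset_index()

# first() returns the first non-null value per column in each (id, ts) group.
# Rows whose 'ts' is NaN are dropped by groupby's default dropna=True.
result = long.groupby(['id', 'ts']).first()[['energy', 'money', 'color']]
print(result)
```

If rows with a missing timestamp need to be kept, passing dropna=False to groupby (available since pandas 1.1) should retain NaN keys as their own groups, although whether an all-NaN row survives the stack step can depend on the pandas version.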