Problem statement
I am trying to turn a dataframe with a table that crosses categories with items at different dates into a dataframe with dates in the index and multiindex columns with categories as first level and items as the second level. Below you can find an example of original and target tables.
Is there a pythonic way to achieve this in Pandas?
Input table
- All of the table is in the dataFrame data. Indices and columns are just integers*
| Category A | Category B | Category C | |
|---|---|---|---|
| 31/12/2020 | |||
| item one | 10 | 100 | 1000 |
| item two | 20 | 200 | 2000 |
| 31/12/2021 | |||
| item one | 30 | 300 | 3000 |
| item two | 40 | 400 | 4000 |
Desired output table
Dates should be in the index and columns should be a multiindex of Category and item
| Category A | Category B | Category C | ||||
|---|---|---|---|---|---|---|
| item one | item two | item one | item two | item one | item two | |
| 31/12/2020 | 10 | 20 | 100 | 200 | 1000 | 2000 |
| 31/12/2021 | 30 | 40 | 300 | 400 | 3000 | 4000 |
Python code to create input dataframe
import pandas as pd
input_df = pd.DataFrame(
index = [0,1,2,3,4,5,6],
columns = [0,1,2,3],
data = [
["","Category A","Category B","Category C"],
["31/12/2020","","",""],
["item one",10,100,1000],
["item two",20,200,2000],
["31/12/2021","","",""],
["item one",30,300,3000],
["item two",40,400,4000],
]
)
CodePudding user response:
It's a complex reshape with a central pivot. I used the ability to be converted to date to identify the groups of rows:
out = (input_df
.T.set_index(0).T
.assign(date=lambda d: pd.to_datetime(d[''], errors='coerce').ffill())
.loc[lambda d: pd.to_datetime(d[''], errors='coerce').ne(d['date'])]
.pivot(index='date', columns='')
.rename_axis(index=None, columns=(None, None))
)
Output:
Category A Category B Category C
item one item two item one item two item one item two
2020-12-31 10 20 100 200 1000 2000
2021-12-31 30 40 300 400 3000 4000
CodePudding user response:
Here's another way. Create a dummy Series from the first column depending on if it contains '/' or not (helps identify dates). Then create groups depending on dates. Then create a mapper that maps groups to dates using groupby_first and use it to create a "date" column. Finally, select relevant rows and unstack to get the desired shape:
input_df = input_df.rename(columns=input_df.iloc[0]).drop(input_df.index[0])
check_for_items = input_df[''].str.contains(r'/')
input_df['date'] = check_for_items.cumsum()
input_df['date'] = input_df['date'].map(input_df.groupby('date')[''].first())
out = input_df[~check_for_items].set_index(['date','']).unstack().reset_index()
Output:
date Category A Category B Category C
item one item two item one item two item one item two
0 31/12/2020 10 20 100 200 1000 2000
1 31/12/2021 30 40 300 400 3000 4000
