Home > database >  Reshaping dataframe from categories cross-table to one with multi-index in the columns
Reshaping dataframe from categories cross-table to one with multi-index in the columns

Time:02-02

Problem statement

I am trying to turn a dataframe with a table that crosses categories with items at different dates into a dataframe with dates in the index and multiindex columns with categories as first level and items as the second level. Below you can find an example of original and target tables.

Is there a pythonic way to achieve this in Pandas?

Input table

  • All of the table is in the dataFrame data. Indices and columns are just integers*
Category A Category B Category C
31/12/2020
item one 10 100 1000
item two 20 200 2000
31/12/2021
item one 30 300 3000
item two 40 400 4000

Desired output table

Dates should be in the index and columns should be a multiindex of Category and item

Category A Category B Category C
item one item two item one item two item one item two
31/12/2020 10 20 100 200 1000 2000
31/12/2021 30 40 300 400 3000 4000

Python code to create input dataframe

import pandas as pd

input_df = pd.DataFrame(
              index = [0,1,2,3,4,5,6],
              columns = [0,1,2,3],
              data = [
                 ["","Category A","Category B","Category C"],
                 ["31/12/2020","","",""],
                 ["item one",10,100,1000],
                 ["item two",20,200,2000],
                 ["31/12/2021","","",""],
                 ["item one",30,300,3000],
                 ["item two",40,400,4000],
              ]
           )

CodePudding user response:

It's a complex reshape with a central pivot. I used the ability to be converted to date to identify the groups of rows:

out = (input_df
 .T.set_index(0).T
 .assign(date=lambda d: pd.to_datetime(d[''], errors='coerce').ffill())
 .loc[lambda d: pd.to_datetime(d[''], errors='coerce').ne(d['date'])]
 .pivot(index='date', columns='')
 .rename_axis(index=None, columns=(None, None))
)

Output:

           Category A          Category B          Category C         
             item one item two   item one item two   item one item two
2020-12-31         10       20        100      200       1000     2000
2021-12-31         30       40        300      400       3000     4000

CodePudding user response:

Here's another way. Create a dummy Series from the first column depending on if it contains '/' or not (helps identify dates). Then create groups depending on dates. Then create a mapper that maps groups to dates using groupby_first and use it to create a "date" column. Finally, select relevant rows and unstack to get the desired shape:

input_df = input_df.rename(columns=input_df.iloc[0]).drop(input_df.index[0])
check_for_items = input_df[''].str.contains(r'/')
input_df['date'] = check_for_items.cumsum()
input_df['date'] = input_df['date'].map(input_df.groupby('date')[''].first())
out = input_df[~check_for_items].set_index(['date','']).unstack().reset_index()

Output:

         date Category A          Category B          Category C         
                item one item two   item one item two   item one item two
0  31/12/2020         10       20        100      200       1000     2000
1  31/12/2021         30       40        300      400       3000     4000
  •  Tags:  
  • Related