Home > Enterprise >  How do I reindex a MultiIndex with additional Rows for only one Index Level?
How do I reindex a MultiIndex with additional Rows for only one Index Level?

Time:01-05

I have the following dataframe:

                             volume
month      source   brand    
2020-01-01     SA       BA        5
2020-02-01     SA       BA       10
2020-02-01     SA       BB        5
2020-01-01     SB       BC        5
2020-02-01     SB       BC       10

I want to create a dataframe/Multiindex that has a row for every single date that apprears in any of the records. I want to use fill_value=0 for the volumne column.

However I don't want to add any other combinations of Index levels eg. I do not want to add a row for an index with Source SA and Brand BC as there is no known combination of the two columns.

                             volume
month      source   brand    
2020-01-01     SA       BA        5
2020-02-01     SA       BA       10
2020-01-01     SA       BB        0 # Row to be added. 
2020-02-01     SA       BB        5
2020-01-01     SB       BC        5
2020-02-01     SB       BC       10

I've done this using windowing without index but it's incredibly slow (this df quite big).

I've tried to do this using this approach: How to reindex a MultiIndex dataframe with a setup that looks like this:

df_dates = df.groupby(['month']).sum() # df is the df with just a range index. 
idx = df_b.index # df_b is the existing df with MultiIndex and missing rows.  
ilen = len(idx.levels)
list(range(ilen-1))
new_index_cols = [idx.levels[i] for i in range(ilen - 1)]
new_index_cols.append(df_dates.index)
new_index = pd.MultiIndex.from_product(
  new_index_cols,
  names=index_columns_b
)
df_b.reindex(new_index, fill_value=0)

But I realize that that MultiIndex.from_product would show every single combination of all index columns, which I don't want to achieve and which also would need some magnitudes more memory.

The way I see it I could potentially manipulate the full index columns myself by using index.get_level_values(i) and MultiIndex.from_array but I was hoping to find a more trivial process than that.

The process has to be generic as I need to apply it dataframes with different index column values but all share the same volume column and the month column in the index.

CodePudding user response:

When having your dataframe:

                            volume
month       source  brand   
2020-01-01  A       A       5.0
2020-02-01  A       A       10.0
2020-02-01  A       B       5.0

I believe you want a row for each unique combination of month, source, brand?

Does this help?

months = df.index.unique(level=0)
brands = df.index.unique(level=1)
source = df.index.unique(level=2)

df2 = pd.DataFrame(
    index = pd.MultiIndex.from_product(
        [months, source, brands])
    ).rename_axis(['month','source','brand'])
df2.merge(df, left_index=True, right_index=True, how= 'left').fillna(0)

This yields:

                                volume
    month       source  brand   
    2020-01-01  A       A       5.0
    2020-01-01  A       B       0.0
    2020-02-01  A       A       10.0
    2020-02-01  A       B       5.0

If you need all months, even when they do not occur anywhere, you could use:

months = pd.date_range(months.min(), months.max(), freq='MS')

CodePudding user response:

I found my solution eventually though it's more complex than what I liked. I added it in a generic function:

def index_fill_missing(df, index_cols, fill_col, fill_value=0):
    """
    Finds all the unique values of the collumn 'fill_col' in df and 
    returns a dataframe with an index based on index_cols   fill_col where the
    a new row is added for any rows where the value in fill_col did not 
    previously exist in the dataframe. 

    The additional values are set to the value of the parameter 'fill_value'

    Parameters: 
    df pandas.DataFrame: the dataframe 
    index_cols list(str): the list of column names to use in the index column
    fill_col (str): the column name for which all values should appear in every 
    single index. 
    fill_value (any): the value to fill the metric columns in new rows. 

    Returns: 
    pandasPdateframe: DataFrame with MultiINdex and additional rows.
    """
    # Get unique values for the fill_col.
    fill_val_list = df[fill_col].unique().tolist()
    # Create a dataframe with the reduced index and get a list of tuples 
    # with the index values. 
    df_i = df.set_index(index_cols)
    df_i_tup = df_i.index.unique().tolist()
    # Append the fill col values to each and every of these index tuples. 
    df_f_tup = []
    col_names = list(index_cols)
    col_names.append(fill_col)
    print(col_names)
    for tup in df_i_tup:
        for fill_val in fill_val_list:
            df_f_tup.append(tup   (fill_val,))
    # Create an index based on these tuples and reindex the dataframe. 
    idx_f = pd.MultiIndex.from_tuples(df_f_tup, names=col_names)
    print(idx_f)
    # We can only reindex if there are no duplicate values 
    # Hence the groupby with sum function. 
    df_g = df.groupby(by=col_names).sum()
    df_f = df_g.reindex(index=idx_f, fill_value=fill_value)
    return df_f

Creating the sample dataframe:

  '2020-01-01', '2020-02-01', 
  '2020-02-01', 
  '2020-01-01', '2020-01-01']
brands = ['BA','BA','BB','BC','BC']
sources = ['SA', 'SA', 'SA', 'SB', 'SB']
volumes1 = [5, 10, 5, 5, 10]
volumes2 = [5, 10, 5, 5, 10]
df = pd.DataFrame(
  list(zip(dates, brands, sources, volumes1, volumes2)), 
  columns=['month', 'brand', 'source', 'volume1', 'volume2']
)
df

Resulting Output:

        month brand source  volume1  volume2
0  2020-01-01    BA     SA        5        5
1  2020-02-01    BA     SA       10       10
2  2020-02-01    BB     SA        5        5
3  2020-01-01    BC     SB        5        5
4  2020-01-01    BC     SB       10       10

And applying the function:

df2 = index_fill_missing(df, ['source', 'brand'], 'month')
df2

Resulting output:

                         volume1  volume2
source brand month                       
SA     BA    2020-01-01        5        5
             2020-02-01       10       10
       BB    2020-01-01        0        0
             2020-02-01        5        5
SB     BC    2020-01-01       15       15
             2020-02-01        0        0
  •  Tags:  
  • Related