How do I reindex a MultiIndex with additional Rows for only one Index Level?-CodePudding

I have the following dataframe:

                             volume
month      source   brand    
2020-01-01     SA       BA        5
2020-02-01     SA       BA       10
2020-02-01     SA       BB        5
2020-01-01     SB       BC        5
2020-02-01     SB       BC       10

I want to create a dataframe/Multiindex that has a row for every single date that apprears in any of the records. I want to use fill_value=0 for the volumne column.

However I don't want to add any other combinations of Index levels eg. I do not want to add a row for an index with Source SA and Brand BC as there is no known combination of the two columns.

                             volume
month      source   brand    
2020-01-01     SA       BA        5
2020-02-01     SA       BA       10
2020-01-01     SA       BB        0 # Row to be added. 
2020-02-01     SA       BB        5
2020-01-01     SB       BC        5
2020-02-01     SB       BC       10

I've done this using windowing without index but it's incredibly slow (this df quite big).

I've tried to do this using this approach: How to reindex a MultiIndex dataframe with a setup that looks like this:

df_dates = df.groupby(['month']).sum() # df is the df with just a range index. 
idx = df_b.index # df_b is the existing df with MultiIndex and missing rows.  
ilen = len(idx.levels)
list(range(ilen-1))
new_index_cols = [idx.levels[i] for i in range(ilen - 1)]
new_index_cols.append(df_dates.index)
new_index = pd.MultiIndex.from_product(
  new_index_cols,
  names=index_columns_b
)
df_b.reindex(new_index, fill_value=0)

But I realize that that MultiIndex.from_product would show every single combination of all index columns, which I don't want to achieve and which also would need some magnitudes more memory.

The way I see it I could potentially manipulate the full index columns myself by using index.get_level_values(i) and MultiIndex.from_array but I was hoping to find a more trivial process than that.

The process has to be generic as I need to apply it dataframes with different index column values but all share the same volume column and the month column in the index.

CodePudding user response：

When having your dataframe:

                            volume
month       source  brand   
2020-01-01  A       A       5.0
2020-02-01  A       A       10.0
2020-02-01  A       B       5.0

I believe you want a row for each unique combination of month, source, brand?

Does this help?

months = df.index.unique(level=0)
brands = df.index.unique(level=1)
source = df.index.unique(level=2)

df2 = pd.DataFrame(
    index = pd.MultiIndex.from_product(
        [months, source, brands])
    ).rename_axis(['month','source','brand'])
df2.merge(df, left_index=True, right_index=True, how= 'left').fillna(0)

This yields:

                                volume
    month       source  brand   
    2020-01-01  A       A       5.0
    2020-01-01  A       B       0.0
    2020-02-01  A       A       10.0
    2020-02-01  A       B       5.0

If you need all months, even when they do not occur anywhere, you could use:

months = pd.date_range(months.min(), months.max(), freq='MS')

CodePudding user response：

I found my solution eventually though it's more complex than what I liked. I added it in a generic function:

def index_fill_missing(df, index_cols, fill_col, fill_value=0):
    """
    Finds all the unique values of the collumn 'fill_col' in df and 
    returns a dataframe with an index based on index_cols   fill_col where the
    a new row is added for any rows where the value in fill_col did not 
    previously exist in the dataframe. 

    The additional values are set to the value of the parameter 'fill_value'

    Parameters: 
    df pandas.DataFrame: the dataframe 
    index_cols list(str): the list of column names to use in the index column
    fill_col (str): the column name for which all values should appear in every 
    single index. 
    fill_value (any): the value to fill the metric columns in new rows. 

    Returns: 
    pandasPdateframe: DataFrame with MultiINdex and additional rows.
    """
    # Get unique values for the fill_col.
    fill_val_list = df[fill_col].unique().tolist()
    # Create a dataframe with the reduced index and get a list of tuples 
    # with the index values. 
    df_i = df.set_index(index_cols)
    df_i_tup = df_i.index.unique().tolist()
    # Append the fill col values to each and every of these index tuples. 
    df_f_tup = []
    col_names = list(index_cols)
    col_names.append(fill_col)
    print(col_names)
    for tup in df_i_tup:
        for fill_val in fill_val_list:
            df_f_tup.append(tup   (fill_val,))
    # Create an index based on these tuples and reindex the dataframe. 
    idx_f = pd.MultiIndex.from_tuples(df_f_tup, names=col_names)
    print(idx_f)
    # We can only reindex if there are no duplicate values 
    # Hence the groupby with sum function. 
    df_g = df.groupby(by=col_names).sum()
    df_f = df_g.reindex(index=idx_f, fill_value=fill_value)
    return df_f

Creating the sample dataframe:

  '2020-01-01', '2020-02-01', 
  '2020-02-01', 
  '2020-01-01', '2020-01-01']
brands = ['BA','BA','BB','BC','BC']
sources = ['SA', 'SA', 'SA', 'SB', 'SB']
volumes1 = [5, 10, 5, 5, 10]
volumes2 = [5, 10, 5, 5, 10]
df = pd.DataFrame(
  list(zip(dates, brands, sources, volumes1, volumes2)), 
  columns=['month', 'brand', 'source', 'volume1', 'volume2']
)
df

Resulting Output:

        month brand source  volume1  volume2
0  2020-01-01    BA     SA        5        5
1  2020-02-01    BA     SA       10       10
2  2020-02-01    BB     SA        5        5
3  2020-01-01    BC     SB        5        5
4  2020-01-01    BC     SB       10       10

And applying the function:

df2 = index_fill_missing(df, ['source', 'brand'], 'month')
df2

Resulting output:

                         volume1  volume2
source brand month                       
SA     BA    2020-01-01        5        5
             2020-02-01       10       10
       BB    2020-01-01        0        0
             2020-02-01        5        5
SB     BC    2020-01-01       15       15
             2020-02-01        0        0