I have the following dataframe:
volume
month source brand
2020-01-01 SA BA 5
2020-02-01 SA BA 10
2020-02-01 SA BB 5
2020-01-01 SB BC 5
2020-02-01 SB BC 10
I want to create a dataframe/Multiindex that has a row for every single date that apprears in any of the records. I want to use fill_value=0 for the volumne column.
However I don't want to add any other combinations of Index levels eg. I do not want to add a row for an index with Source SA and Brand BC as there is no known combination of the two columns.
volume
month source brand
2020-01-01 SA BA 5
2020-02-01 SA BA 10
2020-01-01 SA BB 0 # Row to be added.
2020-02-01 SA BB 5
2020-01-01 SB BC 5
2020-02-01 SB BC 10
I've done this using windowing without index but it's incredibly slow (this df quite big).
I've tried to do this using this approach: How to reindex a MultiIndex dataframe with a setup that looks like this:
df_dates = df.groupby(['month']).sum() # df is the df with just a range index.
idx = df_b.index # df_b is the existing df with MultiIndex and missing rows.
ilen = len(idx.levels)
list(range(ilen-1))
new_index_cols = [idx.levels[i] for i in range(ilen - 1)]
new_index_cols.append(df_dates.index)
new_index = pd.MultiIndex.from_product(
new_index_cols,
names=index_columns_b
)
df_b.reindex(new_index, fill_value=0)
But I realize that that MultiIndex.from_product would show every single combination of all index columns, which I don't want to achieve and which also would need some magnitudes more memory.
The way I see it I could potentially manipulate the full index columns myself by using index.get_level_values(i) and MultiIndex.from_array but I was hoping to find a more trivial process than that.
The process has to be generic as I need to apply it dataframes with different index column values but all share the same volume column and the month column in the index.
CodePudding user response:
When having your dataframe:
volume
month source brand
2020-01-01 A A 5.0
2020-02-01 A A 10.0
2020-02-01 A B 5.0
I believe you want a row for each unique combination of month, source, brand?
Does this help?
months = df.index.unique(level=0)
brands = df.index.unique(level=1)
source = df.index.unique(level=2)
df2 = pd.DataFrame(
index = pd.MultiIndex.from_product(
[months, source, brands])
).rename_axis(['month','source','brand'])
df2.merge(df, left_index=True, right_index=True, how= 'left').fillna(0)
This yields:
volume
month source brand
2020-01-01 A A 5.0
2020-01-01 A B 0.0
2020-02-01 A A 10.0
2020-02-01 A B 5.0
If you need all months, even when they do not occur anywhere, you could use:
months = pd.date_range(months.min(), months.max(), freq='MS')
CodePudding user response:
I found my solution eventually though it's more complex than what I liked. I added it in a generic function:
def index_fill_missing(df, index_cols, fill_col, fill_value=0):
"""
Finds all the unique values of the collumn 'fill_col' in df and
returns a dataframe with an index based on index_cols fill_col where the
a new row is added for any rows where the value in fill_col did not
previously exist in the dataframe.
The additional values are set to the value of the parameter 'fill_value'
Parameters:
df pandas.DataFrame: the dataframe
index_cols list(str): the list of column names to use in the index column
fill_col (str): the column name for which all values should appear in every
single index.
fill_value (any): the value to fill the metric columns in new rows.
Returns:
pandasPdateframe: DataFrame with MultiINdex and additional rows.
"""
# Get unique values for the fill_col.
fill_val_list = df[fill_col].unique().tolist()
# Create a dataframe with the reduced index and get a list of tuples
# with the index values.
df_i = df.set_index(index_cols)
df_i_tup = df_i.index.unique().tolist()
# Append the fill col values to each and every of these index tuples.
df_f_tup = []
col_names = list(index_cols)
col_names.append(fill_col)
print(col_names)
for tup in df_i_tup:
for fill_val in fill_val_list:
df_f_tup.append(tup (fill_val,))
# Create an index based on these tuples and reindex the dataframe.
idx_f = pd.MultiIndex.from_tuples(df_f_tup, names=col_names)
print(idx_f)
# We can only reindex if there are no duplicate values
# Hence the groupby with sum function.
df_g = df.groupby(by=col_names).sum()
df_f = df_g.reindex(index=idx_f, fill_value=fill_value)
return df_f
Creating the sample dataframe:
'2020-01-01', '2020-02-01',
'2020-02-01',
'2020-01-01', '2020-01-01']
brands = ['BA','BA','BB','BC','BC']
sources = ['SA', 'SA', 'SA', 'SB', 'SB']
volumes1 = [5, 10, 5, 5, 10]
volumes2 = [5, 10, 5, 5, 10]
df = pd.DataFrame(
list(zip(dates, brands, sources, volumes1, volumes2)),
columns=['month', 'brand', 'source', 'volume1', 'volume2']
)
df
Resulting Output:
month brand source volume1 volume2
0 2020-01-01 BA SA 5 5
1 2020-02-01 BA SA 10 10
2 2020-02-01 BB SA 5 5
3 2020-01-01 BC SB 5 5
4 2020-01-01 BC SB 10 10
And applying the function:
df2 = index_fill_missing(df, ['source', 'brand'], 'month')
df2
Resulting output:
volume1 volume2
source brand month
SA BA 2020-01-01 5 5
2020-02-01 10 10
BB 2020-01-01 0 0
2020-02-01 5 5
SB BC 2020-01-01 15 15
2020-02-01 0 0
