I have a grouped dataframe like so:
│ product │ category
│ spot1 spot2 spot3 │ spot1 spot2 spot3
──────────┼───────────────────────────┼─────────────────────────────
basket 1 │ NaN apple banana │ NaN fruits fruits
basket 2 │ almond carrot NaN │ nuts veggies NaN
One row represents a "basket" containing different food products (vegtables, fruits, nuts).
Each basket has 3 spots that can either contain a food product or not (=NaN).
I would like the first column of group product to be as populated as possible. That means if there is a NaN value in the first column of the product group and some value in the 2nd or n-th column if should shift to the left for each group.
Categories are related: in the example above a baskets' spot1 of group product and spot1 of group category belong together. Every data combination must have a value for product. If product is NaN then all the related items will be NaN as well.
The output should look something like:
│ product │ category
│ spot1 spot2 spot3 │ spot1 spot2 spot3
──────────┼───────────────────────────┼─────────────────────────────
basket 1 │ apple banana NaN │ fruits fruits NaN <-- this row shifted to left to "fill" first spot of product group
basket 2 │ almond carrot NaN │ nuts veggies NaN
jezrael's answer here was a good starting point for me:
#for each row remove NaNs and create new Series - rows in final df
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
#if possible different number of columns like original df is necessary reindex
df1 = df1.reindex(columns=range(len(df.columns)))
#assign original columns names
df1.columns = df.columns
print (df1)
However, this solution ignores grouping. I only want values to shift left based on the specific group product.
edit / minimal reproducible example
please use this code to get to the "starting point" of problem. The way I get to this point in my production code is more complex but this should do fine.
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[1, 'NaN','NaN'], [1, 'apple','fruits'], [1,'banana', 'fruits'], [2, 'carrot','veggies'], [2, 'almond','nuts']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['basket','product', 'category'])
# print dataframe.
df
dfg = df.groupby(['basket', df.groupby(['basket']).cumcount() 1]).first().unstack().reset_index()
print(dfg)
CodePudding user response:
I trust there is an easier way to accomplish this, but the following should work.
Setup
import pandas as pd
import numpy as np
# make sure that `NaNs` are *actual* `NaNs`, e.g. not "NaN" (this is a string
# like any other); or use `[1, None, None]`
data = [[1, np.nan,np.nan],
[1, 'apple','fruits'],
[1,'banana', 'fruits'],
[2, 'carrot','veggies'],
[2, 'almond','nuts']]
df = pd.DataFrame(data, columns=['basket','product', 'category'])
dfg = df.groupby(['basket', df.groupby(['basket']).cumcount()
1]).first().unstack().reset_index()
print(dfg)
basket product category
1 2 3 1 2 3
0 1 None apple banana None fruits fruits
1 2 carrot almond NaN veggies nuts NaN
Code
dfg.set_index('basket',drop=True, inplace=True)
out = dfg.unstack().dropna()
out.index = pd.MultiIndex.from_arrays([
out.index.get_level_values(0),
out.groupby(level=[2,0]).cumcount().add(1).to_numpy(),
out.index.get_level_values(2)])
out = out.reset_index(drop=False).pivot(index='basket',
columns=['level_0','level_1'],
values=0)\
.reindex(dfg.columns, axis='columns').reset_index(drop=False)
print(out)
basket product category
1 2 3 1 2 3
0 1 apple banana NaN fruits fruits NaN
1 2 carrot almond NaN veggies nuts NaN
Explanation
- First, we use
df.unstack()withSeries.dropnato get a Series with aMultiIndexthat consists ofcol level 0,col level 1,index. I.e.:
out = df.unstack().dropna()
print(out.head(4))
basket
product 1 2 carrot
2 1 apple
2 almond
3 1 banana
- Next, we use
df.groupbyon levels0,2(i.e. originalcol level 0andbasket), and we usecumcountto get consecutive numbers for the items in each group (adding 1:add(1). I.e. we are doing:
print(out.groupby(level=[2,0]).cumcount().add(1).to_numpy())
[1 1 2 2 1 1 2 2]
- We use this result inside
pd.MultiIndex.from_arraysto overwrite theMultiIndex(specificallylevel 1) with a new index. I.e. we now have:
print(out.head(4))
basket
product 1 2 carrot
1 apple
2 2 almond
1 banana
- Now, finally, we can reset the index and use
df.pivotto change the shape ofoutso that it matches the shape of the originaldf. Chainingdf.reindexapplied to the columns will both reset the order of the columns and add all the missing columns (e.g.3atcol level 1for both values incol level 0, and they will be automatically filled with NaNs.
