Here is a dummy DataFrame of my data. It has category rows (identified by a NaN value in 'Price') and data rows (identified by a non-NaN value in 'Price').
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None), ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])
       Name  Price
0  Baseball    NaN
1       Bat    1.0
2     Glove    2.0
3    Soccer    NaN
4     Shoes    3.0
5      Ball    4.0
6  Football    NaN
7    Helmet    6.0
I would like to create a new column 'Sport' that gives, for each data row, the category it falls under, i.e. the most recent sport listed above it. With the category rows removed, the resulting DataFrame would look like this:
     Name  Price     Sport
1     Bat    1.0  Baseball
2   Glove    2.0  Baseball
3   Shoes    3.0    Soccer
4    Ball    4.0    Soccer
5  Helmet    6.0  Football
I was thinking of creating a new column 'Sport' that holds the value of 'Name' where 'Price' is NaN and NaN otherwise, then using ffill or something similar, and finally dropping the rows where 'Price' is NaN. Would that work?
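A minimal sketch of that idea, assuming `Series.where` is used to keep 'Name' only on the category rows; the intermediate name `sport` and the final `result` are placeholders chosen for illustration:

```python
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None),
        ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])

# Keep Name only on category rows (Price is NaN); data rows become NaN.
sport = dummy_df['Name'].where(dummy_df['Price'].isna())
# Carry each sport name down to the data rows beneath it.
dummy_df['Sport'] = sport.ffill()
# Drop the category rows, i.e. the rows without a price.
result = dummy_df.dropna(subset=['Price'])
```

Note that `result` keeps the original row labels (1, 2, 4, 5, 7) rather than being renumbered.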
CodePudding user response:
Try masking 'Name' on the rows where 'Price' is not NaN, then use ffill to get the correct 'Sport':
s = dummy_df['Price'].notna()
dummy_df.assign(Sport=dummy_df['Name'].mask(s).ffill()).loc[s]
Output:
     Name  Price     Sport
1     Bat    1.0  Baseball
2   Glove    2.0  Baseball
4   Shoes    3.0    Soccer
5    Ball    4.0    Soccer
7  Helmet    6.0  Football
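For readers who want to see what each step does, the one-liner above can be unpacked roughly as follows. This is only a sketch: the names `sports_only`, `sport`, and `result` are made up here, and the trailing `reset_index(drop=True)` is an optional addition, not part of the answer, for anyone who wants the rows renumbered from 0.

```python
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None),
        ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])

# True for data rows, False for category rows.
s = dummy_df['Price'].notna()
# Blank out Name on data rows so only the sport names remain.
sports_only = dummy_df['Name'].mask(s)
# Forward-fill so each data row inherits the sport named above it.
sport = sports_only.ffill()
# Attach the column, keep data rows only, and (optionally) renumber from 0.
result = dummy_df.assign(Sport=sport).loc[s].reset_index(drop=True)
```

Because `assign` returns a new DataFrame, `dummy_df` itself is left without a 'Sport' column.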
CodePudding user response:
dummy_df["Sport"] = dummy_df.groupby(dummy_df.Price.isna().cumsum()).Name.transform("first")
dummy_df[dummy_df.Price.notna()]
#      Name  Price     Sport
# 1     Bat    1.0  Baseball
# 2   Glove    2.0  Baseball
# 4   Shoes    3.0    Soccer
# 5    Ball    4.0    Soccer
# 7  Helmet    6.0  Football
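The groupby version works because the running count of NaN prices gives every category row, and the data rows under it, the same group number. A sketch that exposes that intermediate step is below; `group_id`, `sport`, and `result` are illustrative names, and `assign` is used here instead of in-place assignment so that `dummy_df` is not modified.

```python
import pandas as pd

gear = [('Baseball', None), ('Bat', 1), ('Glove', 2), ('Soccer', None),
        ('Shoes', 3), ('Ball', 4), ('Football', None), ('Helmet', 6)]
dummy_df = pd.DataFrame(gear, columns=['Name', 'Price'])

# Each category row (NaN price) starts a new group: 1, 1, 1, 2, 2, 2, 3, 3.
group_id = dummy_df['Price'].isna().cumsum()
# The first Name in each group is the category row itself, i.e. the sport.
sport = dummy_df.groupby(group_id)['Name'].transform('first')
# assign() returns a copy; then keep only the data rows.
result = dummy_df.assign(Sport=sport)[dummy_df['Price'].notna()]
```

One caveat: this assumes the data starts with a category row. If data rows came before the first sport, they would form group 0 and pick up the name of the first data row instead of a sport.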
