I have a dataset that looks like this
Day1 Day2 Day3 Day4 ID Product Price
Day2 Day4 2 X 50
Day1 Day3 Day4 3 Y 60
Day2 Day3 4 Z 70
I want it to be aggregated in this way
Days ID Product Price
Day2&Day4 2 X 50
Day1&Day3&Day4 3 Y 60
Day2&Day3 4 Z 70
I am finding it difficult to crack the logic behind this
Raw input:
df = pd.DataFrame({'Day1': ['', 'Day1', ''],
'Day2': ['Day2', '', 'Day2'],
'Day3': ['', 'Day3', 'Day3'],
'Day4': ['Day4', 'Day4', ''],
'ID': [2, 3, 4],
'Product': ['X', 'Y', 'Z'],
'Price': [50, 60, 70]})
CodePudding user response:
You can convert the relevant columns to a list of lists and use list comprehension where in each iteration, you filter out empty spaces and join the remaining items with & in between. Then assign this list of strings to 'Days' column. Finally, drop the redundant columns and rearrange remaining columns in your preferred order.
cols = ['Day1', 'Day2', 'Day3', 'Day4']
df['Days'] = ['&'.join(x for x in lst if x!='') for lst in df[cols].to_numpy().tolist()]
df = df.drop(cols, axis=1)[['Days','ID','Product','Price']]
Output:
Days ID Product Price
0 Day2&Day4 2 X 50
1 Day1&Day3&Day4 3 Y 60
2 Day2&Day3 4 Z 70
CodePudding user response:
NB. Assuming your empty cells are 'NaN', if not you can use df.replace({'': float('nan')}) as a first step
You could stack and groupby apply('&'.join):
df['Days'] = df.filter(like='Day').stack().groupby(level=0).apply('&'.join)
The stacking operation gets rid of the empty cells
output:
Day1 Day2 Day3 Day4 ID Product Price Days
0 NaN Day2 NaN Day4 2 X 50 Day2&Day4
1 Day1 NaN Day3 Day4 3 Y 6 Day1&Day3&Day4
2 NaN Day2 Day3 NaN 4 Z 70 Day2&Day3
dropping the original DaysX columns:
cols = list(df.filter(like='Day').columns)
df['Days'] = df[cols].stack().groupby(level=0).apply('&'.join)
df.drop(columns=cols)
output:
ID Product Price Days
0 2 X 50 Day2&Day4
1 3 Y 6 Day1&Day3&Day4
2 4 Z 70 Day2&Day3
CodePudding user response:
I have fully automated version. A bit more code lines, but works. I find all columns that startswith "Day" phrase, then concatinate them, then drop the old columns and reorder columns.
cols = df.columns[pd.Series(df.columns).str.startswith('Day')]
df['Days'] = df[cols].apply('&'.join, axis=1).str.strip('&')
df.drop(columns = cols, inplace = True)
col = df.pop("Days")
df.insert(0, col.name, col)
# To clear double '&' chars.
import re
df['Days'] = [re.sub('& ', '&', day) for day in df['Days']]
