I have a dataframe in which several columns share the same set of categorical values, and I'd like to convert these values into binary (0/1) indicator columns. I have been trying to use pd.get_dummies() for this, but I end up with many redundant columns (e.g. Color1_green and Color2_green).
An example of my input dataframe:
        User Color1  Color2 Color3
0  Username1  green     red   blue
1  Username2    red    blue    NaN
2  Username3  green  yellow    NaN
As you can see, Color1, Color2 and Color3 hold the same possible values, and a value never repeats within a row (so if Color1 is red, Color2 cannot be red).
I want to one-hot encode these three color columns together, to get the following result:
        User  green  red  blue  yellow
0  Username1      1    1     1       0
1  Username2      0    1     1       0
2  Username3      1    0     0       1
Is there some way to do this type of one-hot encoding using pandas?
CodePudding user response:
You can stack the color columns, apply get_dummies, and aggregate with max:
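import pandas as pd

# Stack the Color columns into one Series, encode it, then take the per-row max.
# dtype=int yields 0/1 (pandas 2.0 and later default to booleans).
out = df[['User']].join(
    pd.get_dummies(df.filter(like='Color').stack(), dtype=int)
      .groupby(level=0).max()
)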
Output:
        User  blue  green  red  yellow
0  Username1     1      1    1       0
1  Username2     1      0    1       0
2  Username3     0      1    0       1
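For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same stack / get_dummies / max approach. It rebuilds the example dataframe from the question, and it makes a few assumptions that are not in the original answer: missing values are dropped explicitly with dropna(), dtype=int is passed so the result holds 0/1 rather than booleans, and the columns are reordered by hand to match the order shown in the question.

```python
import numpy as np
import pandas as pd

# Example input reconstructed from the question.
df = pd.DataFrame({
    "User": ["Username1", "Username2", "Username3"],
    "Color1": ["green", "red", "green"],
    "Color2": ["red", "blue", "yellow"],
    "Color3": ["blue", np.nan, np.nan],
})

# Stack the color columns into one long Series indexed by (row, column).
# Missing values are dropped explicitly so they never become a dummy column.
colors = df.filter(like="Color").stack().dropna()

# One indicator column per distinct color, then collapse back to one row
# per original index label by taking the maximum of each indicator.
dummies = pd.get_dummies(colors, dtype=int).groupby(level=0).max()

out = df[["User"]].join(dummies)

# get_dummies orders the columns alphabetically; reorder them by hand
# to match the layout requested in the question.
out = out[["User", "green", "red", "blue", "yellow"]]
print(out)
```

The max aggregation works here because each color appears at most once per row, so every indicator is either 0 or 1 after grouping.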
