I have a dataframe like this:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'department': ['operations', 'operations', 'support', 'logics', 'sales'],
                   'salary': ["low", "medium", "medium", "high", "high"],
                   'tenure': [5, 6, 6, 8, 5],
                   })
df
   department  salary  tenure
0  operations     low       5
1  operations  medium       6
2     support  medium       6
3      logics    high       8
4       sales    high       5
I want to encode the salary feature as ['low', 1], ['medium', 2], ['high', 3], or alternatively as ['low', 0], ['medium', 1], ['high', 2]. I'm not sure whether the exact values make a difference for later use in a classification algorithm such as logistic regression in scikit-learn.
However, the values are not ordered correctly after applying OrdinalEncoder(): where the salary is 'high' I get 0, while it should be 2.
oe = OrdinalEncoder()
df[["salary"]] = oe.fit_transform(df[["salary"]])
df
   department  salary  tenure
0  operations     1.0       5
1  operations     2.0       6
2     support     2.0       6
3      logics     0.0       8
4       sales     0.0       5
I know that I can use df["salary"] = df["salary"].replace(0, 3), but I'm hoping someone can suggest a more direct way to do it. Thank you.
CodePudding user response:
You can stay within pandas and use factorize:
df['new'] = df.salary.factorize()[0]
#Out[276]: array([0, 1, 1, 2, 2], dtype=int64)
CodePudding user response:
As @BENY says, you can stay in pandas and do what you want. factorize is great if "low" appears first, "medium" second and "high" third in the data (as shown in your example). If that's not the case, factorize may not produce what you want.
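To illustrate the caveat, here is a minimal sketch with made-up data (the `shuffled` series is hypothetical, not from the question) in which "high" appears first. factorize numbers values in order of first appearance, so the codes no longer follow low < medium < high:

```python
import pandas as pd

# Hypothetical data where "high" appears first, unlike the question's example.
shuffled = pd.Series(["high", "low", "medium", "low"])

# factorize assigns codes in order of first appearance, not by meaning.
codes, uniques = shuffled.factorize()

print(codes.tolist())   # [0, 1, 2, 1]
print(list(uniques))    # ['high', 'low', 'medium']
```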
A possible solution is to create a dictionary that maps salary levels to numbers and use map:
mapper = {'low': 1, 'medium': 2, 'high': 3}
df['salary'] = df['salary'].map(mapper)
Output:
   department  salary  tenure
0  operations       1       5
1  operations       2       6
2     support       2       6
3      logics       3       8
4       sales       3       5
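One thing worth checking with this approach: map returns NaN for any value that is not a key in the dictionary. The sketch below uses a hypothetical series containing a differently capitalised "Medium" to show how such values could be spotted after mapping:

```python
import pandas as pd

mapper = {"low": 1, "medium": 2, "high": 3}

# Hypothetical data: "Medium" (capitalised) is not a key in mapper.
salary = pd.Series(["low", "medium", "Medium", "high"])

encoded = salary.map(mapper)

# Values missing from the dictionary come back as NaN.
unmapped = salary[encoded.isna()].unique().tolist()

print(encoded.tolist())  # [1.0, 2.0, nan, 3.0]
print(unmapped)          # ['Medium']
```

If `unmapped` is non-empty, the dictionary probably needs more keys, or the column needs cleaning (for example with .str.lower()) before mapping.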
CodePudding user response:
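If you want to perform this operation using OrdinalEncoder, you can use the categories parameter to specify the ordering. By default, OrdinalEncoder sorts the categories it finds in the data (alphabetically for strings), which is why 'high' was encoded as 0.
For example: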
OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(df[['salary']])
Output:
array([[0.],
       [1.],
       [1.],
       [2.],
       [2.]])
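As a rough sketch of how this could be put together, the snippet below rebuilds only the salary column from the question, inspects the fitted categories_ attribute of a default encoder, and then writes the explicitly ordered encoding back into the dataframe. The variable names default_oe and ordered_oe are just illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"salary": ["low", "medium", "medium", "high", "high"]})

# With the default categories='auto', the learned categories are sorted,
# so 'high' ends up at index 0.
default_oe = OrdinalEncoder().fit(df[["salary"]])
print(default_oe.categories_[0].tolist())  # ['high', 'low', 'medium']

# Passing the categories explicitly fixes the order: low=0, medium=1, high=2.
ordered_oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["salary"] = ordered_oe.fit_transform(df[["salary"]]).ravel()
print(df["salary"].tolist())  # [0.0, 1.0, 1.0, 2.0, 2.0]
```

The categories argument is a list of lists, with one inner list per column being encoded, which is why the three levels are wrapped in an extra pair of brackets.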
