Using OrdinalEncoder() to transform columns with predefined numerical values

I have a dataframe like this:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'department': ['operations','operations','support','logics', 'sales'],
                   'salary': ["low", "medium", "medium", "high", "high"],
                   'tenure': [5,6,6,8,5],
                  })
df


   department  salary  tenure
0  operations     low       5
1  operations  medium       6
2     support  medium       6
3      logics    high       8
4       sales    high       5

I want to encode the salary feature as ['low', 1], ['medium', 2], ['high', 3], or alternatively as ['low', 0], ['medium', 1], ['high', 2]. I'm not sure whether the exact values make a difference for later use in a classification algorithm such as logistic regression in scikit-learn.

However, I am not getting them ordered correctly after applying OrdinalEncoder(): where the salary is 'high' I am getting 0, while it should be 2.

oe = OrdinalEncoder()
df[["salary"]] = oe.fit_transform(df[["salary"]])
df

    department  salary  tenure
0   operations  1.0     5
1   operations  2.0     6
2   support     2.0     6
3   logics      0.0     8
4   sales       0.0     5

I know that I can use df["salary"] = df["salary"].replace(0,3) to patch the result afterwards, but I'm hoping someone can suggest a more direct way to do it. Thank you.

CodePudding user response：

You can stay within pandas and use factorize, which numbers the labels in order of first appearance:

df['new'] = df.salary.factorize()[0]
# df['new'] now holds [0, 1, 1, 2, 2]

CodePudding user response：

As @BENY says, you can stay in pandas and do what you want. factorize is great if "low" appears first, "medium" second and "high" third in the data (as shown in your example). If that's not the case, factorize may not produce what you want.
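To illustrate that order dependence, here is a small sketch. The `shuffled` Series is made up for this example and is not part of the original data; it only shows that the codes follow first appearance rather than low < medium < high.

```python
import pandas as pd

# Hypothetical data in which "high" happens to appear before "low"
shuffled = pd.Series(["high", "low", "medium", "low"])

# factorize returns (codes, uniques); codes are assigned in order of first appearance
codes, uniques = shuffled.factorize()

print(codes)          # "high" gets code 0 here because it appears first
print(list(uniques))  # the labels in the order they were first seen
```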

A possible solution is to create a dictionary that maps salary levels to numbers and use map:

mapper = {'low': 1, 'medium': 2, 'high': 3}
df['salary'] = df['salary'].map(mapper)

Output:

   department  salary  tenure
0  operations       1       5
1  operations       2       6
2     support       2       6
3      logics       3       8
4       sales       3       5
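One thing to watch with map: any label missing from the dictionary becomes NaN instead of raising an error. A minimal sketch of a sanity check, assuming a hypothetical column that contains an unexpected label ("very high" is invented for illustration):

```python
import pandas as pd

mapper = {"low": 1, "medium": 2, "high": 3}

# Hypothetical column with a label that the mapper does not cover
salary = pd.Series(["low", "medium", "very high"])

encoded = salary.map(mapper)

# Labels that were not found in the mapper show up as NaN after map
unmapped = salary[encoded.isna()].unique()
print(encoded.tolist())
print(list(unmapped))
```

If `unmapped` is not empty, the dictionary needs extending before the column is passed to a model.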

CodePudding user response：

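By default, OrdinalEncoder sorts the categories it finds (alphabetically for strings), which is why 'high' ends up as 0. If you want to perform this operation using OrdinalEncoder, you can use the categories parameter to specify the ordering.

As follows: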

OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(df[['salary']])

Output:

array([[0.],
       [1.],
       [1.],
       [2.],
       [2.]])
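Putting it together, a sketch that compares the default ordering with the explicit one and writes the result back to the dataframe. The column name `salary_encoded` is chosen here for illustration, and the integer `dtype` is optional (the encoder returns floats by default).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"salary": ["low", "medium", "medium", "high", "high"]})

# Default behaviour: categories are inferred from the data and sorted
default_oe = OrdinalEncoder().fit(df[["salary"]])
print(default_oe.categories_)  # sorted order, so "high" comes first

# Explicit ordering: one list of categories per encoded column
oe = OrdinalEncoder(categories=[["low", "medium", "high"]], dtype=np.int64)

# fit_transform returns a 2D array; flatten it to assign a single column
df["salary_encoded"] = np.asarray(oe.fit_transform(df[["salary"]])).ravel()
print(df)
```

Keeping the fitted `oe` object around also lets you call `oe.inverse_transform` later to recover the original labels.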