How to use Binary Encoding of Categorical Columns to predict labels in Python?

Time:01-28

I have 2 files test.csv and train.csv. The attribute values are categorical and I am trying to convert them into numerical values.

I am doing the following:

import category_encoders as ce
encoder = ce.BinaryEncoder(cols = 'column_name' , return_df = True)
x_train_data = encoder.fit_transform(x_train_data)

This resulted in a new table with a total of 13 columns. After that, I train a DecisionTreeClassifier on x_train_data and y_train_data.

Finally, I want to predict the labels in test.csv. If I repeat the binary encoding procedure on test.csv, it produces fewer than 13 features, which I think is because the test file has fewer rows. Because the number of columns differs, the decision tree classifier won't work.

So, is there a way to predict? If not, what is the point of BinaryEncoder, given that we train a model so that we can predict on an unseen dataset?

CodePudding user response:

Call only transform() on the test data; do not fit the encoder again. The fitted encoder reuses the mapping it learned from the training data, so the test output has the same columns. Values that don't occur in the training data are encoded as 0 in every binary column of that variable (as long as you don't change the handle_unknown parameter). For example:

import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2": ["A", "A", "A", "A", "B"]})

# Fit the encoder on the training data only
encoder = ce.BinaryEncoder(cols=['var1', 'var2'], return_df=True)
x_train_data = encoder.fit_transform(train)

#   var1_0  var1_1  var2_0  var2_1
#0  0       1       0       1
#1  1       0       0       1
#2  0       1       0       1
#3  1       0       0       1
#4  1       1       1       0

test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)

#   var1_0  var1_1  var2_0  var2_1
#0  1       1       0       1
#1  0       0       0       0
#2  1       0       0       0

'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.
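To see why fitting once and then only transforming keeps the columns aligned, here is a minimal pure-Python sketch of the idea. TinyBinaryEncoder is a hypothetical, simplified stand-in written for illustration; it is not the category_encoders implementation, and it assumes that code 0 is reserved for unknown values.

```python
class TinyBinaryEncoder:
    """Hypothetical, simplified binary encoder for a single column."""

    def fit(self, values):
        # Assign codes 1, 2, 3, ... in order of first appearance.
        # Code 0 is reserved for values never seen during fit (assumption).
        self.codes = {}
        for value in values:
            if value not in self.codes:
                self.codes[value] = len(self.codes) + 1
        # Number of bits needed to write the largest code.
        self.width = max(len(self.codes), 1).bit_length()
        return self

    def transform(self, values):
        rows = []
        for value in values:
            code = self.codes.get(value, 0)  # unknown -> 0 -> all-zero bits
            bits = [(code >> shift) & 1 for shift in range(self.width - 1, -1, -1)]
            rows.append(bits)
        return rows


# Fit on the training column only, then reuse the same mapping for test data.
encoder = TinyBinaryEncoder().fit(["A", "B", "A", "B", "C"])
train_rows = encoder.transform(["A", "B", "A", "B", "C"])
test_rows = encoder.transform(["C", "D", "B"])  # [[1, 1], [0, 0], [1, 0]]

# Refitting on a test column with fewer distinct values changes the width,
# which is the column-count mismatch described in the question.
refit = TinyBinaryEncoder().fit(["B", "B", "B"])
refit_rows = refit.transform(["B", "B", "B"])  # one bit per row, and B is now 1
```

The width and the value-to-code mapping are both fixed at fit time, so every later transform() call yields the same number of columns with the same meaning. Refitting on the test data rebuilds both, which is why the model then sees a different feature layout.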
