I have two files, test.csv and train.csv. The attribute values are categorical, and I am trying to convert them into numerical values.
I am doing the following:
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['column_name'], return_df=True)
x_train_data = encoder.fit_transform(x_train_data)
This resulted in a new table with a total of 13 columns.
After that, I train my DecisionTreeClassifier on x_train_data and y_train_data.
Finally, I want to predict the Labels in test.csv.
If I repeat the BinaryEncoder procedure on test.csv, it produces fewer than 13 features, which I think is due to the smaller number of rows.
Due to the difference in total columns, the decision tree classifier won't work.
So, is there a way to predict? And if not then what is the point of Binary Encoder? Since I assume we train a model so that we can predict on an unknown dataset.
CodePudding user response:
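Call only transform() on the test data; do not fit the encoder again. The encoder then reuses the mapping it learned from the training data, so the test output has the same columns as the training output. Values that do not occur in the training data are encoded as 0 in every column for that variable (as long as you leave the handle_unknown parameter at its default). For example: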
import pandas as pd
import category_encoders as ce
train = pd.DataFrame({"var1": ["A", "B", "A", "B", "C"], "var2": ["A", "A", "A", "A", "B"]})
encoder = ce.BinaryEncoder(cols=['var1', 'var2'], return_df=True)
# Fit on the training data only; this learns the category-to-code mapping
x_train_data = encoder.fit_transform(train)
#    var1_0  var1_1  var2_0  var2_1
# 0       0       1       0       1
# 1       1       0       0       1
# 2       0       1       0       1
# 3       1       0       0       1
# 4       1       1       1       0
test = pd.DataFrame({"var1": ["C", "D", "B"], "var2":["A", "C", "F"]})
x_test_data = encoder.transform(test)
#    var1_0  var1_1  var2_0  var2_1
# 0       1       1       0       1
# 1       0       0       0       0
# 2       1       0       0       0
'D' doesn't occur in var1 in the training data, so it was encoded as 0 0. 'C' and 'F' don't occur in var2 in the training data, so they were both encoded as 0 0.
