Encoding categorical data issue

I am building a house price regression model with data from Kaggle. When I converted the categorical variables to dummy variables, I got something unusual: the training data ended up with a shape of (1460, 270).

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns (selected by index); pass the other columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,4,5,6,7,8,9,10,11,12,13,14,
19,20,21,22,23,25,26,27,28,29,30,31,
33,37,38,39,40,51,53,55,57,60,61,62,72,73
])], remainder='passthrough',sparse_threshold=0)
X = np.array(ct.fit_transform(X))

X.shape
(1460, 270)

When I did the same for the test data, I got a shape of (1459, 254).

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,4,5,6,7,8,9,10,11,12,13,14,
19,20,21,22,23,25,26,27,28,29,30,31,
33,37,38,39,40,51,53,55,57,60,61,62,72,73
])], remainder='passthrough',sparse_threshold=0)
X_actual = np.array(ct.fit_transform(X_actual))

X_actual.shape
(1459, 254)

Why is the shape different after converting the same categorical variables in the training and the test data?

Note that before converting the categorical variables there was no issue: both data sets had the same number of columns. The shapes only diverged after I converted the same categorical variables in both data sets.

I tried a workaround, but I do not like it much: I resized X_actual to (1459, 270). The algorithm runs fine, but I think this hurts its quality too much. If a shape mismatch is normal when converting categorical variables in two different data sets, how can I make both data sets have the same dimensions?

CodePudding user response:

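You are re-fitting OneHotEncoder on the test data X_actual by calling fit_transform again. The encoder then learns its categories from X_actual alone, which apparently contains fewer distinct categories than X (and possibly some new ones), so it produces a different number of columns. Instead, fit the transformer once on the training data X and call ct.transform(X_actual) on the test data. This preserves all the categories found in X, and hence the feature shape.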
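To make the difference concrete, here is a minimal sketch on toy data. The arrays X_train and X_test and the helper make_ct are made up for illustration and are not the Kaggle data from the question; only the fit-once, transform-later pattern carries over.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is categorical, column 1 is numeric.
X_train = np.array([["a", 1.0], ["b", 2.0], ["c", 3.0]], dtype=object)
X_test = np.array([["a", 4.0], ["b", 5.0]], dtype=object)  # category "c" never appears


def make_ct():
    # Illustrative helper: same setup as in the question, with a single categorical column.
    return ColumnTransformer(
        transformers=[("encoder", OneHotEncoder(), [0])],
        remainder="passthrough",
        sparse_threshold=0,
    )


ct = make_ct()
X_train_enc = ct.fit_transform(X_train)  # learn the categories from the training data
X_test_enc = ct.transform(X_test)        # reuse them on the test data, no second fit

# For contrast: fitting a fresh transformer on the test data learns only "a" and "b".
refit_shape = make_ct().fit_transform(X_test).shape

print(X_train_enc.shape)  # (3, 4): three one-hot columns plus the numeric column
print(X_test_enc.shape)   # (2, 4): same columns as the training data
print(refit_shape)        # (2, 3): one column short, as in the question
```

Applied to the question, this would mean keeping the ct that was fitted on X and replacing the second fit_transform with ct.transform(X_actual).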

Note that OneHotEncoder, as of scikit-learn 1.0.2 (the current version), uses handle_unknown='error' by default and will raise an error if the test data contains a category that was not seen during fitting. Set handle_unknown='ignore' to suppress the error; an unknown category is then encoded as all zeros in that feature's one-hot columns.
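A small sketch of the two handle_unknown settings, again on made-up data (the colour values below are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column; "purple" occurs only in the test data.
train_col = np.array([["red"], ["green"], ["blue"]], dtype=object)
test_col = np.array([["green"], ["purple"]], dtype=object)

# Default behaviour (handle_unknown='error'): transforming an unseen category fails.
strict = OneHotEncoder().fit(train_col)
try:
    strict.transform(test_col)
    raised = False
except ValueError:
    raised = True

# With handle_unknown='ignore', the unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_col)
encoded = lenient.transform(test_col).toarray()

print(raised)                  # True
print(lenient.categories_[0])  # categories learned from the training column, sorted
print(encoded)                 # "green" -> [0, 1, 0], "purple" -> [0, 0, 0]
```

Either way the number of output columns is fixed by the training data, so the test matrix keeps the same width.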