Trouble changing imputer strategy in scikit-learn pipeline

Time:01-27

I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.

First, I have a data preparation pipeline for numerical and categorical columns:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'), 
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

Next, I have created a pipeline to train a support vector machine model with feature selection.

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])

Now, I am trying to see which imputer strategy is best for imputing missing values in the numerical columns using GridSearchCV:

from sklearn.model_selection import GridSearchCV

grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy": 
        ['mean','median','most_frequent']}
grid_search = GridSearchCV(model, param_grid = grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

This is where I am getting the error. The full pipeline looks like this:

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])

Can anyone tell me what I need to change in the grid search to make it work?

CodePudding user response:

You specify the parameters to search as a dictionary that maps each parameter name to the list of values you want to try. For a nested estimator, such as a pipeline inside a pipeline, the parameter name is built from the names of all its parent steps followed by the parameter itself, joined with double underscores. So for your case, it looks like:

grid = {
    "preprocess__num__simpleimputer__strategy": ['mean', 'median', 'most_frequent']
}

simpleimputer is simply the name that was automatically assigned by make_pipeline.
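
If you are ever unsure what the full name is, you can list the keys of model.get_params() and copy the one you need. Here is a minimal sketch of that, followed by the grid search itself. It assumes a stripped-down pipeline (numeric columns only, no feature selection) and synthetic data with column indices 0 to 2 rather than your housing data, so treat it as an illustration of the naming scheme and not as your exact setup.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the data from the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the entries

# Same nesting as in the question: Pipeline -> ColumnTransformer -> Pipeline.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
preprocessing = ColumnTransformer([("num", num_pipe, [0, 1, 2])])
model = Pipeline([("preprocess", preprocessing), ("regressor", SVR())])

# Every name that a grid search accepts is a key of get_params().
strategy_keys = [k for k in model.get_params() if k.endswith("__strategy")]
print(strategy_keys)

param_grid = {
    "preprocess__num__simpleimputer__strategy": ["mean", "median", "most_frequent"]
}
grid_search = GridSearchCV(
    model, param_grid=param_grid, cv=3, scoring="neg_mean_squared_error"
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

The key is read left to right: the "preprocess" step of the outer pipeline, the "num" transformer inside the ColumnTransformer, the "simpleimputer" step of the inner pipeline, and finally its strategy parameter.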

However, I think there may be other issues in your code. For example, fill_value='NA' may not do what you intend: fill_value is not the marker that identifies missing entries (that is missing_values) but the value that is written into them when strategy='constant'.
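
To make the difference between the two SimpleImputer parameters concrete, here is a small sketch on a made-up one-column array. The category strings and the 'UNKNOWN' replacement are only illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# fill_value: the value written into entries that are missing (np.nan by default).
X_nan = np.array([["INLAND"], [np.nan], ["NEAR BAY"]], dtype=object)
fill_imputer = SimpleImputer(strategy="constant", fill_value="NA")
filled = fill_imputer.fit_transform(X_nan)
print(filled.ravel().tolist())

# missing_values: the marker that tells the imputer which entries count as missing.
X_marked = np.array([["INLAND"], ["NA"], ["NEAR BAY"]], dtype=object)
marker_imputer = SimpleImputer(
    missing_values="NA", strategy="constant", fill_value="UNKNOWN"
)
replaced = marker_imputer.fit_transform(X_marked)
print(replaced.ravel().tolist())
```

So if your data already uses the string 'NA' to mark missing categories, missing_values is the parameter to set; if the gaps are NaN and you just want a placeholder category, strategy='constant' with a fill_value is enough.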
