I have 15 different datasets (a list of 15 pandas.DataFrame objects) for the same problem, which I would like to study with a single classifier under K-Fold CV. Currently, I am running experiments with the following structure:
# Manual 15-Fold CV
for i in range(len(datasets)):
    train_sets = [datasets[j] for j in range(len(datasets)) if j != i]
    test_set = datasets[i]
    train = pd.concat(train_sets)
    clf = ...
    clf.fit(...)
    ...
As you can see, I need to treat each dataset as one fold of the K-Fold, instead of simply merging all datasets into a single one and running the default cross_val_score() or something similar.
This works well for individual experiments, but I'd like to use GridSearchCV to explore my models more thoroughly. So the question is: is there a way to create a custom KFold that predefines what each fold contains, and pass it to GridSearchCV?
CodePudding user response:
From the GridSearchCV documentation:
cv: int, cross-validation generator or an iterable, default=None
An iterable yielding (train, test) splits as arrays of indices.
So you can merge all the data into a single dataset, keep track of which row positions belong to each original dataset, and build a list of (train, test) tuples from those positions.
For example, with the per-dataset row positions stored in a list of NumPy arrays called indicies:
edit: this is untested but it should work.
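import numpy as np
import pandas as pd

indicies = []        # indicies[j]: row positions of datasets[j] in the merged data
train_test_set = []
last_element = 0
for j in range(len(datasets)):
    train_test_set.append(datasets[j])
    indicies.append(np.arange(last_element, last_element + len(datasets[j])))
    last_element += len(datasets[j])
# Merge in the same order so the positional indices line up
train_test_set = pd.concat(train_test_set, ignore_index=True)

cv_list = []
for i in range(len(datasets)):
    # Train on every dataset except i, test on dataset i
    cv_train = np.hstack([indicies[x] for x in range(len(datasets)) if x != i])
    cv_list.append((cv_train, indicies[i]))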
Then pass cv_list as the cv argument of GridSearchCV and fit it on the merged dataset.
Edit2: fixed typo in code.
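An alternative that avoids building the index list by hand: scikit-learn's LeaveOneGroupOut splitter produces the same kind of (train, test) splits if you label each row with the dataset it came from. This is a different route from the one above, not a replacement for it. The sketch below is only an illustration: the synthetic data, the column names f0, f1 and label, the LogisticRegression classifier and the parameter grid are all placeholders I made up, so swap in your own DataFrames and model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# Synthetic stand-in for the 15 DataFrames (hypothetical columns f0, f1, label).
rng = np.random.default_rng(0)
datasets = []
for n_rows in [20, 25, 30] * 5:  # 15 datasets of varying size
    frame = pd.DataFrame(rng.normal(size=(n_rows, 2)), columns=["f0", "f1"])
    noise = rng.normal(scale=0.5, size=n_rows)
    frame["label"] = (frame["f0"] + noise > 0).astype(int)
    datasets.append(frame)

# Merge everything; ignore_index gives a clean 0..N-1 positional index.
merged = pd.concat(datasets, ignore_index=True)
X = merged[["f0", "f1"]].to_numpy()
y = merged["label"].to_numpy()

# groups[k] is the number of the dataset that row k came from.
groups = np.repeat(np.arange(len(datasets)), [len(d) for d in datasets])

# Each fold holds out exactly one group, i.e. one original dataset.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),       # placeholder classifier
    param_grid={"C": [0.1, 1.0, 10.0]},      # placeholder grid
    cv=LeaveOneGroupOut(),
)
search.fit(X, y, groups=groups)

# The same splits as explicit (train, test) index arrays, for inspection.
splits = list(LeaveOneGroupOut().split(X, y, groups))
print(search.best_params_, len(splits))
```

The groups array has to be passed to fit(), not to the GridSearchCV constructor. If you prefer the explicit list, list(LeaveOneGroupOut().split(X, y, groups)) can be passed as cv in the same way as cv_list.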
