I am working on a multi-label classification problem where my large-scale data is highly imbalanced. I need to apply stratified sampling so that my ImageDataGenerator samples data proportionally from each class in every batch. Any suggestion/solution will be highly appreciated.
CodePudding user response:
Good question indeed.
To the best of my knowledge, there is no built-in multi-label stratification in ImageDataGenerator().
I will suggest two possible approaches:
1. You could subclass the Keras Sequence class to control exactly what you feed to the network at each step. You could override the __getitem__() method and make sure that every batch is sampled proportionally.
2. You could use an external library to stratify your data before you feed it to your network, and then use a tf.data.Dataset pipeline to feed the preprocessed data to the network.
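The first approach could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name StratifiedBatchSequence and its parameters are hypothetical, and it is written in plain NumPy so that it runs without TensorFlow. For Keras, the class would inherit from tf.keras.utils.Sequence and keep the same __len__/__getitem__ interface. It also assumes that stratifying on the full label row (the label combination) is acceptable, which is a simpler technique than true iterative multi-label stratification.

```python
import math
import numpy as np


class StratifiedBatchSequence:
    """Sequence-style batch generator (hypothetical name, plain NumPy).

    Every batch receives a near-proportional share of each distinct
    label combination. For Keras, inherit from tf.keras.utils.Sequence.
    """

    def __init__(self, x, y, batch_size, seed=0):
        self.x, self.y = np.asarray(x), np.asarray(y)
        self.n_batches = math.ceil(len(self.x) / batch_size)
        self.rng = np.random.default_rng(seed)
        self.on_epoch_end()  # build the first epoch's batches

    def __len__(self):
        return self.n_batches

    def __getitem__(self, idx):
        batch_idx = self.batches[idx]
        return self.x[batch_idx], self.y[batch_idx]

    def on_epoch_end(self):
        # One stratum per distinct label row (label combination).
        _, strata = np.unique(self.y, axis=0, return_inverse=True)
        strata = strata.reshape(-1)
        # Shuffle first, then stable-sort by stratum: each stratum ends up
        # contiguous, but its members are in random order.
        perm = self.rng.permutation(len(self.y))
        order = perm[np.argsort(strata[perm], kind="stable")]
        # Deal the samples out round-robin, so each batch gets either
        # floor or ceil of (stratum size / number of batches) per stratum.
        self.batches = [order[b::self.n_batches] for b in range(self.n_batches)]


# Toy usage: 8 samples, 4 label combinations with 2 samples each.
X = np.arange(16).reshape(8, 2)
y = np.array([[0, 0], [0, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 0], [1, 0]])
seq = StratifiedBatchSequence(X, y, batch_size=4)
for i in range(len(seq)):
    xb, yb = seq[i]
    print("batch", i, "positives per label:", yb.sum(axis=0))
```

When there are many labels, most label combinations may occur only once, and this grouping then degrades towards plain shuffling. In that case an iterative stratification library is likely the better choice for building the batches.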
An example for (2), using the iterative-stratification package (imported as iterstrat), is this one:
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

# Toy data: 8 samples, 2 features, 2 binary labels per sample
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

# Two folds, stratified on the multi-label targets
mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
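To check that a split is actually stratified, one option is to compare the per-label positive rate of each part with that of the full dataset. The sketch below is only an illustration: the helper label_rates is a hypothetical name, and the index arrays are written by hand rather than taken from a real run of the stratifier.

```python
import numpy as np


def label_rates(y):
    """Fraction of positive samples for each label (column)."""
    return np.asarray(y).mean(axis=0)


y = np.array([[0, 0], [0, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 0], [1, 0]])

# Hypothetical fold indices, chosen by hand for illustration only.
train_index = np.array([0, 2, 4, 6])
test_index = np.array([1, 3, 5, 7])

overall = label_rates(y)
train_rates = label_rates(y[train_index])
test_rates = label_rates(y[test_index])

# Largest deviation of any label's rate from the overall rate.
max_gap = max(
    np.abs(train_rates - overall).max(),
    np.abs(test_rates - overall).max(),
)
print("overall:", overall, "train:", train_rates, "test:", test_rates)
print("max gap:", max_gap)
```

A small gap for every label suggests that the split preserves the label distribution. On highly imbalanced data, rare labels are the ones worth inspecting first.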
