Why do we drop the target/label before splitting data into train and test sets? - CodePudding

Why do we drop the target/label column before splitting the data into train and test sets? For example, in the code below:

# Features: every column except the target
X = df.drop('Scaled sound pressure level', axis=1)
# Target: the column we want to predict
y = df['Scaled sound pressure level']

# Split the data

from sklearn.model_selection import train_test_split

# 80/20 split, fixing the seed so the results are reproducible

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2021)

CodePudding user response：

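Actually, it's not compulsory to do that. You can pass the whole dataframe, target column included, and the function will return a train dataframe and a test dataframe. You can then pull the independent (feature) and dependent (target) columns out of each one. Separating X and y first is just a convenience: train_test_split() accepts several arrays and splits them on the same rows, so X and y stay aligned, and scikit-learn estimators expect the features and the target as separate arguments to fit(). Either way works fine for regression datasets.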
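As a minimal sketch of the two routes described above: the dataframe below is made-up toy data, and the feature column names `feature_a` and `feature_b` are placeholders, not columns from the question's dataset. Only the target column name is taken from the question. With the same `test_size` and `random_state`, both routes should select the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values) standing in for the question's dataframe
target = "Scaled sound pressure level"
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "feature_b": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    target: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Route 1: split the whole dataframe, then separate features and target
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2021)
X_train, y_train = train_df.drop(target, axis=1), train_df[target]
X_test, y_test = test_df.drop(target, axis=1), test_df[target]

# Route 2: separate features and target first, then split both together
X = df.drop(target, axis=1)
y = df[target]
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=2021
)

# Both routes pick the same rows because the seed and sizes are identical
print(X_train.equals(X_train2), y_test.equals(y_test2))
```

Which route to use is mostly a matter of taste. Splitting X and y in one call saves the extra drop step afterwards.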

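The same approach also works for classification tasks. There, however, we usually want the target classes to appear in the same proportions in both the train and test sets. To get that, pass the target values to the 'stratify' parameter of train_test_split(). This works whether you split the whole dataframe or X and y separately.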
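A small sketch of a stratified split on the whole dataframe, assuming a made-up binary `label` column with a 75/25 class balance (the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy classification data (hypothetical): 15 rows of class 0, 5 rows of class 1
df = pd.DataFrame({
    "feature": range(20),
    "label": [0] * 15 + [1] * 5,
})

# Pass the target column to `stratify` so both parts keep the 75/25 balance
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=2021, stratify=df["label"]
)

# Class counts in each part: 12 vs 4 in train, 3 vs 1 in test
print(train_df["label"].value_counts().sort_index())
print(test_df["label"].value_counts().sort_index())
```

Without `stratify`, a small test set like this one could end up with no rows of the minority class at all.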