Hi there, I am working on a machine learning problem and have applied a random forest classifier.
I have split my data into 3 sets: train, validation and test. When I evaluate on the validation set I get F1, recall, precision and accuracy of nearly 1, but when I evaluate on my test set (not involved in training) I get around 0.5 for all the metrics.
I have tried under- and oversampling and K-fold cross-validation, but these have not helped with my problem. I have also tried Naive Bayes and linear regression, but these only make it worse.
Does anyone have any suggestions for how I can improve this score, or would this be acceptable?
The aim of the model is to classify each sample as class 0 or 1.
Thanks
CodePudding user response:
This problem usually occurs when you have overfitted the validation set. Although the model is fitted on the training set, the model you keep is the one that performs best on the validation set, so after enough rounds of tuning the validation score is no longer an unbiased estimate. Another possible cause is that the test set is distributed differently from the train and validation sets. Did you shuffle the data before splitting it? If the splits were already provided, there is not much you can do about that, assuming you have done the training properly.
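To make the shuffling point concrete, here is a minimal sketch of a shuffled, stratified three-way split with scikit-learn's `train_test_split`. The helper name `split_train_val_test`, the 60/20/20 proportions, and the synthetic data are assumptions for illustration only, not the asker's setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_train_val_test(X, y, val_size=0.2, test_size=0.2, seed=0):
    """Shuffle and split into train/validation/test, keeping class ratios similar."""
    # Carve off the test set first; stratify=y keeps the 0/1 ratio close to the full data.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, stratify=y, random_state=seed
    )
    # val_size is a fraction of the full data, so rescale it to the remainder.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, shuffle=True, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic placeholder data (assumption): 1000 rows, roughly 30% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.3).astype(int)

train, val, test = split_train_val_test(X, y)
for name, (_, labels) in [("train", train), ("val", val), ("test", test)]:
    # Similar positive rates across splits suggest the splits are comparable.
    print(name, len(labels), round(float(labels.mean()), 3))
```

If the positive rate (or any feature distribution) differs sharply between the three splits, that alone could explain a large gap between validation and test scores.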
CodePudding user response:
You are most likely overfitting, but the gap can also come from a poor choice of model for the task, or from biased data. If you have poor data, your model might struggle to learn the actual relationship between the features and the target. You might get good accuracy during training, but a significant drop when testing the model (from 90% during training to below 60%, for instance) means that you are overfitting.
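One way to see this gap directly is to score the same model on the data it was trained on and on held-out data. The sketch below does that on synthetic, deliberately noisy data (an assumption, as are the specific `max_depth` and `min_samples_leaf` values), comparing a fully grown forest with a constrained one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); flip_y injects label noise the forest can memorize.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Default settings grow each tree until its leaves are pure.
    "unconstrained": RandomForestClassifier(n_estimators=100, random_state=0),
    # Shallower trees with larger leaves cannot memorize individual noisy rows.
    "constrained": RandomForestClassifier(
        n_estimators=100, max_depth=5, min_samples_leaf=20, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": f1_score(y_train, model.predict(X_train)),
        "test": f1_score(y_test, model.predict(X_test)),
    }
    gap = scores[name]["train"] - scores[name]["test"]
    print(f"{name}: train F1={scores[name]['train']:.3f} "
          f"test F1={scores[name]['test']:.3f} gap={gap:.3f}")
```

A large train/test gap points to overfitting. A test score near 0.5 on a binary problem is close to chance, though, so it is also worth ruling out data problems before tuning the model further.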
I would suggest taking a closer look at your data first: plot everything and check whether anything looks weird. Don't forget to remove the target (y) from X. Also check the other features you have; some might be useless to the model and just make the learning process harder. You should make sure the data is balanced and clean as well. If you notice heavy imbalance in the dataset, try resampling it so that the class split is no more skewed than 80/20 (assuming your y variable contains 2 values).
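Two of these checks are easy to script. The sketch below counts the classes and then fits a forest on a matrix where the target has been left in X on purpose, to show what that mistake looks like in the feature importances. The data is synthetic, and the names `features`, `X_leaky` and `leak_idx` are hypothetical. `class_weight="balanced"` is used here as an alternative to resampling, not as the method suggested above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data (assumption): 4 features, only the first weakly related to y.
rng = np.random.default_rng(0)
n = 2000
features = rng.normal(size=(n, 4))
y = (features[:, 0] + 2.0 * rng.normal(size=n) > 1.9).astype(int)

# Check 1: class balance. A very small minority share signals heavy imbalance.
counts = np.bincount(y, minlength=2)
minority_share = counts.min() / counts.sum()
print("class counts:", counts.tolist(), "minority share:", round(float(minority_share), 3))

# Check 2: target leakage. Append y as an extra column on purpose to mimic
# forgetting to drop the target from X.
X_leaky = np.column_stack([features, y])
leak_idx = X_leaky.shape[1] - 1

# class_weight="balanced" reweights classes instead of resampling (swapped-in alternative).
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
model.fit(X_leaky, y)

importances = model.feature_importances_
for idx, value in enumerate(importances):
    marker = "  <- leaked target" if idx == leak_idx else ""
    print(f"feature {idx}: importance={value:.3f}{marker}")
```

A single feature that dominates the importances, together with near-perfect scores, is a hint that the target or a proxy for it is still present in X. Importances close to zero point to features that may be useless to the model.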
