For context, I am using the caret library to train several models on a data set and predict with them. What is the difference between setting the seed once at the beginning of an R script and setting it before each training and prediction step?
Thanks, Manel
CodePudding user response:
Setting the seed makes the RNG outputs that follow it repeatable. So if something strange happens and you want to debug it, setting the seed lets you see it happen again. If you set it once at the beginning, you have to rerun the whole sequence to get back to the same point; if you set it in several places, you only have to rerun from the nearest seed. So during debugging, it may make sense to set the seed just before the part you want to examine.
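A minimal sketch of that debugging trade-off, using base R only. The functions step_a and step_b are hypothetical stand-ins for a training step and a prediction step, and the seed values are arbitrary:

```r
# Hypothetical stand-ins for two steps that both consume random numbers.
step_a <- function() rnorm(3)
step_b <- function() runif(3)

# Seed once at the top: step_b's output depends on step_a having run first.
set.seed(123)
a1 <- step_a()
b1 <- step_b()

# To reproduce b1, everything since the seed has to be replayed.
set.seed(123)
invisible(step_a())
b1_replayed <- step_b()

# Seed immediately before step_b: it can be reproduced on its own.
set.seed(456)
b2 <- step_b()
set.seed(456)
b2_again <- step_b()

print(identical(b1, b1_replayed))
print(identical(b2, b2_again))
```

Both comparisons should come out TRUE. The difference is only in how much of the script has to be rerun to get there.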
On the other hand, many statistical methods assume that results are independent. If you want to generate 1000 random numbers, you should set the seed once at the beginning, and the RNG will approximate independence after that. Setting the seed to a different value each time is probably okay, but most RNGs are tested on the assumption that the seed is set only once, so if you set it more than once you may hit a pattern of seeds that makes the results invalid. So for final results, you should set the seed only at the start.
As @RobertLong said, setting the seed in one place also makes it more convenient to change the seed, to see whether your results hold up with a different one.
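The independence point can be illustrated with the most extreme case, which is resetting the seed to the same value between draws. This is only a sketch of that worst case in base R, with arbitrary seed values; it says nothing about reseeding with different values:

```r
# Seed once, then draw two batches: the second batch continues the stream.
set.seed(1)
batch1_once <- rnorm(5)
batch2_once <- rnorm(5)

# Reseed with the same value before each batch: the stream restarts.
set.seed(1)
batch1_reseeded <- rnorm(5)
set.seed(1)
batch2_reseeded <- rnorm(5)

# The reseeded batches are exact copies of each other, so they are
# perfectly dependent rather than independent.
print(identical(batch1_reseeded, batch2_reseeded))

# With a single seed the second batch differs from the first.
print(identical(batch1_once, batch2_once))
```

The reseeded batches are identical, so treating them as 10 independent draws would be wrong. With a single seed the two batches are different parts of one stream.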
CodePudding user response:
What is the difference between setting the seed at the beginning of an R script or in each of the training and prediction processes?
The results will be slightly different, but both will be valid.
I tend to use just one set.seed() call, generally near the top of the script. Setting the seed in multiple places is unnecessary, and it has a practical downside: if you want to change the seed, you have to change it in several places.
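One way to keep the seed in a single place and still check results against other seeds is sketched below. The function run_analysis is hypothetical and stands in for the whole script; here it just estimates a mean from simulated data:

```r
# Hypothetical stand-in for a whole script: the seed is set in one place only.
run_analysis <- function(seed) {
  set.seed(seed)               # the single set.seed() call
  x <- rnorm(1000, mean = 5)   # simulated data, purely for illustration
  mean(x)
}

# Rerun the same analysis under a few arbitrary seeds.
seeds <- c(1, 2, 3)
estimates <- sapply(seeds, run_analysis)
print(estimates)

# The same seed always gives the same result.
print(identical(run_analysis(1), run_analysis(1)))
```

The estimates should differ slightly from seed to seed while staying close to 5, which is the "slightly different, but both valid" behaviour described above. Changing the seed only means editing the seeds vector.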
