I am currently enrolled in the Google Machine Learning Crash Course. One section of the course introduces the practical application of linear regression in Python code. Below is the relevant code (the full code can be found here):
my_feature = ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
my_label = ([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])
learning_rate=0.05
epochs=100
my_batch_size= ? # Replace ? with an integer.
my_model = build_model(learning_rate)
trained_weight, trained_bias, epochs, rmse = train_model(my_model, my_feature,
                                                         my_label, epochs,
                                                         my_batch_size)
plot_the_model(trained_weight, trained_bias, my_feature, my_label)
plot_the_loss_curve(epochs, rmse)
Assume that all the required libraries are imported and the functions are defined (they are not of interest here). In the above code I am having trouble understanding the hyperparameter batch_size. The ML Wiki describes it as the number of examples in a batch. It is related to epochs (iterations?) such that N/batch_size gives the number of iterations, which I also can't make sense of when batch_size < N.
Of the three hyperparameters, I understand:
- learning_rate as the negative gradient increment value, directed towards the region of low loss.
- epochs as the number of times the examples (the complete data set) are processed.
- batch_size as the subsection of the example superset.
Please Confirm:- An example for the above data set would be {1.0, 5.0}.
Problem:- How exactly are the examples processed when the batch size is lower than N?
P.S.:- batch_size clearly seems to have a big impact on the resulting output. In a later exercise we perform regression on 17,000 examples. With a batch_size of 30 we get an RMS error% of 100, but with a batch_size of 17000 the RMS error% is 1000!
CodePudding user response:
From machinelearningmastery.com:
The batch size is a number of samples processed before the model is updated. The number of epochs is the number of complete passes through the training dataset.
Let me explain.
As you may have already learned, gradient descent can be used to update the training parameters. But, to get an accurate value of how much to update (to calculate the gradient), the algorithm can look over the errors of multiple samples of data. The size of this selected set of samples is known as the batch size.
As for the problem:
The usual case is that the batch size is lower than N (I'm assuming N=dataset size). Usually, a set of samples of size batch_size is selected randomly from the dataset. The rest of the procedure is the same as when batch_size is equal to N: find the derivative of the errors, then update the training parameters to minimize the error.
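To make the selection concrete, here is a minimal sketch in plain Python. It assumes each example is a (feature, label) pair such as (1.0, 5.0) from the question, and it assumes the batches are drawn by shuffling the dataset once per epoch and slicing it; `iterate_batches` is a hypothetical helper written for this illustration, not a function from the course. Letting the last batch be smaller when batch_size does not divide N is also a choice made here.

```python
import math
import random

def iterate_batches(features, labels, batch_size, rng):
    """Yield (feature_batch, label_batch) pairs that cover the dataset once."""
    indices = list(range(len(features)))
    rng.shuffle(indices)  # random order, so batches differ from epoch to epoch
    for start in range(0, len(indices), batch_size):
        chosen = indices[start:start + batch_size]
        yield [features[i] for i in chosen], [labels[i] for i in chosen]

# Data from the question: 12 examples, each a (feature, label) pair.
my_feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
my_label = [5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2]

batch_size = 5
batches = list(iterate_batches(my_feature, my_label, batch_size, random.Random(0)))
batch_sizes = [len(xs) for xs, _ in batches]

# One parameter update per batch, so updates per epoch = ceil(N / batch_size).
updates_per_epoch = math.ceil(len(my_feature) / batch_size)

print(batch_sizes)        # [5, 5, 2]
print(updates_per_epoch)  # 3
```

So with batch_size < N, one epoch consists of several updates rather than one, and N/batch_size (rounded up) is the number of updates per epoch.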
As for the effect of large vs. small batch sizes, there's a trade-off. A larger batch size can be useful when training on a GPU. With batch_size = N, gradient descent on a convex problem such as this linear regression also converges to the global optimum of the training loss, given a suitable learning rate. However, that can lead to overfitting, since it is the best solution for the training set only.
On the other hand, a smaller batch size allows the model to generalize better, which probably explains what you observed. The test/validation set was likely quite different from the training set. Additionally, a smaller batch size tends to be faster in general since the model starts to learn before the errors for the entire dataset are calculated.
I'm not sure what you meant by "Please confirm", but feel free to ask if I'm missing anything.
Edit 1: Explanation for when batch_size < N
- We have a line y = m*x + c.
- If you have 10 samples and the batch_size is 5, we randomly select 5 samples and get the gradients for m and c based on these 5 samples (I won't be going into the details of how to calculate the gradient. You could take a look here for a better idea on that).
- Next, we update m and c based on the learning rate and gradient. Now, one batch has been used for training.
- We have 5 remaining samples. Now, we calculate the gradient from these remaining 5 samples. Note that here we use the updated m and c to calculate the gradient.
- Next, we use that gradient to update m and c again.
This is one epoch. As you run more epochs, the line fits your data better.
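The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the course's code: `train_minibatch` is a hypothetical helper, the loss is mean squared error, and the update is plain gradient descent (the course's build_model may use a different optimizer). The learning rate of 0.005 is a value picked for this sketch, because this plain update rule is unstable at 0.05 on these unscaled features. The data is the 12-example set from the question, so a batch_size of 6 gives two batches per epoch.

```python
import numpy as np

def train_minibatch(x, y, learning_rate, epochs, batch_size, seed=0):
    """Fit y = m*x + c by mini-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    m, c = 0.0, 0.0
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(x))  # reshuffle the examples every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            # Error of the current line on this batch only.
            error = (m * x[idx] + c) - y[idx]
            # Gradients of the batch's mean squared error w.r.t. m and c.
            grad_m = 2.0 * np.mean(error * x[idx])
            grad_c = 2.0 * np.mean(error)
            # The next batch will see these updated values of m and c.
            m -= learning_rate * grad_m
            c -= learning_rate * grad_c
            updates += 1
    rmse = float(np.sqrt(np.mean((m * x + c - y) ** 2)))
    return float(m), float(c), rmse, updates

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])

m, c, rmse, updates = train_minibatch(x, y, learning_rate=0.005,
                                      epochs=100, batch_size=6)
print(f"m={m:.2f} c={c:.2f} rmse={rmse:.2f} updates={updates}")
```

With 12 examples and a batch_size of 6, each epoch performs two updates, so 100 epochs perform 200 updates in total.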
Edit 2: Explanation for why lower batch size seemed to give better results.
When batch_size = N, the gradient is calculated using the entire dataset at once. So, for a convex problem like this one and a suitable learning rate, m and c will approach their globally optimal values, since every update sees the entire dataset. The problem is that this optimum is only with respect to the training data. It could be the case that the test data is significantly different from the training data (look up the term "overfitting"). However, when we have a smaller batch size, the model doesn't fit itself to the entire dataset at each step, so it may never settle exactly at the global optimum. This can be favorable since it helps the model generalize better to inputs it wasn't trained on.
Another possible cause is that, with a smaller batch size, the model approaches a good estimate much faster than with a larger batch size. This is because the model updates itself more frequently, since it doesn't have to calculate the gradient using the entire dataset before each update. So, if you're looking at the training loss, the smaller batch size may initially have a lower loss, but eventually the larger batch size may reach an even lower loss.
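A small experiment can illustrate the update-frequency effect. The sketch below makes several assumptions: it uses the 12 examples from the question, plain gradient descent on mean squared error (not necessarily the course's optimizer), no shuffling, only 10 epochs, and a deliberately small learning rate of 0.0005 so that the difference in the number of updates is visible. `rmse_after` is a hypothetical helper for this illustration.

```python
import numpy as np

x = np.arange(1.0, 13.0)  # features 1.0 .. 12.0 from the question
y = np.array([5.0, 8.8, 9.6, 14.2, 18.8, 19.5, 21.4, 26.8, 28.9, 32.0, 33.8, 38.2])

def rmse_after(epochs, batch_size, learning_rate=0.0005):
    """Train y = m*x + c from zero; return (number of updates, training RMSE)."""
    m, c, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        for start in range(0, len(x), batch_size):
            xb = x[start:start + batch_size]
            yb = y[start:start + batch_size]
            error = m * xb + c - yb
            # One update of m and c per batch.
            m -= learning_rate * 2.0 * np.mean(error * xb)
            c -= learning_rate * 2.0 * np.mean(error)
            updates += 1
    return updates, float(np.sqrt(np.mean((m * x + c - y) ** 2)))

# Same number of epochs, very different number of updates.
for batch_size in (1, 4, 12):
    updates, rmse = rmse_after(epochs=10, batch_size=batch_size)
    print(f"batch_size={batch_size:2d}  updates={updates:3d}  rmse={rmse:.2f}")
```

For the same 10 epochs, batch_size=1 performs 120 updates while batch_size=12 performs only 10, which is why the small-batch run is further along early in training. This says nothing about which setting ends up better after many more epochs.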
These are possible reasons for your observation but this does not have to be the case. The large batch size may work better if the training set is a good representation of the entire dataset. Or it may just be that the model needs more training.
