ResourceExhaustedError on Kaggle when using an image size of (350, 300) in TensorFlow

I have images stored in a '/train' folder, and the labels are in a train.csv file. I am loading the data like this:

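import tensorflow as tf

# `train` is the pandas DataFrame holding the contents of train.csv
# (columns used below: file_name, label).
training_percentage = 0.8
training_item_count = int(len(train) * training_percentage)
validation_item_count = len(train) - training_item_count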
training_df = train[:training_item_count]
validation_df = train[training_item_count:]

batch_size = 64
image_height = 350
image_width = 300
input_shape = (image_height, image_width, 3)
dropout_rate = 0.4
classes_to_predict = sorted(training_df.label.unique())

training_data = tf.data.Dataset.from_tensor_slices((training_df.file_name.values, training_df.label.values))
validation_data = tf.data.Dataset.from_tensor_slices((validation_df.file_name.values, validation_df.label.values))

def load_image_and_label_from_path(image_path, label):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img)
    img = tf.image.convert_image_dtype(img, tf.float32)
    
    return img, label

AUTOTUNE = tf.data.experimental.AUTOTUNE

training_data = training_data.map(load_image_and_label_from_path, num_parallel_calls = AUTOTUNE)
validation_data = validation_data.map(load_image_and_label_from_path, num_parallel_calls = AUTOTUNE)

training_data_batches = training_data.shuffle(buffer_size = 500).batch(batch_size).prefetch(buffer_size = AUTOTUNE)
validation_data_batches = validation_data.shuffle(buffer_size = 500).batch(batch_size).prefetch(buffer_size = AUTOTUNE)

I think a batch size of 64 should be fine and should not cause any error, because the image size (350, 300) is not that big. Why am I getting this error?

ResourceExhaustedError:  OOM when allocating tensor with shape[64,192,88,75] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node model_4/efficientnetb4/block3a_expand_activation/Sigmoid (defined at tmp/ipykernel_34/3030206641.py:4) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_620332]

Function call stack:
train_function

This is how I am training my model - Click Here

CodePudding user response：

Your hardware (here, the GPU) does not have enough memory to allocate the tensors needed for training. Among others, you have the following options:

Decrease the batch size.
Distribute your training among several workers.
Use different training hardware, e.g. a different GPU, or change to a TPU.
Use a different model architecture.
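
To put the first option in numbers, here is a rough sketch that computes the size of the single tensor named in the error message, shape [64,192,88,75] of type float. It assumes 4 bytes per element (float32) and that the leading dimension is the batch size; the helper name tensor_bytes is made up for this illustration. This is only one intermediate tensor out of many the model keeps in memory during training, so treat it as a lower bound, not a full memory estimate.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (float32 assumed by default)."""
    return prod(shape) * bytes_per_element

# Shape taken from the error message; the first dimension is assumed to be the batch.
oom_shape = (64, 192, 88, 75)
size_at_64 = tensor_bytes(oom_shape)

# Same tensor with a smaller batch: memory shrinks linearly with batch size.
size_at_20 = tensor_bytes((20,) + oom_shape[1:])

print(f"batch 64: {size_at_64 / 2**20:.1f} MiB")
print(f"batch 20: {size_at_20 / 2**20:.1f} MiB")
```

Because every activation in the network has the batch size as its first dimension, halving the batch size roughly halves the activation memory.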

CodePudding user response：

With a 350 x 300 x 3 image you have 315,000 float values per image (105,000 pixels times 3 channels), which will consume a lot of memory. You will also have pretty high training times. So, if you can, reduce the image size to something like 250 x 215. Next, reduce your batch size to, say, 20. A lot depends on the model you are using as well. Try the values I suggested; if it runs, you can then incrementally increase the image size.
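
As a minimal sketch of the suggestion above, the loader from the question could resize each image and the pipeline could use a smaller batch. The values 250 x 215 and 20 are the ones suggested in this answer, not tuned settings, and the make_batches helper is a hypothetical wrapper added here for illustration; it assumes the files are JPEGs, as in the question.

```python
import tensorflow as tf

# Reduced settings suggested above; increase them step by step if training fits in memory.
image_height = 250
image_width = 215
batch_size = 20

AUTOTUNE = tf.data.experimental.AUTOTUNE

def load_image_and_label_from_path(image_path, label):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    # Resize so every image in a batch has the same, smaller shape.
    img = tf.image.resize(img, [image_height, image_width])
    return img, label

def make_batches(file_names, labels, shuffle=False):
    """Hypothetical helper: build a batched dataset from paths and labels."""
    data = tf.data.Dataset.from_tensor_slices((file_names, labels))
    data = data.map(load_image_and_label_from_path, num_parallel_calls=AUTOTUNE)
    if shuffle:
        data = data.shuffle(buffer_size=500)
    return data.batch(batch_size).prefetch(buffer_size=AUTOTUNE)
```

With the question's DataFrames, this would be called as make_batches(training_df.file_name.values, training_df.label.values, shuffle=True). The model's input_shape then has to be (250, 215, 3) to match.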