This is an extended version of the same question: Why does adding random numbers not break this custom loss function?
Can someone explain to me why these models both get a very good AUC, even though the loss in the second one should be extremely compromised due to adding random numbers? Since the loss functions work correctly I assume it has something to do with how I ensure reproducibility by resetting the seeds or with some theoretical problem I don't understand.
priv is an additional input I need for my full custom loss function and not necessary for this example. Also, my full loss function does not work with eager execution so I have to disable it.
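# Assumed imports, inferred from the names used below
import os
import random
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K
opt = tf.keras.optimizers.Adam(learning_rate=1e-04)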
def binary_crossentropy1(y_true, y_pred):
    bin_cross = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    bce1 = K.mean(bin_cross(y_true, y_pred))
    return bce1
def binary_crossentropy2(y_true, y_pred):
    bin_cross = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    # Same loss as above, plus a random scalar drawn from N(0, 10^2)
    bce2 = K.mean(bin_cross(y_true, y_pred)) + tf.random.normal([], mean=0.0, stddev=10.0)
    return bce2
def reset_random_seeds():
    os.environ['PYTHONHASHSEED'] = str(1)
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
#model 1
reset_random_seeds()
input1 = keras.Input(shape=(9,))
priv = keras.Input(shape=(1,))
x = layers.Dense(12, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(input1)
x = layers.Dense(8, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
output = layers.Dense(1, activation="sigmoid", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
model1 = keras.Model(inputs=[input1, priv], outputs=output)
model1.compile(optimizer=opt, loss=binary_crossentropy1)
model1.fit(x=[X_train, priv_train], y=y_train_float, epochs=10, batch_size = 32)
model1_pred = model1.predict([X_test,priv_test])
print(model1_pred)
#model 2
reset_random_seeds()
input1 = keras.Input(shape=(9,))
priv = keras.Input(shape=(1,))
x = layers.Dense(12, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(input1)
x = layers.Dense(8, activation="relu", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
output = layers.Dense(1, activation="sigmoid", kernel_initializer=keras.initializers.glorot_uniform(seed=123))(x)
model2 = keras.Model(inputs=[input1, priv], outputs=output)
model2.compile(optimizer=opt, loss=binary_crossentropy2)
model2.fit(x=[X_train, priv_train], y=y_train_float, epochs=10, batch_size = 32)
model2_pred = model2.predict([X_test,priv_test])
print(model2_pred)
Answer:
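You've been taught that the loss controls model training. That's an incomplete story. What really controls training is the gradient of the loss with respect to the weights. The random number you add does not depend on the weights, so its derivative with respect to them is zero and the gradient is unchanged. Only the reported loss value becomes noisy; the weight updates, and therefore the predictions and the AUC, are not affected by the added term.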
Imagine a mountain with a slope of 30 degrees. A ball placed on it rolls downhill.
Now imagine raising that entire mountain by 10 feet. The slope is the same, so the ball still rolls in the same direction.
That's the intuition here.
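The same intuition can be checked numerically. The following is a minimal sketch in plain Python, not the TensorFlow code from the question: it assumes a hypothetical one-weight logistic model with made-up data (`XS`, `YS`) and estimates the slope of the loss by finite differences. The names `bce`, `slope`, and `noisy_bce` are illustrative only. The noise is drawn once, the way a single training step would draw it, and is independent of the weight.

```python
import math
import random

# Made-up toy data for a hypothetical one-weight model p = sigmoid(w * x).
XS = [-2.0, -1.0, 0.5, 1.5, 3.0]
YS = [0.0, 0.0, 1.0, 1.0, 1.0]

def bce(w):
    """Mean binary cross-entropy of the toy model at weight w."""
    total = 0.0
    for x, y in zip(XS, YS):
        p = 1.0 / (1.0 + math.exp(-w * x))
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(XS)

def slope(loss_fn, w, eps=1e-5):
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2.0 * eps)

random.seed(1)
w = 0.3
# One draw for this "training step"; it does not depend on w.
noise = random.gauss(0.0, 10.0)

def noisy_bce(w):
    """Same loss shifted by the fixed random offset."""
    return bce(w) + noise

plain_slope = slope(bce, w)
noisy_slope = slope(noisy_bce, w)
print("loss:", bce(w), "noisy loss:", noisy_bce(w))
print("slope:", plain_slope, "noisy slope:", noisy_slope)
```

The two loss values differ by the offset, while the two slopes agree up to floating-point rounding, so a gradient step would move `w` the same way in both cases.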
The first place this applies that comes to mind is the equivalence of KL divergence and cross-entropy as losses, despite their being different equations. They differ only by the entropy of the true distribution, which is constant with respect to the model's parameters, so the two produce the same gradients. This suggests a connection between cross-entropy, which comes from frequentist maximum likelihood estimation, and KL divergence, which is normally more of an information-theory concept.
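The constant offset between the two losses can also be checked with a few lines of plain Python. This is a sketch under assumptions: the distribution `p` and the candidate predictions are made-up values, and the helper names are illustrative. For each prediction `q`, cross-entropy minus KL divergence should equal the entropy of `p`, which does not depend on `q`.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Fixed "true" distribution and a few candidate model outputs (made-up values).
p = [0.7, 0.2, 0.1]
predictions = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]

# The gap between the two losses for each prediction.
gaps = [cross_entropy(p, q) - kl_divergence(p, q) for q in predictions]
print("H(p):", entropy(p))
print("cross-entropy minus KL:", gaps)
```

Because the gap is the same for every `q`, minimizing either quantity over `q` follows the same gradients.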
