
Comments (5)

ageron commented on May 18, 2024

Hi @gkd720 ,
Thanks for your question! This difference may be due to the fact that the "sgd" optimizer used to have a default learning rate of 1e-3, but this was changed recently to 1e-2 (presumably to match the default in the multi-backend Keras implementation at keras.io).
So please use optimizer=keras.optimizers.SGD(lr=1e-3) instead of "sgd", and I'm guessing everything will fall back into place.
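For concreteness, a minimal sketch of what that looks like (the architecture here is just a placeholder, not the one from the exercise):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),  # pin the old 1e-3 default explicitly
              metrics=["accuracy"])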


gkd720 commented on May 18, 2024

Yep. That was it. Thanks. Got to 88.25% training accuracy, 87.9% validation accuracy, and 86.6% on the test set after 70 epochs. Both validation and training loss are still decreasing pretty much monotonically, without any hyperparameter tuning, and with a val/train loss ratio of 1.04, so no overfitting suspected yet, and the accuracies might still drift up somewhat.
But back to my early stopping criteria questions above: what do the cool data scientists use (actual improvement amounts, loss ratio, accuracy ratio, etc.)? Or do they just eyeball an accuracy and think "good enough"? Thanks again.


ageron commented on May 18, 2024

Thanks for your feedback, glad to know that setting the learning rate to 1e-3 fixed the discrepancy. Regarding your early stopping question, I'm not sure about cool data scientists, but I think the rest of us use the validation metric as the criterion: we're interested in how well the system will perform on new instances, and the validation metric is a fairly good way to estimate this, so if it stops improving, we should stop training.
Comparing the training loss and the validation loss is useful for knowing how bad the overfitting is. If the spread is large, you don't just want to stop training, you also want to fix the problem, for example by training on more data or regularizing the model, which implies retraining.
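A minimal sketch of that criterion with the built-in callback (the min_delta and patience values are arbitrary placeholders, and model, X_train, etc. are assumed to be defined as in the code further down):

from tensorflow import keras

# Stop once the validation loss has not improved by at least min_delta
# for `patience` epochs, and roll back to the best weights seen so far.
early_stopping_cb = keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  min_delta=1e-4,
                                                  patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping_cb])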
Hope this helps!


gkd720 commented on May 18, 2024

In your "If the spread is large", does that mean a large ratio, or just a large difference?
Anyway, I went ahead and did Chapter 10 Exercise 10 on the MNIST digits dataset (D'oh! I now suspect you meant the fashion dataset. Oh well, it was a good exercise and I wanted to try hyperparameter tuning.) Trying some typical ranges, it ran for almost 2 hours on my late 2012 iMac with a 3.4 GHz Intel Core i7 and 16 GB of memory, running at 800% CPU (4 hyper-threaded physical cores), which quickly kicked in the fan. The best model achieved 98.01% (WooHoo! . . . unless the digits are just not that hard compared to the fashion set).

Is it possible to see live results in the Jupyter notebook during the tuning run? I see the results flying by in my terminal window, but most of it scrolled out of the display buffer. I did see a warning about output going to stderr before some kind of "flag" assignment, so I guess I would have to explicitly write them to a file? I did see an occasional model summary fly by, so writing from the build_model function can work. Would assigning the fit results to a "history" variable have forced the output to Jupyter? Could they be seen live, or only after all experimentation was completed?

My concern is how to detect overfitting before all the runs are done. Since the test accuracy is so close to the train/valid results, should I even care if later evaluation suggests overfitting (0.0652/0.0046 == too big?)?

Code/results below for review, or as another data point in the solution space. Thanks.

Update (6/21/2019): I reran this exercise with the MNIST digits and fashion datasets, but I get the exact same model/params when done! Is this possible? And now, the test score for the digits run is 17%!! I checked my data assignment/manipulation, and the only difference is which keras.datasets module I use (mnist vs. fashion_mnist). What might I be doing wrong? Thanks.

Update (6/27/2019): OK, I suspected it was my fault, and it looks like it was, but I can't figure out how I messed up. Anyway, rerunning both the fashion and digits tuning runs at the same time, with fewer combinations (lower n_iter and fewer epochs), gives much more typical results: fashion at 86% and digits at 96%. I still need to look into getting live data during the runs, so I'll experiment with "history".

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Digits run shown below; the fashion run only swaps in keras.datasets.fashion_mnist.
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()

X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.

def build_model(n_hidden=1, n_neurons=100, learning_rate=3e-3, input_shape=[28,28]):
    model = keras.models.Sequential()
    model.add(keras.layers.Flatten(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
        n_neurons = n_neurons // 3     # each successive layer keeps one third of the previous layer's neurons (integer division)
    model.add(keras.layers.Dense(10, activation="softmax"))     # We know the last layer must pick 1 of 10 classes
    optimizer = keras.optimizers.SGD(learning_rate)
    model.summary()
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])     # Need this for early stopping.
    return model

keras_clf = keras.wrappers.scikit_learn.KerasClassifier(build_model)

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_hidden": [1, 2, 3],
    "n_neurons": np.arange(100, 500),
    "learning_rate": reciprocal(3e-4, 3e-2),
}

np.random.seed(42)
tf.random.set_seed(42)
# May have to Kernel -> Reconnect to see all results.
rnd_search_cv = RandomizedSearchCV(keras_clf, param_distribs, n_iter=20, cv=3, verbose=2, n_jobs=-1)
rnd_search_cv.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 54.9min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 105.8min finished

. . .
Epoch 31/100
55000/55000 [==============================] - 3s 63us/sample - loss: 0.0046 - accuracy: 0.9998 - val_loss: 0.0652 - val_accuracy: 0.9820
. . .

model = rnd_search_cv.best_estimator_.model
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_4 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 341)               267685    
_________________________________________________________________
dense_12 (Dense)             (None, 113)               38646     
_________________________________________________________________
dense_13 (Dense)             (None, 10)                1140      
=================================================================
Total params: 307,471
Trainable params: 307,471
Non-trainable params: 0


rnd_search_cv.best_params_
Out[54]:
{'learning_rate': 0.02298924804076755, 'n_hidden': 2, 'n_neurons': 341}


model.evaluate(X_train, y_train)
55000/55000 [==============================] - 2s 31us/sample - loss: 0.0041 - accuracy: 0.9999
Out[56]:
[0.004096633061076599, 0.99985456]


model.evaluate(X_valid, y_valid)
5000/5000 [==============================] - 0s 31us/sample - loss: 0.0652 - accuracy: 0.9820
Out[57]:
[0.06520095226885751, 0.982]

model.evaluate(X_test, y_test)
10000/10000 [==============================] - 0s 33us/sample - loss: 0.0688 - accuracy: 0.9801
Out[59]:
[0.06882534091388516, 0.9801]


ageron commented on May 18, 2024

Sorry for the late response.

98% accuracy on MNIST is good. With convolutional neural nets, you can go beyond 99% accuracy, and with data augmentation, ensembling, learning rate schedules, and so on, you can reach 99.7%.
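For reference, a rough sketch of the kind of convnet meant here; the layer sizes are arbitrary placeholders, not a tuned architecture from this thread, and the 28x28 images need an extra channel dimension:

import numpy as np
from tensorflow import keras

# The dense-network code above feeds 28x28 arrays; a convnet needs a
# trailing channel dimension, e.g. X_train[..., np.newaxis].
cnn = keras.models.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", padding="same",
                        input_shape=[28, 28, 1]),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax"),
])
cnn.compile(loss="sparse_categorical_crossentropy",
            optimizer="nadam", metrics=["accuracy"])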

To get live results in Jupyter, you probably should use the TensorBoard callback when you call the model.fit() method, and use the %tensorboard magic command to view progress.
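A sketch of that setup for a single model.fit() call (the log directory name is arbitrary):

import os
from tensorflow import keras

run_logdir = os.path.join(os.curdir, "my_logs", "run_001")   # arbitrary location
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])

# Then, in a separate notebook cell:
# %load_ext tensorboard
# %tensorboard --logdir=./my_logs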

Regarding your question about detecting overfitting: if you train your model and get good performance on the validation set, you might not care that there's a bit of overfitting. So usually what happens is that you are unhappy with the validation performance, you try to understand what is happening, and the gap between the training metric and the validation metric is so great that you conclude it's an overfitting problem, so you regularize your model, or you find more training data, or you reduce the size of your neural network (fewer layers, fewer neurons...), and so on.

I've never tried to automate the interruption of training based on an overfitting criterion, but I guess you could. I think I would wait until the training metric reaches some threshold, and I would then stop training if and when the valid/train metric ratio reaches some other threshold. I'm not sure that's great, it's just my first hunch. However, if we interrupt training while the validation metric is still improving, we run the risk of throwing away a model that would have been fantastic, despite some overfitting.

What matters in the end is the generalization performance. I would prefer a model with 90% accuracy on the test set and 100% on the training set over a model with 80% accuracy on the test set and 85% on the training set. The first one overfits more than the second, but it's still much better.
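A rough sketch of that hunch as a custom Keras callback; both thresholds are arbitrary placeholders, not values recommended in this thread:

from tensorflow import keras

class OverfitStopping(keras.callbacks.Callback):
    """Wait until the training accuracy is decent, then stop training
    once the val/train loss ratio crosses a threshold."""
    def __init__(self, min_train_accuracy=0.95, max_loss_ratio=2.0):
        super().__init__()
        self.min_train_accuracy = min_train_accuracy
        self.max_loss_ratio = max_loss_ratio

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        acc = logs.get("accuracy", 0.)
        ratio = logs.get("val_loss", 0.) / max(logs.get("loss", 0.), 1e-12)
        if acc >= self.min_train_accuracy and ratio >= self.max_loss_ratio:
            print(f"\nStopping: val/train loss ratio {ratio:.2f} at epoch {epoch + 1}")
            self.model.stop_training = True

# Usage, e.g.: model.fit(..., callbacks=[OverfitStopping()])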

Hope this helps.

