Massive overfit during resnet50 transfer learning


Solution 1

You need to use the preprocessing_function argument in ImageDataGenerator.

 train_datagen = ImageDataGenerator(preprocessing_function=keras.applications.resnet50.preprocess_input)

This will ensure that your images are pre-processed as expected for the pre-trained network you are using.
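To illustrate why this matters, here is a minimal pure-Python sketch comparing what a single pixel looks like after `rescale=1./255` with what it looks like after "caffe"-style preprocessing (RGB to BGR, then subtracting the ImageNet channel means, with no scaling). That `keras.applications.resnet50.preprocess_input` works this way is stated here as an assumption about the installed Keras version, and the helper names are hypothetical.

```python
# ImageNet per-channel means in BGR order (assumed to match what Keras uses
# for ResNet50; verify against your installed version).
IMAGENET_MEAN_BGR = (103.939, 116.779, 123.68)


def caffe_style_preprocess(rgb_pixel):
    """Hypothetical helper: reorder RGB -> BGR and subtract the channel means."""
    r, g, b = rgb_pixel
    bgr = (b, g, r)
    return tuple(c - m for c, m in zip(bgr, IMAGENET_MEAN_BGR))


def rescale(rgb_pixel, factor=1.0 / 255):
    """Hypothetical helper: what ImageDataGenerator(rescale=1./255) does per value."""
    return tuple(c * factor for c in rgb_pixel)


pixel = (255.0, 128.0, 0.0)  # an arbitrary orange-ish RGB pixel
expected_by_resnet = caffe_style_preprocess(pixel)
fed_by_rescale = rescale(pixel)

print("caffe-style:", expected_by_resnet)
print("rescaled:   ", fed_by_rescale)
```

The two results differ in range, sign and channel order, so pre-trained weights that expect one would see very different inputs under the other.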

Solution 2

I implemented various architectures for transfer learning and observed that models containing BatchNorm layers (e.g. Inception, ResNet, MobileNet) perform a lot worse during evaluation (validation/test) on my custom dataset than models without BatchNorm layers (e.g. VGG): ~30 % compared to >95 % test accuracy. Furthermore, this problem does not occur when saving bottleneck features and using them for classification. There are already a few blog entries, forum threads, issues and pull requests on this topic, and it turns out that the BatchNorm layer, when frozen, doesn't use the new dataset's statistics but the original dataset's (ImageNet) statistics:

Assume you are building a Computer Vision model but you don’t have enough data, so you decide to use one of the pre-trained CNNs of Keras and fine-tune it. Unfortunately, by doing so you get no guarantees that the mean and variance of your new dataset inside the BN layers will be similar to the ones of the original dataset. Remember that at the moment, during training your network will always use the mini-batch statistics either the BN layer is frozen or not; also during inference you will use the previously learned statistics of the frozen BN layers. As a result, if you fine-tune the top layers, their weights will be adjusted to the mean/variance of the new dataset. Nevertheless, during inference they will receive data which are scaled differently because the mean/variance of the original dataset will be used.

cited from http://blog.datumbox.com/the-batch-normalization-layer-of-keras-is-broken/
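The mismatch described in the quote can be sketched numerically. The example below is a toy illustration in plain Python, not Keras code: the activation values and the stored "original dataset" statistics are made up, and `eps=1e-3` is an assumed default.

```python
def mean_of(values):
    return sum(values) / len(values)


def variance_of(values):
    m = mean_of(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def normalize(values, mean, var, eps=1e-3):
    """Core of batch normalization, without the learned scale and offset."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical activations produced by the NEW dataset inside a BN layer.
new_batch = [10.0, 12.0, 14.0, 16.0]

# Hypothetical moving statistics stored from the ORIGINAL dataset.
original_mean, original_var = 0.0, 1.0

# Training: normalized with the mini-batch's own statistics.
seen_in_training = normalize(new_batch, mean_of(new_batch), variance_of(new_batch))

# Inference: normalized with the stored statistics of the original dataset.
seen_at_inference = normalize(new_batch, original_mean, original_var)

print("training: ", seen_in_training)
print("inference:", seen_at_inference)
```

The layers on top are fitted to the roughly zero-centered values from the first call, but at evaluation time they receive the shifted values from the second.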

A workaround is to first freeze all layers and then unfreeze all BatchNormalization layers to make them use the new dataset's statistics instead of the original statistics:

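# imports assumed for this snippet (standalone Keras; adjust for tf.keras)
from keras.applications import inception_v3
from keras.layers import Input, Dense, Dropout
from keras.models import Model

# build model (train_generator is assumed to be defined already)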
input_tensor = Input(shape=train_generator.image_shape)
base_model = inception_v3.InceptionV3(input_tensor=input_tensor,
                                      include_top=False,
                                      weights='imagenet',
                                      pooling='avg')
x = base_model.output

# freeze all layers in the base model
base_model.trainable = False

# un-freeze the BatchNorm layers
for layer in base_model.layers:
    if "BatchNormalization" in layer.__class__.__name__:
        layer.trainable = True

# add custom layers
x = Dense(1024, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(train_generator.num_classes, activation='softmax')(x)

# define new model
model = Model(inputs=input_tensor, outputs=x)
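The freeze/unfreeze step can also be written as a small reusable helper. This is only a sketch: `freeze_all_but_batchnorm` is a hypothetical name, and the two stand-in classes exist solely so the example runs without Keras installed.

```python
def freeze_all_but_batchnorm(layers):
    """Freeze every layer except BatchNormalization ones; return the unfrozen names."""
    unfrozen = []
    for layer in layers:
        # Same class-name check as in the snippet above.
        is_batchnorm = "BatchNormalization" in layer.__class__.__name__
        layer.trainable = is_batchnorm
        if is_batchnorm:
            unfrozen.append(layer.name)
    return unfrozen


# Stand-in layer classes (NOT Keras) exposing only .name and .trainable.
class Conv2D:
    def __init__(self, name):
        self.name = name
        self.trainable = True


class BatchNormalization:
    def __init__(self, name):
        self.name = name
        self.trainable = True


toy_layers = [Conv2D("conv1"), BatchNormalization("bn1"),
              Conv2D("conv2"), BatchNormalization("bn2")]
unfrozen = freeze_all_but_batchnorm(toy_layers)
print(unfrozen)
```

With a real model, the equivalent call would presumably be `freeze_all_but_batchnorm(base_model.layers)`, made before `model.compile(...)` so that the changed `trainable` flags take effect.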

This also explains the difference in performance between two approaches: (1) training the model with frozen layers and evaluating it on a validation/test set, and (2) saving bottleneck features and training a classifier on the cached features. In the second case the features come from model.predict, for which the internal backend flag set_learning_phase is set to 0.

More information here:

Pull request to change this behavior (not accepted): https://github.com/keras-team/keras/pull/9965

Similar thread: https://datascience.stackexchange.com/questions/47966/over-fitting-in-transfer-learning-with-small-dataset/72436#72436

Solution 3

Have you found a workaround for your problem? If not, this might be an issue with the batch norm layers in your ResNet. I have faced a similar issue, because in Keras the batch norm layer behaves very differently during training and testing. You can freeze all BN layers by calling them with training=False:

x = BatchNormalization()(x, training=False)

and then try to retrain your network on the same dataset. One more thing to keep in mind: during training you should set the learning-phase flag as

import keras.backend as K
K.set_learning_phase(1)

and during testing set this flag to 0. I think it should work after making the above changes.
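To make the effect of the training flag concrete, here is a toy batch norm layer in plain Python. It is a sketch of the general technique, not the Keras implementation: the class name is hypothetical, and the momentum of 0.99, epsilon of 1e-3 and initial moving statistics of 0 and 1 are assumed defaults.

```python
class ToyBatchNorm:
    """Hypothetical, simplified batch norm without learned scale/offset."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, values, training):
        if training:
            # Training mode: use mini-batch statistics and update the moving ones.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            self.moving_mean = (self.momentum * self.moving_mean
                                + (1 - self.momentum) * mean)
            self.moving_var = (self.momentum * self.moving_var
                               + (1 - self.momentum) * var)
        else:
            # Inference mode: use the stored statistics and leave them untouched.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in values]


bn = ToyBatchNorm()
batch = [10.0, 12.0, 14.0, 16.0]  # made-up activations

bn(batch, training=False)
stats_after_inference = (bn.moving_mean, bn.moving_var)

bn(batch, training=True)
stats_after_training = (bn.moving_mean, bn.moving_var)

print(stats_after_inference, stats_after_training)
```

Calling the layer with `training=False` leaves the stored statistics as they were, which is the behaviour the freezing advice above relies on.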

If you have found any other solution to the problem, please post it here so that others can benefit from it.

Thank you.

Solution 4

I am also working on a very small dataset and encountered the same problem of validation accuracy being stuck at some point although the training accuracy keeps going higher. I also noticed that my validation loss was getting higher as well over time. FYI, I am using Resnet 50 and InceptionV3 models.

After some digging on the internet, I found a discussion on GitHub that connects this problem to the implementation of Batch Normalization layers in Keras. The problem is encountered when applying transfer learning and fine-tuning the network. I am not sure if you have the same problem, but I have added the link to GitHub below, where you can read more about it and try some tests that will help you understand whether you are affected.

Github link to the pull request and discussion

Author: morienor

Updated on June 06, 2022

Comments

  • morienor
    morienor almost 2 years

    This is my first attempt at doing something with CNNs, so I am probably doing something very stupid - but can't figure out where I am wrong...

    The model seems to be learning fine, but the validation accuracy is not improving (ever, even after the first epoch), and validation loss is actually increasing with time. It doesn't look like I am overfitting (after 1 epoch?), so something must be off in some other way.

    [Figure: typical network behaviour]

    I am training a CNN. I have ~100k images of various plants (1000 classes) and want to fine-tune ResNet50 to create a multiclass classifier. The images are of various sizes; I load them like so:

    import numpy as np
    from keras.preprocessing import image
    
    def path_to_tensor(img_path):
        # loads RGB image as PIL.Image.Image type
        img = image.load_img(img_path, target_size=(IMG_HEIGHT, IMG_HEIGHT))
        # convert PIL.Image.Image type to 3D tensor with shape (IMG_HEIGHT, IMG_HEIGHT, 3)
        x = image.img_to_array(img)
        # convert 3D tensor to 4D tensor with shape (1, IMG_HEIGHT, IMG_HEIGHT, 3) and return 4D tensor
        return np.expand_dims(x, axis=0)
    
    def paths_to_tensor(img_paths):
        list_of_tensors = [path_to_tensor(img_path) for img_path in img_paths] #can use tqdm(img_paths) for data
        return np.vstack(list_of_tensors)
    

    The database is large (it does not fit into memory), so I had to create my own generator to handle both reading from disk and augmentation. (I know Keras has .flow_from_directory(), but my data is not structured this way: it is just a dump of 100k images mixed with 100k metadata files.) I probably should have written a script to structure the files better instead of creating my own generators, but the problem is likely somewhere else.

    The generator version below doesn't do any augmentation for the time being - just rescaling:

    def generate_batches_from_train_folder(images_to_read, labels, batchsize = BATCH_SIZE):    
    
        #Generator that returns batches of images ('xs') and labels ('ys') from the train folder
        #:param string filepath: Full filepath of files to read - this needs to be a list of image files
        #:param np.array: list of all labels for the images_to_read - those need to be one-hot-encoded
        #:param int batchsize: Size of the batches that should be generated.
        #:return: (ndarray, ndarray) (xs, ys): Yields a tuple which contains a full batch of images and labels. 
    
        dimensions = (BATCH_SIZE, IMG_HEIGHT, IMG_HEIGHT, 3)
    
        train_datagen = ImageDataGenerator(
            rescale=1./255,
            #rotation_range=20,
            #zoom_range=0.2, 
            #fill_mode='nearest',
            #horizontal_flip=True
        )
    
        # needs to be on a infinite loop for the generator to work
        while 1:
            filesize = len(images_to_read)
    
            # count how many entries we have read
            n_entries = 0
            # as long as we haven't read all entries from the file: keep reading
            while n_entries < (filesize - batchsize):
    
                # start the next batch at index 0
                # create numpy arrays of input data (features) 
                # - this is already shaped as a tensor (output of the support function paths_to_tensor)
                xs = paths_to_tensor(images_to_read[n_entries : n_entries + batchsize])
    
                # and label info. Contains 1000 labels in my case for each possible plant species
                ys = labels[n_entries : n_entries + batchsize]
    
                # we have read one more batch from this file
                n_entries += batchsize
    
                #perform online augmentation on the xs and ys
                augmented_generator = train_datagen.flow(xs, ys, batch_size = batchsize)
    
            yield  next(augmented_generator)
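As pasted, the `yield` in the generator above sits outside the inner `while` loop, so each pass over the data would hand back only its last batch. Whether that reflects the original code or is a paste artifact is unclear. For comparison, a minimal sketch of a generator that yields every batch (plain Python, no augmentation; `iterate_batches` and the file names are hypothetical):

```python
from itertools import islice


def iterate_batches(items, labels, batch_size):
    """Endlessly yield (items, labels) batches, dropping any trailing partial batch."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    if batch_size > len(items):
        raise ValueError("batch_size is larger than the dataset")
    while True:  # loop forever, as Keras-style generators are expected to
        for start in range(0, len(items) - batch_size + 1, batch_size):
            stop = start + batch_size
            # The yield is INSIDE the batch loop, so every batch is emitted.
            yield items[start:stop], labels[start:stop]


paths = [f"img_{i}.jpg" for i in range(5)]  # made-up file names
labels = list(range(5))                     # made-up labels

first_three = list(islice(iterate_batches(paths, labels, 2), 3))
for xs, ys in first_three:
    print(xs, ys)
```

In the real pipeline, each yielded slice would still need to be loaded into tensors and preprocessed before being passed to the model.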
    

    This is how I define my model:

    def get_model():
    
        # define the model
        base_net = ResNet50(input_shape=DIMENSIONS, weights='imagenet', include_top=False)
    
        # Freeze the layers which you don't want to train. Here I am freezing all of them
        for layer in base_net.layers:
            layer.trainable = False
    
        x = base_net.output
    
        #for resnet50
        x = Flatten()(x)
        x = Dense(512, activation="relu")(x)
        x = Dropout(0.5)(x)
        x = Dense(1000, activation='softmax', name='predictions')(x)
    
        model = Model(inputs=base_net.input, outputs=x)
    
        # compile the model 
        model.compile(
            loss='categorical_crossentropy',
            optimizer=optimizers.Adam(1e-3),
            metrics=['acc'])
    
        return model
    

    So, as a result I have 1,562,088 trainable parameters for roughly 70k images
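The parameter count can be reproduced with a short calculation. The sketch below assumes the flattened ResNet50 output has 2048 values; the post does not state this, but it is the size that makes the total match.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


flattened_features = 2048  # assumption: base network output flattens to 2048 values

hidden = dense_params(flattened_features, 512)  # Dense(512)
output = dense_params(512, 1000)                # Dense(1000)
total = hidden + output                         # Flatten and Dropout add no parameters

print(hidden, output, total)
```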

    I then use 5-fold cross-validation, but the model doesn't work on any of the folds, so I will not include the full code here. The relevant bit is this:

    trial_fold = temp_model.fit_generator(
                    train_generator,
                    steps_per_epoch = len(X_train_path) // BATCH_SIZE,
                    epochs = 50,
                    verbose = 1,
                    validation_data = (xs_v,ys_v),#valid_generator,
                    #validation_steps= len(X_valid_path) // BATCH_SIZE,
                    callbacks = callbacks,
                    shuffle=True)
    

    I have done various things - made sure my generator is actually working, tried to play with the last few layers of the network by reducing the size of the fully connected layer, tried augmentation - nothing helps...

    I don't think the number of parameters in the network is too large - I know other people have done pretty much the same thing and got accuracy closer to 0.5, but my models seem to be overfitting like crazy. Any ideas on how to tackle this will be much appreciated!

    Update 1:

    I have decided to stop reinventing stuff and sorted my files to work with the .flow_from_directory() procedure. To make sure I am importing the right format (triggered by the Ioannis Nasios comment below), I made sure to use the preprocess_input() function from Keras's resnet50 application.

    I also decided to check whether the model is actually producing something useful: I computed bottleneck features for my dataset and then used a random forest to predict the classes. It did work, and I got an accuracy of around 0.4.

    So, I guess I definitely had a problem with an input format of my images. As a next step, I will fine-tune the model (with a new top layer) to see if the problem remains...

    Update 2:

    I think the problem was with image preprocessing. I ended up not fine-tuning and just extracted the bottleneck features and trained linear_SVC() on them, which gave an accuracy of around 60% on the train dataset and around 45% on the test dataset.

  • morienor
    morienor about 6 years
    I know it would be too small for training a new network from scratch. But I thought the 1000-images-per-class rule does not apply to transfer learning? When I used augmentation, the accuracy on the train sample did not increase as fast, but the validation accuracy was still stuck at the same level, ~0.008.
  • morienor
    morienor almost 6 years
    I thought so as well for a few weeks and gave up, but then I found a paper that was using the same dataset and (older) CNN. They did not have the same problem and were able to achieve multiclass accuracy of around 0.6 - there is clearly a problem somewhere with my implementation, not the approach itself. Here is a link: ceur-ws.org/Vol-1391/121-CR.pdf
  • thefifthjack005
    thefifthjack005 almost 6 years
    @morienor i am downloading the dataset and will try to run it using your implementation.
  • Hamid K
    Hamid K about 5 years
    I believe your answer is correct, and I have tested it. One thing I do not understand: I loaded the ResNet model with all weights but without the top FC layers, set training to False for all layers including the batch normalization layers, then added my FC layers on top and let the model train (I set the learning phase to 1). What do you mean by setting the learning phase to 0 during testing? If I save my model, load it, and ask for predictions on my test set, why do I need to set the learning_phase? Is it because of the BatchNormalization layer?
  • Ankit Dixit
    Ankit Dixit about 5 years
    BatchNormalization and Dropout are the two layers that change behavior between training and testing. So, to remind Keras, it's better to set this flag in both cases.
  • Japesh Methuku
    Japesh Methuku almost 4 years
    Hi, could you please share the notebook where you applied these changes to ResNet50? That would be of great help. After going through all the available resources and the related issues on GitHub and Keras, I still couldn't really understand the information about inference mode and setting BatchNorm to True/False. I think you understand how painful it is. I would appreciate your help with this. Thank you.
  • CMCDragonkai
    CMCDragonkai almost 4 years
    If we use BatchNormalization()(x, training=False) and later set the layer to l.trainable = False, does that still ensure that the layer runs in inference mode (while remaining frozen)?
  • CMCDragonkai
    CMCDragonkai almost 4 years
    This ended up working but this solution is different from how TF2.0 ends up solving the problem by forcing batch norm into inference mode when it is frozen.