TensorFlow - regularization with L2 loss, how to apply to all weights, not just last one?


Solution 1

hidden_weights, hidden_biases, out_weights, and out_biases are all the model parameters that you are creating. You can add L2 regularization to all of these parameters as follows:

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels)) +
    0.01*tf.nn.l2_loss(hidden_weights) +
    0.01*tf.nn.l2_loss(hidden_biases) +
    0.01*tf.nn.l2_loss(out_weights) +
    0.01*tf.nn.l2_loss(out_biases))
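
For reference, tf.nn.l2_loss(t) returns a scalar: half the sum of the squared entries of t, i.e. sum(t ** 2) / 2. The following is a minimal plain-Python sketch of how such a penalty combines with the data loss. The helper names and the toy numbers are illustrative assumptions, not part of the original code.

```python
# Plain-Python stand-in for tf.nn.l2_loss: half the sum of squared entries.
def l2_loss(values):
    return sum(v * v for v in values) / 2.0


def total_loss(data_loss, params, beta=0.01):
    """Add beta * l2_loss(p) for every parameter list in params."""
    return data_loss + beta * sum(l2_loss(p) for p in params)


# Hypothetical flattened parameters, for illustration only.
hidden_weights = [3.0, -4.0]   # l2_loss = (9 + 16) / 2 = 12.5
out_weights = [1.0, 2.0, 2.0]  # l2_loss = (1 + 4 + 4) / 2 = 4.5

# data_loss stands in for the mean softmax cross-entropy of a batch.
loss = total_loss(data_loss=1.0, params=[hidden_weights, out_weights])
print(loss)
```

Because each penalty is already a scalar, it is simply added to the averaged cross-entropy; it does not depend on the batch size.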

Following @Keith Johnson's note not to regularize the biases:

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels)) +
    0.01*tf.nn.l2_loss(hidden_weights) +
    0.01*tf.nn.l2_loss(out_weights))

Solution 2

A shorter and more scalable way of doing this would be:

vars   = tf.trainable_variables() 
lossL2 = tf.add_n([ tf.nn.l2_loss(v) for v in vars ]) * 0.001

This sums the l2_loss of all your trainable variables and scales the result by 0.001. You could also make a dictionary holding only the variables you want to include in your cost and apply the second line above to its values. Then add lossL2 to your softmax cross-entropy value to get your total loss.

Edit: As mentioned by Piotr Dabkowski, the code above will also regularize the biases. This can be avoided by adding an if clause to the list comprehension in the second line:

lossL2 = tf.add_n([ tf.nn.l2_loss(v) for v in vars
                    if 'bias' not in v.name ]) * 0.001

The same condition can be used to exclude other variables by name.
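
Note that the filter matches on v.name, so it only works if the bias variables were created with a name containing "bias". Variables created without a name argument (as in the question's code) receive default names such as Variable:0 and Variable_1:0, so none of them would be excluded. The sketch below uses a hypothetical FakeVar stand-in for tf.Variable (an assumption, so that it runs without TensorFlow) and compares the name filter with an alternative that is not from the answer: skipping rank-1 variables by shape.

```python
from collections import namedtuple

# Hypothetical stand-in for tf.Variable: only the fields the filters read.
FakeVar = namedtuple("FakeVar", ["name", "shape"])

# Variables created with explicit, descriptive names.
named = [
    FakeVar("hidden_weights:0", (784, 1024)),
    FakeVar("hidden_biases:0", (1024,)),
    FakeVar("out_weights:0", (1024, 10)),
    FakeVar("out_biases:0", (10,)),
]
# Default names assigned when no name argument is given.
unnamed = [
    FakeVar("Variable:0", (784, 1024)),
    FakeVar("Variable_1:0", (1024,)),
    FakeVar("Variable_2:0", (1024, 10)),
    FakeVar("Variable_3:0", (10,)),
]


def by_name(variables):
    """The answer's filter: drop variables whose name contains 'bias'."""
    return [v for v in variables if "bias" not in v.name]


def by_rank(variables):
    """Alternative (an assumption): keep matrices, drop 1-D vectors."""
    return [v for v in variables if len(v.shape) > 1]


print(len(by_name(named)))    # 2: both biases excluded
print(len(by_name(unnamed)))  # 4: nothing excluded
print(len(by_rank(unnamed)))  # 2: biases excluded regardless of name
```

With real TensorFlow variables the rank check would look similar, for example len(v.shape) > 1, but treat that as an assumption to verify against your TensorFlow version. It also excludes any other 1-D parameters, which may or may not be what you want.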

Solution 3

In fact, we usually do not regularize bias terms (intercepts). So, I go for:

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels)) +
    0.01*tf.nn.l2_loss(hidden_weights) +
    0.01*tf.nn.l2_loss(out_weights))

If the intercept were penalized, the fit would depend on the origin chosen for the y values: adding a constant c to every y would no longer simply shift the intercept (and the predictions) by c. In practice, regularizing the biases usually changes the results very little and only adds computation.
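
One way to see why the intercept is usually left unpenalized is a toy intercept-only least-squares model. This model is an assumption chosen for illustration and is not the network from the question. Minimizing sum((y - b)**2) + lam * b**2 over b gives b = sum(y) / (n + lam), so the sketch can use the closed form directly.

```python
def fit_intercept(targets, lam=0.0):
    """Minimise sum((y - b)**2) + lam * b**2 over the intercept b.

    Setting the derivative to zero gives b = sum(y) / (n + lam).
    """
    return sum(targets) / (len(targets) + lam)


y = [1.0, 2.0, 3.0]
c = 10.0
y_shifted = [v + c for v in y]

# Unpenalised: the intercept moves by exactly c.
plain_shift = fit_intercept(y_shifted) - fit_intercept(y)

# Penalised (lam = 1): the intercept moves by n * c / (n + lam), less than c.
penalised_shift = fit_intercept(y_shifted, lam=1.0) - fit_intercept(y, lam=1.0)

print(plain_shift, penalised_shift)
```

With the penalty switched on, shifting all targets by c no longer shifts the fitted intercept by c, so the result depends on where the origin of y happens to be.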

Author: Maksim Khaitovich

Analytics Manager at Kearney. Applying AI and ML to business problems and managerial decision making. My main areas of interest are artificial neural networks, data science, decision science, and operations research.

Updated on March 20, 2020

Comments

  • Maksim Khaitovich
    Maksim Khaitovich about 4 years

    I am playing with an ANN that is part of the Udacity Deep Learning course.

    I have an assignment which involves introducing regularization to a network with one hidden ReLU layer using L2 loss. I wonder how to properly introduce it so that ALL weights are penalized, not only the weights of the output layer.

    Code for the network without regularization is at the bottom of the post (the code that actually runs the training is out of the scope of the question).

    The obvious way of introducing L2 is to replace the loss calculation with something like this (if beta is 0.01):

    loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_train_labels) + 0.01*tf.nn.l2_loss(out_weights))
    

    But in that case it only takes into account the values of the output layer's weights. I am not sure how to properly penalize the weights that come INTO the hidden ReLU layer. Is that needed at all, or will penalizing the output layer somehow keep the hidden weights in check as well?

    #some importing
    from __future__ import print_function
    import numpy as np
    import tensorflow as tf
    from six.moves import cPickle as pickle
    from six.moves import range
    
    #loading data
    pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle'
    
    with open(pickle_file, 'rb') as f:
      save = pickle.load(f)
      train_dataset = save['train_dataset']
      train_labels = save['train_labels']
      valid_dataset = save['valid_dataset']
      valid_labels = save['valid_labels']
      test_dataset = save['test_dataset']
      test_labels = save['test_labels']
      del save  # hint to help gc free up memory
      print('Training set', train_dataset.shape, train_labels.shape)
      print('Validation set', valid_dataset.shape, valid_labels.shape)
      print('Test set', test_dataset.shape, test_labels.shape)
    
    
    #prepare data to have right format for tensorflow
    #i.e. data is flat matrix, labels are onehot
    
    image_size = 28
    num_labels = 10
    
    def reformat(dataset, labels):
      dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
      # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
      labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
      return dataset, labels
    train_dataset, train_labels = reformat(train_dataset, train_labels)
    valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
    test_dataset, test_labels = reformat(test_dataset, test_labels)
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)
    
    
    #now is the interesting part - we are building a network with
    #one hidden ReLU layer and our usual output linear layer
    
    #we are going to use SGD so here is our size of batch
    batch_size = 128
    
    #building tensorflow graph
    graph = tf.Graph()
    with graph.as_default():
      # Input data. For the training data, we use a placeholder that will be fed
      # at run time with a training minibatch.
      tf_train_dataset = tf.placeholder(tf.float32,
                                        shape=(batch_size, image_size * image_size))
      tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
      tf_valid_dataset = tf.constant(valid_dataset)
      tf_test_dataset = tf.constant(test_dataset)
    
      #now let's build our new hidden layer
      #that's how many hidden neurons we want
      num_hidden_neurons = 1024
      #its weights
      hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
      hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))
    
      #now the layer itself. It multiplies data by weights, adds biases
      #and takes ReLU over result
      hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)
    
      #time to go for output linear layer
      #out weights connect hidden neurons to output labels
      #biases are added to output labels  
      out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))  
    
      out_biases = tf.Variable(tf.zeros([num_labels]))  
    
      #compute output  
      out_layer = tf.matmul(hidden_layer,out_weights) + out_biases
      #our real output is a softmax of prior result
      #and we also compute its cross-entropy to get our loss
      loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_train_labels))
    
      #now we just minimize this loss to actually train the network
      optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    
      #nice, now let's calculate the predictions on each dataset for evaluating the
      #performance so far
      # Predictions for the training, validation, and test data.
      train_prediction = tf.nn.softmax(out_layer)
      valid_relu = tf.nn.relu(  tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
      valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases) 
    
      test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
      test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)
    
  • GoingMyWay
    GoingMyWay over 7 years
    Hi, why should we add L2 regularization to the biases? I think there is no need to add L2 regularization to the bias terms.
  • Keith Johnson
    Keith Johnson over 7 years
    You shouldn't regularize the biases, only the weights.
  • johndodo
    johndodo over 7 years
    @AlexanderYau: you are correct: "...For these reasons we don't usually include bias terms when regularizing" (see here)
  • stolsvik
    stolsvik almost 7 years
    Notice that the list comprehension that filters out the biases depends on the actual /name/ of the tf variable, so if you haven't named it something with "bias" in it, the example won't filter it out.
  • PhABC
    PhABC almost 7 years
    Absolutely! Which is why I specified that "This can be used to exclude other variables". It's good to point it out however, thank you.
  • Swair
    Swair almost 7 years
    why do you use reduce_mean? Isn't the output of l2_loss supposed to be a scalar?
  • SpaceMonkey
    SpaceMonkey almost 6 years
    why aren't you dividing by the number of samples??
  • mrgloom
    mrgloom over 4 years
    @Keith Johnson Can you give an explanation?