TensorFlow - loss starts high and does not decrease


There's nothing wrong with 0.69 nats of entropy per sample as a starting point for a binary classification.

If you convert to base 2, 0.69/log(2), you'll see that it's almost exactly 1 bit per sample, which is exactly what you would expect if you're unsure about a binary classification.
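To see the arithmetic, and where the ~69 starting loss in the question below comes from with a summed loss over batches of 100, here's a quick check:

    import math

    # ln(2) nats is exactly 1 bit: the uncertainty of a fair coin flip.
    print(math.log(2))         # 0.6931...
    print(0.69 / math.log(2))  # ~0.995 bits per sample

    # Summed over a batch of 100 samples, that's ~69.3 -- which matches
    # the starting loss reported in the question.
    print(100 * math.log(2))   # 69.31...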

I usually use the mean loss instead of the sum so things are less sensitive to batch size.

You should also not calculate the cross-entropy directly yourself, because that method is numerically unstable (tf.log blows up as the predicted probability approaches 0). You probably want tf.nn.sigmoid_cross_entropy_with_logits.

I also like starting with the Adam Optimizer instead of pure gradient descent.
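Putting those three suggestions together, I'd sketch the loss/training part roughly like this (this reuses the question's y3, W, b, and y_ tensors from the code below, drops the softmax so the loss op receives raw logits, and the 1e-3 learning rate is just a common default, not a tuned value):

    import tensorflow as tf

    # Output layer now produces raw logits -- no tf.nn.softmax here,
    # because the loss op below applies its own stable normalization.
    logits = tf.matmul(y3, W) + b

    # Numerically stable cross-entropy computed from logits, averaged
    # over the batch so the value doesn't scale with batch size.
    cross_entropy = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))

    # Adam instead of plain gradient descent.
    train_operation = tf.train.AdamOptimizer(1e-3).minimize(cross_entropy)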

Here are two reasons you might be having some trouble with this problem:

1) Character codes are ordered, but the order doesn't mean anything. Your inputs would be much easier for the network to digest if you encoded them as one-hot vectors, so your input would be a 26x30 = 780 element vector. Without that, the network has to waste a bunch of capacity learning the boundaries between letters. For example:
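Here's a hypothetical helper showing that encoding (name_to_onehot is not a library function; it assumes names are lowercased a-z, and padding positions stay all zeros):

    import numpy as np

    # Encode a name as a flat 30x26 = 780 element one-hot vector.
    def name_to_onehot(name, max_len=30, alphabet=26):
        vec = np.zeros((max_len, alphabet), dtype=np.float32)
        for i, ch in enumerate(name.lower()[:max_len]):
            j = ord(ch) - ord('a')
            if 0 <= j < alphabet:
                vec[i, j] = 1.0
        return vec.reshape(-1)

    print(name_to_onehot("Maria").shape)  # (780,)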

2) You've only got fully connected layers. This makes it impossible for the network to learn a fact independently of its absolute position in the name. 6 of the top 10 girls' names in 2015 ended in 'a', while 0 of the top 10 boys' names did. As currently written, your network needs to re-learn "usually it's a girl's name if it ends in 'a'" independently for each name length. Using some convolution layers would allow it to learn facts once across all name lengths. A rough sketch:
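This is only illustrative (the filter count and kernel size are arbitrary), and it assumes the 780-element one-hot input from point 1 reshaped back to [batch, 30 positions, 26 channels]:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 780])
    x_seq = tf.reshape(x, [-1, 30, 26])

    # Each filter slides over every character position, so a pattern
    # like "ends in 'a'" only has to be learned once.
    conv = tf.layers.conv1d(x_seq, filters=16, kernel_size=3,
                            padding='same', activation=tf.nn.relu)

    pooled = tf.reduce_max(conv, axis=1)  # max over the 30 positions
    logits = tf.layers.dense(pooled, 2)   # feed into a logits-based loss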

Comments

  • J Polack almost 2 years

    I started writing neural networks with TensorFlow, and there is one problem I seem to face in each of my example projects.

    My loss always starts at something like 50 or higher and does not decrease, or if it does, it does so slowly that after all my epochs I do not even get near an acceptable loss rate.

    Things I already tried (that did not affect the result very much):

    • checked for overfitting, but in the following example you can see that I have 15,000 training and 15,000 testing datasets and something like 900 neurons
    • tested different optimizers and optimizer values
    • tried increasing the training data by using the test data as training data as well
    • tried increasing and decreasing the batch size

    I created the network based on the knowledge from https://youtu.be/vq2nnJ4g6N0

    But let us have a look at one of my test projects:

    I have a list of names and want to predict the gender, so my raw data looks like this:

    names=["Maria","Paul","Emilia",...]
    
    genders=["f","m","f",...]
    

    To feed them into the network, I transform the names into arrays of character codes (padded to a maximum length of 30) and the genders into bit arrays:

    names=[[77.,97. ,114.,105.,97. ,0. ,0.,...],
           [80.,97. ,117.,108.,0.  ,0. ,0.,...],
           [69.,109.,105.,108.,105.,97.,0.,...]]

    genders=[[1.,0.],
             [0.,1.],
             [1.,0.]]
    

    I built the network with 3 hidden layers with weight shapes [30,20], [20,10], and [10,10], plus [10,2] for the output layer. All hidden layers use ReLU as the activation function. The output layer uses a softmax.

    # Input Layer
    x = tf.placeholder(tf.float32, shape=[None, 30])
    y_ = tf.placeholder(tf.float32, shape=[None, 2])
    
    # Hidden Layers
    # H1
    W1 = tf.Variable(tf.truncated_normal([30, 20], stddev=0.1))
    b1 = tf.Variable(tf.zeros([20]))
    y1 = tf.nn.relu(tf.matmul(x, W1) + b1)
    
    # H2
    W2 = tf.Variable(tf.truncated_normal([20, 10], stddev=0.1))
    b2 = tf.Variable(tf.zeros([10]))
    y2 = tf.nn.relu(tf.matmul(y1, W2) + b2)
    
    # H3
    W3 = tf.Variable(tf.truncated_normal([10, 10], stddev=0.1))
    b3 = tf.Variable(tf.zeros([10]))
    y3 = tf.nn.relu(tf.matmul(y2, W3) + b3)
    
    # Output Layer
    W = tf.Variable(tf.truncated_normal([10, 2], stddev=0.1))
    b = tf.Variable(tf.zeros([2]))
    y = tf.nn.softmax(tf.matmul(y3, W) + b)
    

    Now the calculation for the loss, accuracy and the training operation:

    # Loss
    cross_entropy = -tf.reduce_sum(y_*tf.log(y))
    
    # Accuracy
    is_correct = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
    
    # Training
    train_operation = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
    

    I train the network in batches of 100:

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    for i in range(150):
        bs = 100
        index = i*bs
        inputBatch = inputData[index:index+bs]
        outputBatch = outputData[index:index+bs]
    
        sess.run(train_operation, feed_dict={x: inputBatch, y_: outputBatch})
        accuracyTrain, lossTrain = sess.run([accuracy, cross_entropy], feed_dict={x: inputBatch, y_: outputBatch})
    
        if i%(bs/10) == 0:
            print("step %d loss %.2f accuracy %.2f" % (i, lossTrain, accuracyTrain))
    

    And I get the following result:

    step 0 loss 68.96 accuracy 0.55
    step 10 loss 69.32 accuracy 0.50
    step 20 loss 69.31 accuracy 0.50
    step 30 loss 69.31 accuracy 0.50
    step 40 loss 69.29 accuracy 0.51
    step 50 loss 69.90 accuracy 0.53
    step 60 loss 68.92 accuracy 0.55
    step 70 loss 68.99 accuracy 0.55
    step 80 loss 69.49 accuracy 0.49
    step 90 loss 69.25 accuracy 0.52
    step 100 loss 69.39 accuracy 0.49
    step 110 loss 69.32 accuracy 0.47
    step 120 loss 67.17 accuracy 0.61
    step 130 loss 69.34 accuracy 0.50
    step 140 loss 69.33 accuracy 0.47
    


    What am I doing wrong?

    Why does it start at ~69 in my project and not lower?


    Thank you very much guys!

  • J Polack over 7 years
    Tried it with one-hot vectors, cross_entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits, y)), and train_operation = tf.train.AdamOptimizer(0.0003).minimize(cross_entropy), because I did not feel ready for convolutional networks yet. The loss drops now, but the accuracy stays the same. Might this go away with a convolutional network?
  • mdaoust over 7 years
    I don't know. The fully connected layers don't represent the fact that the input is a sequence. Any sort of bi-gram model would also reflect that. Have you considered passing the bi-gram counts as inputs? Your input size would be (26+2)**2 when you include the "start" and "end" symbols. Make sure to regularize. (See the sketch below.)
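A minimal sketch of the bi-gram count featurization described in the last comment, assuming lowercase a-z plus '^' start and '$' end markers (the bigram_counts helper is hypothetical; it produces a (26+2)**2 = 784 element count vector per name):

    import numpy as np

    # Assumed alphabet: start marker, end marker, then a-z (28 symbols).
    SYMBOLS = '^$abcdefghijklmnopqrstuvwxyz'
    INDEX = {c: i for i, c in enumerate(SYMBOLS)}
    N = len(SYMBOLS)  # 28, so each name becomes a 28**2 = 784 element vector

    def bigram_counts(name):
        s = '^' + name.lower() + '$'
        counts = np.zeros(N * N, dtype=np.float32)
        for a, b in zip(s, s[1:]):
            if a in INDEX and b in INDEX:
                counts[INDEX[a] * N + INDEX[b]] += 1.0
        return counts

    print(int(bigram_counts("Maria").sum()))  # 6 bigrams in '^maria$'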