Different loss functions for backpropagation


This was a very broad question, but I can shed some light on the error/cost function part.

Cost functions

There are many different cost functions that can be applied when working with neural networks; none of them are neural-network-specific. The most common cost functions in NNs are probably the Mean Squared Error (MSE) and the cross-entropy cost function. The latter is often the most appropriate when working with logistic or softmax output layers. The MSE cost function, on the other hand, is convenient since it does not require the output values to be in the range [0, 1].
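
To make the distinction concrete, here is a minimal NumPy sketch of the two cost functions; the function names and example arrays are illustrative assumptions, not part of the original answer:

```python
import numpy as np

def mse(targets, outputs):
    """Mean Squared Error: works for unbounded (e.g. linear) outputs."""
    return np.mean((targets - outputs) ** 2)

def cross_entropy(targets, outputs, eps=1e-12):
    """Binary cross entropy: assumes outputs in (0, 1), e.g. from a sigmoid."""
    outputs = np.clip(outputs, eps, 1 - eps)  # avoid log(0)
    return -np.mean(targets * np.log(outputs)
                    + (1 - targets) * np.log(1 - outputs))

targets = np.array([1.0, 0.0, 1.0])
outputs = np.array([0.9, 0.2, 0.7])
print(mse(targets, outputs))            # ~0.047
print(cross_entropy(targets, outputs))  # ~0.228
```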

The different cost functions exhibit different convergence properties and have their own pros and cons. You'll have to read up on the ones that interest you.

List of cost functions

Danielle Ensign has compiled a short, nice list of cost functions over at CrossValidated.


Sidenote

You have confused the derivative of the squared error function. The equation you've defined as the derivative of the error function is actually the derivative of the error function times the derivative of your output layer's activation function. This multiplication calculates the delta of the output layer.

The squared error function and its derivative are defined as:
$$E = \tfrac{1}{2}(t - y)^2 \qquad \frac{\partial E}{\partial y} = y - t$$

where $t$ is the target value and $y$ is the network's output.

While the sigmoid activation function and its derivative are defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad \sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$

The delta of the output layer is defined as:
$$\delta = \frac{\partial E}{\partial y}\,\sigma'(z) = (y - t)\,y\,(1 - y)$$

This structure, the derivative of the cost function multiplied by the derivative of the activation function, holds for any cost function.
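
For illustration, here is a tiny numeric sketch of the chain rule above; the scalar values are made up, and the variable names follow the equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = 0.5                # pre-activation of a single output neuron
y = sigmoid(z)         # output, ~0.622
t = 1.0                # target value

dE_dy = y - t          # derivative of squared error E = 1/2 (t - y)^2
dy_dz = y * (1.0 - y)  # derivative of the sigmoid at z
delta = dE_dy * dy_dz  # delta of the output layer
print(delta)           # ~-0.089, i.e. -(t - y) * y * (1 - y)
```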


Comments

  • Rottjung almost 2 years

    I came across some different error calculation functions for backpropagation: the squared error function from http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

    $$E_{total} = \sum \tfrac{1}{2}(target - output)^2$$

    or this nice explanation of the derivation of the BP loss function:

    Error = Output(i) * (1 - Output(i)) * (Target(i) - Output(i))
    

    Now I'm wondering how many more there are, and what effect the choice of loss function has on training.

    Also, since I understand that the second example uses the derivative of the activation function used by the layer, does the first one also do this in some way? And would that be true for any loss function (if there are more)?

    Finally, how do I know which one to use, and when?

  • Rottjung about 8 years
    Yes, I was just about to change my question as I realised the first one was only partial and also got multiplied by the activation function's derivative later on :-) But you answered my question perfectly, telling me there is also a derivative of the loss function in use! Only now I wonder why, when and where to use it...
  • jorgenkg about 8 years
    The final three lines of my answer touch upon when and where. You multiply the derivative of the cost function by the derivative of the activation function in the output layer in order to calculate the delta of the output layer.