In which cases is the cross-entropy preferred over the mean squared error?


Solution 1

Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression. This follows directly from the statement of the problems themselves: in classification you work with a very particular set of possible output values, so MSE is badly suited (it does not encode this knowledge and therefore penalizes errors in an incompatible way). To better understand the phenomenon, it helps to follow the relations between

  1. cross entropy
  2. logistic regression (binary cross entropy)
  3. linear regression (MSE)

You will notice that both can be seen as maximum likelihood estimators, simply with different assumptions about the dependent variable.
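To see this concretely, here is a minimal numeric sketch (using NumPy and SciPy; arrays like `y_true` and `p` are made-up illustrations, not from the original answer): the MSE equals, up to constants, the negative log-likelihood of a Gaussian, and binary cross-entropy equals the negative log-likelihood of a Bernoulli exactly.

```python
import numpy as np
from scipy import stats

# Regression: MSE vs. Gaussian negative log-likelihood
y_true = np.array([1.2, 0.7, -0.3])
y_pred = np.array([1.0, 0.9, 0.1])
sigma = 1.0

mse = np.mean((y_true - y_pred) ** 2)
gauss_nll = -stats.norm(loc=y_pred, scale=sigma).logpdf(y_true).mean()
# gauss_nll == 0.5 * mse + const (for sigma = 1), so minimizing one minimizes the other
print(mse, gauss_nll)

# Classification: binary cross-entropy vs. Bernoulli negative log-likelihood
t = np.array([1, 0, 1])          # true labels
p = np.array([0.8, 0.3, 0.6])    # predicted probabilities

bce = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
bern_nll = -stats.bernoulli(p).logpmf(t).mean()
print(bce, bern_nll)             # identical up to floating-point error
```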

Solution 2

When you derive the cost function from a probabilistic standpoint, you can observe that MSE appears when you assume the error follows a normal distribution, and cross-entropy when you assume a Bernoulli (binomial) distribution. It means that implicitly, when you use MSE you are doing regression (estimation), and when you use CE you are doing classification. Hope it helps a little bit.
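As a rough sketch of that derivation (assuming a Gaussian noise model for regression and a Bernoulli model for binary classification; the symbols below are illustrative, not from the original answer), minimizing the negative log-likelihood yields MSE in the first case and binary cross-entropy in the second:

```latex
% Regression: y_i = \hat{y}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
% The negative log-likelihood is, up to constants, the mean squared error:
-\log p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{2\sigma^2} \sum_i (y_i - \hat{y}_i)^2 + \mathrm{const}

% Classification: y_i \sim \mathrm{Bernoulli}(\hat{y}_i)
% The negative log-likelihood is exactly the binary cross-entropy:
-\log p(\mathbf{y} \mid \mathbf{x})
  = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]
```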

Solution 3

If you do logistic regression, for example, you will use the sigmoid function to estimate the probability, cross-entropy as the loss function, and gradient descent to minimize it. Doing the same but using MSE as the loss function can lead to a non-convex problem where you might get stuck in local minima. Using cross-entropy leads to a convex problem where you can find the optimal solution.
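To make the convexity claim concrete, here is a small sketch (a toy, single-example illustration, not a general proof; the variable names are just illustrative): evaluate both losses along a one-dimensional slice of the weight and check the sign of the numerical second derivative. The squared error on top of a sigmoid saturates on both sides, so its curvature changes sign, while the cross-entropy loss stays convex:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = 2.0, 1.0                      # one training example, positive class
w = np.linspace(-6, 6, 401)          # 1-D slice through weight space
p = sigmoid(w * x)

mse_loss = (p - t) ** 2
ce_loss = -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Numerical second derivative: negative values indicate non-convex regions.
def second_diff(f, h):
    return (f[2:] - 2 * f[1:-1] + f[:-2]) / h ** 2

h = w[1] - w[0]
print("MSE curvature changes sign:", (second_diff(mse_loss, h) < 0).any())  # True
print("CE  curvature changes sign:", (second_diff(ce_loss, h) < 0).any())   # False
```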

https://www.youtube.com/watch?v=rtD0RvfBJqQ&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=35

There is also an interesting analysis here: https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/

Author by

Amogh Mishra

Updated on January 06, 2020

Comments

  • Amogh Mishra
    Amogh Mishra over 4 years

    Although both methods score predictions by how close they are to the target, cross-entropy is still preferred. Is that the case in every situation, or are there particular scenarios where we prefer cross-entropy over MSE?

  • yuefengz
    yuefengz over 7 years
    Could you please elaborate more on "assumptions about the dependent variable"?
  • lejlot
    lejlot over 6 years
    @Fake - as Duc pointed out in the separate answer, logistic regression assumes a binomial distribution (or multinomial in the generalised case of cross-entropy and softmax) of the dependent variable, while linear regression assumes that it is a linear function of the variables plus IID noise sampled from a zero-mean Gaussian with fixed variance.
  • Paul
    Paul over 5 years
    The youtube link no longer works.
  • akshit bhatia
    akshit bhatia about 5 years
    Say we have two probability distribution vectors: actual [0.3, 0.5, 0.1, 0.1] and predicted [0.4, 0.2, 0.3, 0.1]. Now if we use MSE to determine our loss, why would this be a worse choice than KL divergence? What features are missed when we apply MSE to such data?
  • Kunyu Shi
    Kunyu Shi almost 5 years
    Could you show how the Gaussian leads to MSE and the binomial leads to cross-entropy?
  • A_P
    A_P over 4 years
    @KunyuShi Look at the PDF/PMF of the normal and Bernoulli distributions. If we take their negative log (which we generally do, to simplify the loss function) we get MSE and binary cross-entropy, respectively.
  • SomethingSomething
    SomethingSomething almost 4 years
    I once trained a single output neuron using MSE-loss to output 0 or 1 [for negative and positive classes]. The result was that all the outputs were at the extremes - you couldn't pick a threshold. Using two neurons with CE loss got me a much smoother result, so I could pick a threshold. Probably BCE is what you want to use if you stay with a single neuron.