In which cases is the cross-entropy preferred over the mean squared error?


Solution 1

Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression. This follows directly from the statement of the problems themselves: in classification you work with a very particular set of possible output values, so MSE is badly suited (it does not encode this knowledge and therefore penalizes errors in an incompatible way). To better understand the phenomenon, it helps to follow the relations between

  1. cross entropy
  2. logistic regression (binary cross entropy)
  3. linear regression (MSE)

You will notice that both can be seen as maximum likelihood estimators, simply with different assumptions about the dependent variable.
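To see this concretely, here is a minimal numeric sketch (using NumPy and SciPy; arrays like `y_true` and `p` are made-up illustrations, not from the original answer): the MSE equals, up to constants, the negative log-likelihood of a Gaussian, and binary cross-entropy equals the negative log-likelihood of a Bernoulli exactly.

```python
import numpy as np
from scipy import stats

# Regression: MSE vs. Gaussian negative log-likelihood
y_true = np.array([1.2, 0.7, -0.3])
y_pred = np.array([1.0, 0.9, 0.1])
sigma = 1.0

mse = np.mean((y_true - y_pred) ** 2)
gauss_nll = -stats.norm(loc=y_pred, scale=sigma).logpdf(y_true).mean()
# gauss_nll == 0.5 * mse + const (for sigma = 1), so minimizing one minimizes the other
print(mse, gauss_nll)

# Classification: binary cross-entropy vs. Bernoulli negative log-likelihood
t = np.array([1, 0, 1])          # true labels
p = np.array([0.8, 0.3, 0.6])    # predicted probabilities

bce = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
bern_nll = -stats.bernoulli(p).logpmf(t).mean()
print(bce, bern_nll)             # identical up to floating-point error
```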

Solution 2

When you derive the cost function from a probabilistic standpoint, you can observe that MSE appears when you assume the error follows a normal distribution, and cross-entropy when you assume a Bernoulli (binomial) distribution. It means that implicitly, when you use MSE you are doing regression (estimation), and when you use CE you are doing classification. Hope it helps a little bit.
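As a rough sketch of that derivation (assuming a Gaussian noise model for regression and a Bernoulli model for binary classification; the symbols below are illustrative, not from the original answer), minimizing the negative log-likelihood yields MSE in the first case and binary cross-entropy in the second:

```latex
% Regression: y_i = \hat{y}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
% The negative log-likelihood is, up to constants, the mean squared error:
-\log p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{2\sigma^2} \sum_i (y_i - \hat{y}_i)^2 + \mathrm{const}

% Classification: y_i \sim \mathrm{Bernoulli}(\hat{y}_i)
% The negative log-likelihood is exactly the binary cross-entropy:
-\log p(\mathbf{y} \mid \mathbf{x})
  = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]
```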

Solution 3

If you do logistic regression, for example, you will use the sigmoid function to estimate the probability, cross-entropy as the loss function, and gradient descent to minimize it. Doing the same but using MSE as the loss function can lead to a non-convex problem where you might get stuck in local minima. Using cross-entropy leads to a convex problem where you can find the optimal solution.
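To make the convexity claim concrete, here is a small sketch (a toy, single-example illustration, not a general proof; the variable names are just illustrative): evaluate both losses along a one-dimensional slice of the weight and check the sign of the numerical second derivative. The squared error on top of a sigmoid saturates on both sides, so its curvature changes sign, while the cross-entropy loss stays convex:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = 2.0, 1.0                      # one training example, positive class
w = np.linspace(-6, 6, 401)          # 1-D slice through weight space
p = sigmoid(w * x)

mse_loss = (p - t) ** 2
ce_loss = -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Numerical second derivative: negative values indicate non-convex regions.
def second_diff(f, h):
    return (f[2:] - 2 * f[1:-1] + f[:-2]) / h ** 2

h = w[1] - w[0]
print("MSE curvature changes sign:", (second_diff(mse_loss, h) < 0).any())  # True
print("CE  curvature changes sign:", (second_diff(ce_loss, h) < 0).any())   # False
```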

https://www.youtube.com/watch?v=rtD0RvfBJqQ&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=35

There is also an interesting analysis here: https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/

Author by

Amogh Mishra

Updated on January 06, 2020

Comments

  • Amogh Mishra
    Amogh Mishra over 4 years

    Although both methods score predictions by how close they are to the target, cross-entropy is still preferred. Is that the case in every situation, or are there particular scenarios where we prefer cross-entropy over MSE?

  • yuefengz
    yuefengz over 7 years
    Could you please elaborate more on "assumptions about the dependent variable"?
  • lejlot
    lejlot over 6 years
    @Fake - as Duc pointed out in the separate answer, logistic regression assumes a binomial distribution (or multinomial in the generalised case of cross-entropy and softmax) of the dependent variable, while linear regression assumes that it is a linear function of the variables plus IID noise sampled from a zero-mean Gaussian with fixed variance.
  • Paul
    Paul over 5 years
    The youtube link no longer works.
  • akshit bhatia
    akshit bhatia about 5 years
    Say we have two probability distribution vectors: actual [0.3, 0.5, 0.1, 0.1] and predicted [0.4, 0.2, 0.3, 0.1]. Now if we use MSE to determine our loss, why would this be a worse choice than KL divergence? What features are missed when we apply MSE to such data?
  • Kunyu Shi
    Kunyu Shi almost 5 years
    Could you show how the Gaussian leads to MSE and the binomial leads to cross-entropy?
  • A_P
    A_P over 4 years
    @KunyuShi Look at the PDF/PMF of the normal and Bernoulli distributions. If we take their negative log (which we generally do, to simplify the loss function) we get MSE and binary cross-entropy, respectively.
  • SomethingSomething
    SomethingSomething almost 4 years
    I once trained a single output neuron using MSE-loss to output 0 or 1 [for negative and positive classes]. The result was that all the outputs were at the extremes - you couldn't pick a threshold. Using two neurons with CE loss got me a much smoother result, so I could pick a threshold. Probably BCE is what you want to use if you stay with a single neuron.