Why do we use a fully-connected layer at the end of a CNN?

Solution 1

Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa), so it is not strictly necessary to add FC layers: they can always be replaced by convolutional layers plus a reshape.

Why do we use FC layers then?

Because (1) we are used to it and (2) it is simpler; (1) is probably the reason for (2). For example, you would need to adjust the loss function / the shape of the labels / add a reshape at the end if you used a convolutional layer instead of an FC layer.
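
Below is a minimal sketch of that equivalence (assuming PyTorch; the 7x7x512 feature map and 10 classes are illustrative, not from the answer): an FC head over the flattened feature map computes exactly the same function as a convolution whose kernel covers the whole feature map, once the weights are reshaped accordingly.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 7, 7)           # output of the conv/pool stack

fc = nn.Linear(512 * 7 * 7, 10)                 # classic FC head, 10 classes
conv = nn.Conv2d(512, 10, kernel_size=7)        # conv head covering the whole 7x7 map

# Copy the FC weights into the conv kernel so both compute the same function.
conv.weight.data = fc.weight.data.view(10, 512, 7, 7)
conv.bias.data = fc.bias.data

out_fc = fc(features.flatten(1))                # shape (1, 10)
out_conv = conv(features).flatten(1)            # shape (1, 10) after reshaping

print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True
```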

Solution 2

In theory you do not have to attach a fully connected layer; you could have a full stack of convolutions up to the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually the number of classes).

So why do people usually not do that? If one goes through the math, it becomes apparent that each output neuron (thus the prediction w.r.t. some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model that decides whether an image belongs to class 1 based only on the first few "columns" (or, depending on the architecture, rows, or some patch of the image), whether it belongs to class 2 based on the next few columns (maybe overlapping), ..., and finally to class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat based on the first few columns while ignoring the rest.

However, if you introduce a fully connected layer, you give your model the ability to mix signals: since every single neuron has a connection to every single neuron in the next layer, there is now a flow of information between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.

So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, and so is pooling; fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis: due to their local nature they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.

Note: I am considering here the typical use of CNNs, where kernels are small. In general one can even think of an MLP as a CNN where the kernel is the size of the whole input, with specific spacing/padding. However, these are just corner cases that are not really encountered in practice and do not really affect the reasoning, since then they end up being MLPs anyway. The whole point here is simple: we need to introduce global relations, and if one can do that by using CNNs in a specific manner, then MLPs are not needed. MLPs are just one way of introducing this dependence.
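
To illustrate the case of "using CNNs in a manner where the dependence is global", here is a minimal sketch (assuming PyTorch; the architecture and sizes are illustrative) of a classifier with no FC layer at all: a 1x1 convolution produces one score map per class and global average pooling collapses the spatial dimensions, so every pixel location contributes to every class score.

```python
import torch
import torch.nn as nn

fully_conv_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 10, kernel_size=1),            # 10 class-score maps
    nn.AdaptiveAvgPool2d(1),                     # global pooling: (N, 10, 1, 1)
    nn.Flatten(),                                # (N, 10) logits, no nn.Linear anywhere
)

logits = fully_conv_net(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```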

Solution 3

I found this answer by Anil-Sharma on Quora helpful.

We can divide the whole network (for classification) into two parts:

  • Feature extraction: With conventional classification algorithms, like SVMs, we used to extract features from the data by hand to make the classification work. The convolutional layers serve the same purpose of feature extraction. CNNs capture a better representation of the data, and hence we don't need to do manual feature engineering.

  • Classification: After feature extraction we need to classify the data into various classes; this can be done with a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like an SVM (see the sketch after this list), but we generally end up adding FC layers to make the model end-to-end trainable.
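
As a rough sketch of that two-part split (assuming PyTorch and scikit-learn; the backbone, sizes, and data below are placeholders), one can freeze the convolutional feature extractor and train a conventional classifier such as an SVM on its output; the trade-off, as the answer notes, is that this is not end-to-end trainable.

```python
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

backbone = nn.Sequential(                        # feature-extraction part
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

images = torch.randn(100, 3, 32, 32)             # placeholder data
labels = torch.randint(0, 10, (100,))            # placeholder labels, 10 classes

with torch.no_grad():
    feats = backbone(images).numpy()             # (100, 32) feature vectors

clf = LinearSVC().fit(feats, labels.numpy())     # classification part (SVM instead of FC)
print(clf.predict(feats[:5]))
```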

Solution 4

The CNN gives you a representation of the input image. To learn the sample classes, you need a classifier (such as logistic regression, an SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully connected layer is also a linear classifier, much like logistic regression, and it is used for exactly this reason.
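
A minimal sketch of that view (assuming PyTorch; the feature dimension and class count are illustrative): the FC head combined with a softmax cross-entropy loss is just multinomial logistic regression on the CNN's learned representation.

```python
import torch
import torch.nn as nn

features = torch.randn(8, 512)                  # representation produced by the CNN
labels = torch.randint(0, 10, (8,))             # placeholder class labels

fc_head = nn.Linear(512, 10)                    # linear classifier (the FC layer)
loss = nn.CrossEntropyLoss()(fc_head(features), labels)  # softmax-regression loss
loss.backward()                                 # trained jointly with the CNN in practice
```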

Solution 5

Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers then perform classification based on these extracted features.

Comments

  • Ahmed Salah almost 2 years

    I searched for the reason a lot but didn't get a clear answer. Could someone explain it in some more detail, please?

  • Martin Thoma about 7 years
    "If one goes through the math, it will become visible that each output neuron (thus - prediction wrt. to some class) depends only on the subset of the input dimensions (pixels)" - could you please expand that? By now, this statement is false for infinitely many CNN architectures.
  • Martin Thoma about 7 years
    "Convolutions are local operations" - not necessarily. You can make a convolutional layer which has the size of the feature map (without padding). Then the convolutional layer is a global operation.
  • lejlot about 7 years
    Yes, this is a simplification; they are typically local. Of course you can go to such an extreme that it degenerates back into an MLP, but then calling it a convolution is questionable.
  • lejlot about 7 years
    I added a note at the end to address this. Basically, this corner case does not change anything, since the whole point here is about local vs. global: if you use CNNs in such a way that the dependence is global, you will be fine without an MLP (as stated at the very beginning of the answer).
  • Martin Thoma about 7 years
    Ah, I think I understand now what you mean by "that each output neuron [...] depends only on the subset of the input dimensions". You're talking about only one single convolutional layer, not about the usual many convolutional / pooling layers, right?
  • lejlot about 7 years
    Even with multiple layers you do not necessarily get a dependence on all of the input dimensions; every layer increases this dependence (the effective receptive field), and at some point it can even cover the whole image. Once it does, you do not really need FC layers anymore (they can still be useful, of course, just not required).