Backpropagation for rectified linear unit activation with cross entropy error

Every squashing function in the output layer (sigmoid, tanh, softmax) pairs with a different cost function, so it makes sense that a ReLU in the output layer does not match the cross-entropy cost function. I would try a simple squared-error cost function to test a ReLU output layer.

The true power of ReLU is in the hidden layers of a deep net, since it does not suffer from the vanishing gradient problem.
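
To make the pairing concrete, here is a minimal sketch of the output-layer delta for each combination, reusing the variable names from the asker's code below (y, Y, z3, dact); it is illustrative only, not the asker's code:

    % Output-layer delta d3 for matched activation/cost pairs (sketch)
    d3_sigmoid_ce = y - Y;                       % sigmoid + binary cross entropy: dact cancels
    d3_softmax_ce = y - Y;                       % softmax + multiclass cross entropy: dact cancels
    d3_relu_se    = (y - Y) .* dact(z3,'relu');  % ReLU + squared error: dact stays in the chain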

Comments

  • Pr1mer
    Pr1mer almost 2 years

    I'm trying to implement gradient calculation for neural networks using backpropagation. I cannot get it to work with cross entropy error and rectified linear unit (ReLU) as activation.

    I managed to get my implementation working for squared error with sigmoid, tanh and ReLU activation functions. The cross-entropy (CE) error gradient with sigmoid activation is computed correctly. However, when I change the activation to ReLU, it fails. (I'm skipping tanh for CE as it returns values in the (-1,1) range.)

    Is it because of the behavior of the log function at values close to 0 (which ReLUs return approx. 50% of the time for normalized inputs)? I tried to mitigate that problem with:

    log(max(y,eps))
    

    but it only helped to bring the error and gradients back to real numbers; they are still different from the numerical gradient.

    I verify the results using a numerical gradient:

    num_grad = (f(W+epsilon) - f(W-epsilon)) / (2*epsilon)
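
    In full, such a check loops over every element of the flattened weight vector. A sketch of that loop (assuming W, X and Y are already in the workspace; eps_fd is an illustrative step size, and backprop is the function listed below, whose first output is the cost f):

    eps_fd = 1e-5;
    num_grad = zeros(size(W));
    for i = 1:numel(W)
        Wp = W; Wp(i) = Wp(i) + eps_fd;   % perturb one weight up
        Wm = W; Wm(i) = Wm(i) - eps_fd;   % and down
        num_grad(i) = (backprop(Wp, X, Y) - backprop(Wm, X, Y)) / (2*eps_fd);
    end
    % compare num_grad element-wise against the analytic df from [f, df] = backprop(W, X, Y)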
    

    The following MATLAB code presents a simplified and condensed version of the backpropagation implementation used in my experiments:

    function [f, df] = backprop(W, X, Y)
    % W - weights
    % X - input values
    % Y - target values
    
    act_type='relu';    % possible values: sigmoid / tanh / relu
    error_type = 'CE';  % possible values: SE / CE
    
    N=size(X,1); n_inp=size(X,2); n_hid=100; n_out=size(Y,2);
    w1=reshape(W(1:n_hid*(n_inp+1)),n_hid,n_inp+1);
    w2=reshape(W(n_hid*(n_inp+1)+1:end),n_out, n_hid+1);
    
    % feedforward
    X=[X ones(N,1)];
    z2=X*w1'; a2=act(z2,act_type); a2=[a2 ones(N,1)];
    z3=a2*w2'; y=act(z3,act_type);
    
    if strcmp(error_type, 'CE')   % cross entropy error - logistic cost function
        f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
    else % squared error
        f=0.5*sum(sum((y-Y).^2));
    end
    
    % backprop
    if strcmp(error_type, 'CE')   % cross entropy error
        d3=y-Y;
    else % squared error
        d3=(y-Y).*dact(z3,act_type);
    end
    
    df2=d3'*a2;
    d2=d3*w2(:,1:end-1).*dact(z2,act_type);
    df1=d2'*X;
    
    df=[df1(:);df2(:)];
    
    end
    
    function f=act(z,type) % activation function
    switch type
        case 'sigmoid'
            f=1./(1+exp(-z));
        case 'tanh'
            f=tanh(z);
        case 'relu'
            f=max(0,z);
    end
    end
    
    function df=dact(z,type) % derivative of activation function
    switch type
        case 'sigmoid'
            df=act(z,type).*(1-act(z,type));
        case 'tanh'
            df=1-act(z,type).^2;
        case 'relu'
            df=double(z>0);
    end
    end
    

    Edit

    After another round of experiments, I found out that using a softmax for the last layer:

    y=bsxfun(@rdivide, exp(z3), sum(exp(z3),2));
    

    and the softmax cost function:

    f=-sum(sum(Y.*log(y)));
    

    makes the implementation work for all activation functions, including ReLU.

    This leads me to the conclusion that it is the logistic cost function (binary classifier) that does not work with ReLU:

    f=-sum(sum( Y.*log(max(y,eps))+(1-Y).*log(max(1-y,eps)) ));
    

    However, I still cannot figure out where the problem lies.
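
    As a side note, the exp(z3) in the softmax line above can overflow for large activations; a numerically equivalent but more stable form (an illustrative sketch, not part of the code above) subtracts the row maximum first:

    z3s = bsxfun(@minus, z3, max(z3,[],2));            % shift each row so its max is 0
    y   = bsxfun(@rdivide, exp(z3s), sum(exp(z3s),2)); % same softmax, less overflow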

  • Pr1mer
    Pr1mer almost 10 years
    The derivative of the ReLU function is df=0 for input <= 0 and df=1 for input > 0, which in MATLAB is equivalent to double(z>0). d3 is the delta of the last layer and it is in the correct form. ReLU has advantages over the softplus function - check here for instance.
  • Pr1mer
    Pr1mer over 9 years
    I came to a similar conclusion after going through several papers on NNs. When I need classification, the output layer is composed of sigmoid (or softmax) units, while the other (hidden) layers remain composed of ReLUs.
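
    A minimal sketch of that configuration, reusing the act and dact helpers and the variable names from the backprop listing above (X already has the bias column appended; illustrative only, not the original code):

    % Mixed activations: ReLU hidden layer + sigmoid output layer (sketch)
    z2 = X*w1';  a2 = [act(z2,'relu') ones(N,1)];    % hidden layer: ReLU
    z3 = a2*w2'; y  = act(z3,'sigmoid');             % output layer: sigmoid
    d3 = y - Y;                                      % sigmoid + cross entropy shortcut holds
    d2 = d3*w2(:,1:end-1) .* dact(z2,'relu');        % ReLU derivative only in the hidden layer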