CS231n: How to calculate gradient for Softmax loss function?


Solution 1

Not sure if this helps, but:

The $y_i$ term is really the indicator function $1\{y_i = j\}$, as described here. This forms the expression (j == y[i]) in the code.

Also, the gradient of the loss with respect to the weights is:

$$\nabla_{w_j} L = \frac{1}{N} \sum_i x_i \left(p_{i,j} - 1\{y_i = j\}\right)$$

where

$$p_{i,j} = \frac{e^{f_j(x_i)}}{\sum_k e^{f_k(x_i)}}$$

which is the origin of the X[:, i] in the code.
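
To see how the formula lines up with the code, here is a minimal sketch, not the assignment's official solution; the function name per_example_grad and the shapes below are my own, using the same C x D weight / D-dimensional column layout as the question's code. It computes the per-example gradient once with the explicit loop from the question and once as an outer product, and checks that they match:

    import numpy as np

    def per_example_grad(W, x, y_label):
        """Gradient of the softmax loss w.r.t. W for a single example.

        W: (C, D) weights, x: (D,) data column, y_label: integer class index.
        """
        f = W.dot(x)                       # class scores, shape (C,)
        f -= f.max()                       # shift for numerical stability
        p = np.exp(f) / np.exp(f).sum()    # softmax probabilities, shape (C,)
        indicator = np.zeros_like(p)
        indicator[y_label] = 1.0           # 1{j == y_i}
        # each row j of the gradient is (p_j - 1{j == y_i}) * x
        return np.outer(p - indicator, x)  # shape (C, D), same as W

    # Check against the explicit loop from the question:
    rng = np.random.default_rng(0)
    C, D = 3, 5
    W = rng.standard_normal((C, D))
    x = rng.standard_normal(D)
    y_label = 1

    f = W.dot(x)
    f -= f.max()
    p = np.exp(f) / np.exp(f).sum()
    dW_loop = np.zeros_like(W)
    for j in range(C):
        dW_loop[j, :] += (p[j] - (j == y_label)) * x   # same update as dW[j, :] += ...

    assert np.allclose(dW_loop, per_example_grad(W, x, y_label))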

Solution 2

I know this is late but here's my answer:

I'm assuming you are familiar with the cs231n Softmax loss function. We know that:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = -f_{y_i} + \log\sum_j e^{f_j}$$

So, just as we did with the SVM loss function, the gradients are as follows:

$$\nabla_{w_j} L_i = \left(\frac{e^{f_j}}{\sum_k e^{f_k}} - 1\{j = y_i\}\right) x_i$$

Hope that helped.
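
If it helps to convince yourself of the gradient above, here is a small sketch (the names softmax_loss_single and analytic_grad_single are mine, not from the assignment) that compares the analytic gradient to a centered finite-difference approximation:

    import numpy as np

    def softmax_loss_single(W, x, y_label):
        """Softmax loss for one example. W: (C, D), x: (D,), y_label: int."""
        f = W.dot(x)
        f -= f.max()                        # numerical stability
        return -f[y_label] + np.log(np.exp(f).sum())

    def analytic_grad_single(W, x, y_label):
        f = W.dot(x)
        f -= f.max()
        p = np.exp(f) / np.exp(f).sum()
        p[y_label] -= 1.0                   # p_j - 1{j == y_i}
        return np.outer(p, x)

    rng = np.random.default_rng(1)
    W = rng.standard_normal((4, 6))
    x = rng.standard_normal(6)
    y_label = 2

    # Centered finite differences over every entry of W.
    h = 1e-5
    num_grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        num_grad[idx] = (softmax_loss_single(W_plus, x, y_label)
                         - softmax_loss_single(W_minus, x, y_label)) / (2 * h)

    print(np.max(np.abs(num_grad - analytic_grad_single(W, x, y_label))))  # ~1e-9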

Solution 3

A supplement to this answer, with a small worked example:

[Image: the gradient calculation worked out by hand for a small example; not reproduced here.]
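
In the same spirit, here is a tiny numeric sketch (the numbers and variable names are mine, chosen only for illustration) that works the loss and gradient out for one example:

    import numpy as np

    # 2 classes, 3 features, one training example (numbers chosen for illustration).
    W = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])       # shape (C=2, D=3)
    x = np.array([1.0, 2.0, 3.0])         # shape (D=3,)
    y_label = 0                           # correct class

    f = W.dot(x)                          # scores: [1.4, 3.2]
    f -= f.max()                          # stabilized: [-1.8, 0.0]
    p = np.exp(f) / np.exp(f).sum()       # probabilities: ~[0.142, 0.858]

    loss = -np.log(p[y_label])            # ~1.95
    p[y_label] -= 1.0                     # p - 1{j == y}: ~[-0.858, 0.858]
    dW = np.outer(p, x)                   # gradient w.r.t. W, shape (2, 3)

    print(loss)
    print(dW)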


Comments

  • Nghia Tran, about 3 years ago

    I am watching some videos for Stanford CS231n: Convolutional Neural Networks for Visual Recognition, but do not quite understand how to calculate the analytical gradient of the softmax loss function using numpy.

    From this stackexchange answer, the softmax gradient is calculated as:

    $$\nabla_{w_j} L = \frac{1}{N} \sum_i x_i \left(p_{i,j} - 1\{y_i = j\}\right), \qquad p_{i,j} = \frac{e^{f_j(x_i)}}{\sum_k e^{f_k(x_i)}}$$

    The Python implementation for the above is:

    num_classes = W.shape[0]
    num_train = X.shape[1]
    for i in range(num_train):
      for j in range(num_classes):
        p = np.exp(f_i[j])/sum_i
        dW[j, :] += (p-(j == y[i])) * X[:, i]
    

    Could anyone explain how the above snippet works? The detailed implementation of the softmax loss is also included below.

    import numpy as np

    def softmax_loss_naive(W, X, y, reg):
      """
      Softmax loss function, naive implementation (with loops)
      Inputs:
      - W: C x D array of weights
      - X: D x N array of data. Data are D-dimensional columns
      - y: 1-dimensional array of length N with labels 0...K-1, for K classes
      - reg: (float) regularization strength
      Returns:
      a tuple of:
      - loss as single float
      - gradient with respect to weights W, an array of same size as W
      """
      # Initialize the loss and gradient to zero.
      loss = 0.0
      dW = np.zeros_like(W)
    
      #############################################################################
      # Compute the softmax loss and its gradient using explicit loops.           #
      # Store the loss in loss and the gradient in dW. If you are not careful     #
      # here, it is easy to run into numeric instability. Don't forget the        #
      # regularization!                                                           #
      #############################################################################
    
      # Get shapes
      num_classes = W.shape[0]
      num_train = X.shape[1]
    
      for i in range(num_train):
        # Compute vector of scores
        f_i = W.dot(X[:, i]) # in R^{num_classes}
    
        # Normalization trick to avoid numerical instability, per http://cs231n.github.io/linear-classify/#softmax
        log_c = np.max(f_i)
        f_i -= log_c
    
        # Compute loss (and add to it, divided later)
        # L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
        sum_i = 0.0
        for f_i_j in f_i:
          sum_i += np.exp(f_i_j)
        loss += -f_i[y[i]] + np.log(sum_i)
    
        # Compute gradient
        # dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j)-Ind{y_i = j} )]
        # Here we are computing the contribution to the inner sum for a given i.
        for j in range(num_classes):
          p = np.exp(f_i[j])/sum_i
          dW[j, :] += (p-(j == y[i])) * X[:, i]
    
      # Compute average
      loss /= num_train
      dW /= num_train
    
      # Regularization
      loss += 0.5 * reg * np.sum(W * W)
      dW += reg*W
    
      return loss, dW
    
  • Nghia Tran, over 7 years ago
    Thanks for pointing that out. I didn't see it at first. In the question on stackexchange, they implicitly use y_j to denote the indicator function.
  • Awaldeep Singh, over 5 years ago
    And the value of the first term (dL/df) in the gradient is: y_pred - y.
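
A quick way to check the last comment, that dL/df is the softmax output minus the one-hot label (y_pred - y), is a numerical comparison; the following is a rough sketch with all names my own:

    import numpy as np

    def loss(f, y_label):
        """Softmax loss as a function of the scores f for one example."""
        f = f - f.max()
        return -f[y_label] + np.log(np.exp(f).sum())

    rng = np.random.default_rng(2)
    f = rng.standard_normal(5)
    y_label = 3

    p = np.exp(f - f.max()) / np.exp(f - f.max()).sum()   # y_pred
    onehot = np.eye(5)[y_label]                           # y
    analytic = p - onehot                                 # claimed dL/df

    # Centered finite differences over each score.
    h = 1e-5
    numeric = np.array([
        (loss(f + h * np.eye(5)[k], y_label) - loss(f - h * np.eye(5)[k], y_label)) / (2 * h)
        for k in range(5)
    ])

    print(np.max(np.abs(analytic - numeric)))  # ~1e-10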