PyTorch loss value does not change


Solution 1

I realised that the L2 regularization (the weight_decay term) in the Adam optimizer makes the loss value stay unchanged (I haven't tried other optimizers yet). It works when I remove it:

# optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=0.1)
optimizer = optim.Adam(model.parameters(), lr=0.001)

=== UPDATE (See Solution 2 for more detail!) ===

# split the model so that weight_decay can be applied to the classifier only
self.features = nn.Sequential(self.flat_layer)
self.classifier = nn.Linear(out_channels * len(filter_sizes), num_classes)

...

optimizer = optim.Adam([
    {'params': model.features.parameters()},                          # no weight decay on the conv branches
    {'params': model.classifier.parameters(), 'weight_decay': 0.1}    # L2 penalty only on the FC classifier
], lr=0.001)

Solution 2

I see that in your original code the weight_decay term is set to 0.1. weight_decay is used to regularize the network's parameters, and this value may be so large that the regularization is too strong. Try reducing the value of weight_decay.

For convolutional neural networks in computer vision tasks, the weight_decay term is usually set to 5e-4 or 5e-5. I am not familiar with text classification, so these values may work for you out of the box, or you may have to tweak them a little by trial and error.
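
For example, a minimal sketch (reusing the `net` from the question and changing only the penalty; the exact value is something you would have to tune):

import torch.optim as optim

# same setup as in the question, but with a much weaker L2 penalty than the original 0.1
optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=5e-4)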

Let me know if it works for you.



Author: Viet Phan

Updated on June 04, 2022

Comments

  • Viet Phan, almost 2 years ago

    I wrote a module based on this article: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

    The idea is to pass the input through multiple streams, concatenate the outputs, and connect them to a fully connected (FC) layer. I divided my source code into 3 custom modules: TextClassifyCnnNet >> FlatCnnLayer >> FilterLayer

    FilterLayer:

    class FilterLayer(nn.Module):
        def __init__(self, filter_size, embedding_size, sequence_length, out_channels=128):
            super(FilterLayer, self).__init__()
    
            self.model = nn.Sequential(
                nn.Conv2d(1, out_channels, (filter_size, embedding_size)),
                nn.ReLU(inplace=True),
                nn.MaxPool2d((sequence_length - filter_size + 1, 1), stride=1)
            )
    
            # He-style initialisation for the convolution weights
            for m in self.modules():
                if isinstance(m, nn.Conv2d):
                    n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                    m.weight.data.normal_(0, math.sqrt(2. / n))
    
        def forward(self, x):
            return self.model(x)
    

    FlatCnnLayer:

    class FlatCnnLayer(nn.Module):
        def __init__(self, embedding_size, sequence_length, filter_sizes=[3, 4, 5], out_channels=128):
            super(FlatCnnLayer, self).__init__()
    
            self.filter_layers = nn.ModuleList(
                [FilterLayer(filter_size, embedding_size, sequence_length, out_channels=out_channels) for
                 filter_size in filter_sizes])
    
        def forward(self, x):
            pools = []
            for filter_layer in self.filter_layers:
                out_filter = filter_layer(x)
                # after max pooling the spatial size is 1x1; reshape (batch_size, out_channels, 1, 1)
                # to (batch_size, 1, 1, out_channels) so the branches can be concatenated on the last dim
                pools.append(out_filter.view(out_filter.size()[0], 1, 1, -1))
            x = torch.cat(pools, dim=3)
    
            x = x.view(x.size()[0], -1)
            # use self.training so dropout is only active during training
            x = F.dropout(x, p=dropout_prob, training=self.training)
    
            return x
    

    TextClassifyCnnNet (main module):

    class TextClassifyCnnNet(nn.Module):
        def __init__(self, embedding_size, sequence_length, num_classes, filter_sizes=[3, 4, 5], out_channels=128):
            super(TextClassifyCnnNet, self).__init__()
    
            self.flat_layer = FlatCnnLayer(embedding_size, sequence_length, filter_sizes=filter_sizes,
                                           out_channels=out_channels)
    
            self.model = nn.Sequential(
                self.flat_layer,
                nn.Linear(out_channels * len(filter_sizes), num_classes)
            )
    
        def forward(self, x):
            x = self.model(x)
    
            return x
    
    
    def fit(net, data, save_path):
        if torch.cuda.is_available():
            net = net.cuda()
    
        for param in list(net.parameters()):
            print(type(param.data), param.size())
    
        optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=0.1)
    
        X_train, X_test = data['X_train'], data['X_test']
        Y_train, Y_test = data['Y_train'], data['Y_test']
    
        X_valid, Y_valid = data['X_valid'], data['Y_valid']
    
        n_batch = len(X_train) // batch_size
    
        for epoch in range(1, n_epochs + 1):  # loop over the dataset multiple times
            net.train()
            start = 0
            end = batch_size
    
            for batch_idx in range(1, n_batch + 1):
                # get the inputs
                x, y = X_train[start:end], Y_train[start:end]
                start = end
                end = start + batch_size
    
                # zero the parameter gradients
                optimizer.zero_grad()
    
                # forward + backward + optimize
                predicts = _get_predict(net, x)
                loss = _get_loss(predicts, y)
                loss.backward()
                optimizer.step()
    
                if batch_idx % display_step == 0:
                    print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                        epoch, batch_idx * len(x), len(X_train), 100. * batch_idx / (n_batch + 1), loss.data[0]))
    
            # print statistics
            if epoch % display_step == 0 or epoch == 1:
                net.eval()
                valid_predicts = _get_predict(net, X_valid)
                valid_loss = _get_loss(valid_predicts, Y_valid)
                valid_accuracy = _get_accuracy(valid_predicts, Y_valid)
                print('\r[%d] loss: %.3f - accuracy: %.2f' % (epoch, valid_loss.data[0], valid_accuracy * 100))
    
        print('\rFinished Training\n')
    
        net.eval()
    
        test_predicts = _get_predict(net, X_test)
        test_loss = _get_loss(test_predicts, Y_test).data[0]
        test_accuracy = _get_accuracy(test_predicts, Y_test)
        print('Test loss: %.3f - Test accuracy: %.2f' % (test_loss, test_accuracy * 100))
    
        torch.save(net.flat_layer.state_dict(), save_path)
    
    
    def _get_accuracy(predicts, labels):
        # arg-max class per sample, moved to numpy so it can be compared with the labels array
        predicts = torch.max(predicts, 1)[1].data.cpu().numpy()
        return np.mean(predicts == labels)
    
    
    def _get_predict(net, x):
        # wrap them in Variable
        inputs = torch.from_numpy(x).float()
        # convert to cuda tensors if cuda flag is true
        if torch.cuda.is_available():
            inputs = inputs.cuda()
        inputs = Variable(inputs)
        return net(inputs)
    
    
    def _get_loss(predicts, labels):
        labels = torch.from_numpy(labels).long()
        # convert to cuda tensors if cuda flag is true
        if torch.cuda.is_available():
            labels = labels.cuda()
        labels = Variable(labels)
        return F.cross_entropy(predicts, labels)
    

    It seems that the parameters are only updated slightly each epoch, and the accuracy stays the same throughout the whole process, while the same implementation with the same parameters in TensorFlow runs correctly.

    I'm new to PyTorch, so maybe my code has something wrong; please help me find it. Thank you!

    P.S.: I tried using F.nll_loss + F.log_softmax instead of F.cross_entropy. In theory it should return the same result, but in practice a different value is printed out (and it is still a wrong loss value).
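
    For reference, a minimal standalone check of that equivalence (on dummy tensors, independent of the model above): F.cross_entropy is log_softmax followed by nll_loss, so the two losses below should match.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 3)               # 4 samples, 3 classes
    targets = torch.tensor([0, 2, 1, 2])

    loss_a = F.cross_entropy(logits, targets)
    loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
    # loss_a and loss_b should agree up to floating-point error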

  • Viet Phan, over 6 years ago
    How can I set weight_decay for the fully connected layer only, or set a specific weight_decay for each type of layer?
  • jdhao, over 6 years ago
    This is easy to achieve in PyTorch. The optimizer accepts parameter groups, and in each parameter group you can set lr and weight_decay separately. See here for more info. Also, searching Google for "different learning rate for different layers in pytorch" will give you plenty of information. Another resource is the wonderful PyTorch forum. Make sure to search the forum before posting your question, as many questions have already been asked and have good answers.
  • jdhao, over 6 years ago
    @VietPhan Does decreasing the weight decay value work for you?
  • Viet Phan, over 6 years ago
    I disabled weight decay for the Conv2d layers and only used it on the FC layer. It works!
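
    As a minimal sketch of that kind of split (hypothetical helper code, assuming the `net` built from the modules above, not necessarily the exact code used): collect the Conv2d and Linear parameters into separate groups and give only the Linear group a weight_decay.

    import torch.nn as nn
    import torch.optim as optim

    conv_params, fc_params = [], []
    for module in net.modules():
        if isinstance(module, nn.Conv2d):
            conv_params += list(module.parameters())
        elif isinstance(module, nn.Linear):
            fc_params += list(module.parameters())

    optimizer = optim.Adam([
        {'params': conv_params, 'weight_decay': 0.0},   # no L2 penalty on the conv branches
        {'params': fc_params, 'weight_decay': 0.1},     # L2 penalty only on the FC layer
    ], lr=0.001)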