PyTorch loss value does not change


Solution 1

I realised that the L2 regularization (the weight_decay term) in the Adam optimizer makes the loss value stay unchanged (I haven't tried other optimizers yet). It works when I remove it:

# optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=0.1)
optimizer = optim.Adam(model.parameters(), lr=0.001)

=== UPDATE (See Solution 2 for more detail!) ===

# split the model so that weight_decay can be applied to the classifier only
self.features = nn.Sequential(self.flat_layer)
self.classifier = nn.Linear(out_channels * len(filter_sizes), num_classes)

...

optimizer = optim.Adam([
    {'params': model.features.parameters()},                          # no weight decay on the conv branches
    {'params': model.classifier.parameters(), 'weight_decay': 0.1}    # L2 penalty only on the FC classifier
], lr=0.001)

Solution 2

I see that in your original code the weight_decay term is set to 0.1. weight_decay is used to regularize the network's parameters, and this value may be so large that the regularization is too strong. Try reducing the value of weight_decay.

For convolutional neural networks in computer vision tasks, the weight_decay term is usually set to 5e-4 or 5e-5. I am not familiar with text classification, so these values may work for you out of the box, or you may have to tweak them a little by trial and error.
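
For example, a minimal sketch (reusing the `net` from the question and changing only the penalty; the exact value is something you would have to tune):

import torch.optim as optim

# same setup as in the question, but with a much weaker L2 penalty than the original 0.1
optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=5e-4)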

Let me know if it works for you.



Author: Viet Phan

Updated on June 04, 2022

Comments

  • Viet Phan, almost 2 years ago

    I wrote a module based on this article: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

    The idea is to pass the input through multiple streams, concatenate the outputs, and connect them to a fully connected (FC) layer. I divided my source code into 3 custom modules: TextClassifyCnnNet >> FlatCnnLayer >> FilterLayer

    FilterLayer:

    class FilterLayer(nn.Module):
        def __init__(self, filter_size, embedding_size, sequence_length, out_channels=128):
            super(FilterLayer, self).__init__()
    
            self.model = nn.Sequential(
                nn.Conv2d(1, out_channels, (filter_size, embedding_size)),
                nn.ReLU(inplace=True),
                nn.MaxPool2d((sequence_length - filter_size + 1, 1), stride=1)
            )
    
            # He-style initialisation for the convolution weights
            for m in self.modules():
                if isinstance(m, nn.Conv2d):
                    n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                    m.weight.data.normal_(0, math.sqrt(2. / n))
    
        def forward(self, x):
            return self.model(x)
    

    FlatCnnLayer:

    class FlatCnnLayer(nn.Module):
        def __init__(self, embedding_size, sequence_length, filter_sizes=[3, 4, 5], out_channels=128):
            super(FlatCnnLayer, self).__init__()
    
            self.filter_layers = nn.ModuleList(
                [FilterLayer(filter_size, embedding_size, sequence_length, out_channels=out_channels) for
                 filter_size in filter_sizes])
    
        def forward(self, x):
            pools = []
            for filter_layer in self.filter_layers:
                out_filter = filter_layer(x)
                # after max pooling the spatial size is 1x1; reshape (batch_size, out_channels, 1, 1)
                # to (batch_size, 1, 1, out_channels) so the branches can be concatenated on the last dim
                pools.append(out_filter.view(out_filter.size()[0], 1, 1, -1))
            x = torch.cat(pools, dim=3)
    
            x = x.view(x.size()[0], -1)
            # use self.training so dropout is only active during training
            x = F.dropout(x, p=dropout_prob, training=self.training)
    
            return x
    

    TextClassifyCnnNet (main module):

    class TextClassifyCnnNet(nn.Module):
        def __init__(self, embedding_size, sequence_length, num_classes, filter_sizes=[3, 4, 5], out_channels=128):
            super(TextClassifyCnnNet, self).__init__()
    
            self.flat_layer = FlatCnnLayer(embedding_size, sequence_length, filter_sizes=filter_sizes,
                                           out_channels=out_channels)
    
            self.model = nn.Sequential(
                self.flat_layer,
                nn.Linear(out_channels * len(filter_sizes), num_classes)
            )
    
        def forward(self, x):
            x = self.model(x)
    
            return x
    
    
    def fit(net, data, save_path):
        if torch.cuda.is_available():
            net = net.cuda()
    
        for param in list(net.parameters()):
            print(type(param.data), param.size())
    
        optimizer = optim.Adam(net.parameters(), lr=0.01, weight_decay=0.1)
    
        X_train, X_test = data['X_train'], data['X_test']
        Y_train, Y_test = data['Y_train'], data['Y_test']
    
        X_valid, Y_valid = data['X_valid'], data['Y_valid']
    
        n_batch = len(X_train) // batch_size
    
        for epoch in range(1, n_epochs + 1):  # loop over the dataset multiple times
            net.train()
            start = 0
            end = batch_size
    
            for batch_idx in range(1, n_batch + 1):
                # get the inputs
                x, y = X_train[start:end], Y_train[start:end]
                start = end
                end = start + batch_size
    
                # zero the parameter gradients
                optimizer.zero_grad()
    
                # forward + backward + optimize
                predicts = _get_predict(net, x)
                loss = _get_loss(predicts, y)
                loss.backward()
                optimizer.step()
    
                if batch_idx % display_step == 0:
                    print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                        epoch, batch_idx * len(x), len(X_train), 100. * batch_idx / (n_batch + 1), loss.data[0]))
    
            # print statistics
            if epoch % display_step == 0 or epoch == 1:
                net.eval()
                valid_predicts = _get_predict(net, X_valid)
                valid_loss = _get_loss(valid_predicts, Y_valid)
                valid_accuracy = _get_accuracy(valid_predicts, Y_valid)
                print('\r[%d] loss: %.3f - accuracy: %.2f' % (epoch, valid_loss.data[0], valid_accuracy * 100))
    
        print('\rFinished Training\n')
    
        net.eval()
    
        test_predicts = _get_predict(net, X_test)
        test_loss = _get_loss(test_predicts, Y_test).data[0]
        test_accuracy = _get_accuracy(test_predicts, Y_test)
        print('Test loss: %.3f - Test accuracy: %.2f' % (test_loss, test_accuracy * 100))
    
        torch.save(net.flat_layer.state_dict(), save_path)
    
    
    def _get_accuracy(predicts, labels):
        # arg-max class per sample, moved to numpy so it can be compared with the labels array
        predicts = torch.max(predicts, 1)[1].data.cpu().numpy()
        return np.mean(predicts == labels)
    
    
    def _get_predict(net, x):
        # wrap them in Variable
        inputs = torch.from_numpy(x).float()
        # convert to cuda tensors if cuda flag is true
        if torch.cuda.is_available():
            inputs = inputs.cuda()
        inputs = Variable(inputs)
        return net(inputs)
    
    
    def _get_loss(predicts, labels):
        labels = torch.from_numpy(labels).long()
        # convert to cuda tensors if cuda flag is true
        if torch.cuda.is_available():
            labels = labels.cuda()
        labels = Variable(labels)
        return F.cross_entropy(predicts, labels)
    

    It seems that the parameters are only updated slightly each epoch, and the accuracy stays the same throughout the whole process, while the same implementation with the same parameters in TensorFlow runs correctly.

    I'm new to PyTorch, so maybe my code has something wrong; please help me find it. Thank you!

    P.S.: I tried using F.nll_loss + F.log_softmax instead of F.cross_entropy. In theory it should return the same result, but in practice a different value is printed out (and it is still a wrong loss value).
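
    For reference, a minimal standalone check of that equivalence (on dummy tensors, independent of the model above): F.cross_entropy is log_softmax followed by nll_loss, so the two losses below should match.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 3)               # 4 samples, 3 classes
    targets = torch.tensor([0, 2, 1, 2])

    loss_a = F.cross_entropy(logits, targets)
    loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
    # loss_a and loss_b should agree up to floating-point error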

  • Viet Phan, over 6 years ago
    How can I set weight_decay for the fully connected layer only, or set a specific weight_decay for each type of layer?
  • jdhao, over 6 years ago
    This is easy to achieve in PyTorch. The optimizer accepts parameter groups, and in each parameter group you can set lr and weight_decay separately. See here for more info. Also, searching Google for "different learning rate for different layers in pytorch" will give you plenty of information. Another resource is the wonderful PyTorch forum. Make sure to search the forum before posting your question, as many questions have already been asked and have good answers.
  • jdhao, over 6 years ago
    @VietPhan Does decreasing the weight decay value work for you?
  • Viet Phan, over 6 years ago
    I disabled weight decay for the Conv2d layers and only used it on the FC layer. It works!
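
    As a minimal sketch of that kind of split (hypothetical helper code, assuming the `net` built from the modules above, not necessarily the exact code used): collect the Conv2d and Linear parameters into separate groups and give only the Linear group a weight_decay.

    import torch.nn as nn
    import torch.optim as optim

    conv_params, fc_params = [], []
    for module in net.modules():
        if isinstance(module, nn.Conv2d):
            conv_params += list(module.parameters())
        elif isinstance(module, nn.Linear):
            fc_params += list(module.parameters())

    optimizer = optim.Adam([
        {'params': conv_params, 'weight_decay': 0.0},   # no L2 penalty on the conv branches
        {'params': fc_params, 'weight_decay': 0.1},     # L2 penalty only on the FC layer
    ], lr=0.001)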