Create model using one-hot encoding in Keras


Cool, you cleaned up the question. You want to classify a sentence, and I am assuming you want to do better than a bag-of-words encoding: you want to place importance on the sequence of words.

We'll choose a new model then: an RNN (the LSTM version). It effectively sums over the importance of each word (in sequence) as it builds up a representation of the sentence that best fits the task.

But we're going to have to handle the preprocessing a bit differently. For efficiency (so that we can process a batch of sentences together rather than one sentence at a time) we want all sentences to have the same number of words. So we choose a max_words, say 20: we pad shorter sentences up to that length and cut longer sentences down to it.

Keras is going to help with that. We'll encode every word with an integer.

import numpy as np

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

num_classes = 5 
max_words = 20
sentences = ["The cat is in the house",
                           "The green boy",
            "computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)

words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
vocab_size = len(words)

So "the green boy" might be [1, 3, 5] now. Then we'll pad and one-hot encode with

# pad to max_words length and one-hot encode with len(words) + 1 classes
# (+ 1 because we reserve 0 as the padding sentinel)
X = np.array([to_categorical(pad_sequences((sent,), max_words),
                             vocab_size + 1) for sent in sent_ints])
print(X.shape) # (3, 20, 16)
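
If the padding step is the part that feels opaque, here is a minimal standalone sketch of what pad_sequences does on its own, before any one-hot encoding. It assumes sent_ints holds the integer-encoded sentences (built with sent.split(), as in the corrected line in the EDIT below); the exact ids depend on how the set happens to be ordered, so treat them as illustrative.

# pad_sequences left-pads with 0 by default and truncates (from the front)
# any sequence longer than maxlen
padded = pad_sequences(sent_ints, maxlen=max_words)
print(padded.shape)  # (3, 20)
print(padded[1])     # "The green boy" -> 17 zeros followed by its 3 word ids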

Now to the model: we'll add a Dense layer to convert those one-hot words to dense vectors, then an LSTM to turn the sequence of word vectors into a single dense sentence vector, and finally a softmax activation to produce a probability distribution over the classes.

model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

That should compile. You can then carry on with training:

model.fit(X,y)
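
And as a rough usage sketch (the epoch count, batch size and the sample sentence are placeholders I made up, not anything from your data), you can train a little longer and then classify a new sentence by pushing it through the same encode, pad and one-hot pipeline, reusing word_map, max_words and vocab_size from above:

model.fit(X, y, epochs=10, batch_size=2)

new_sentence = "The green cat"  # hypothetical example sentence
# unknown words fall back to 0, the same id as the padding, to keep this simple
new_ints = [word_map.get(w, 0) for w in new_sentence.split()]
# reshape to a flat (max_words,) vector before one-hot encoding; depending on
# your Keras version, to_categorical may otherwise keep the extra batch axis
# from pad_sequences (the (3, 1, 20, 16) issue raised in the comments below)
padded = pad_sequences((new_ints,), max_words).reshape(max_words,)
new_X = to_categorical(padded, vocab_size + 1)[np.newaxis, ...]  # (1, max_words, vocab_size + 1)
probs = model.predict(new_X)  # (1, num_classes) softmax scores
print(probs.argmax(axis=-1))  # predicted class index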

EDIT:

this line:

# we need to split the sentences into words; right now it is reading every
# letter. Notice the sent.split() in the correct version below.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]

should be:

sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]

Comments

  • Timothy Rajan (almost 2 years ago)

    I am working on a sentence classification problem and am trying to solve it using Keras. The total number of unique words in the vocabulary is 36.

    In this case, the total vocab is [W1,W2,W3....W36]

    So, if I have a sentence with the words [W1 W2 W6 W7 W9] and I encode it, I get a numpy array like the one below

    [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
     [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
     [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
     [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
    

    and the shape is (5,36)

    I am stuck from here. All I have generated is 20,000 numpy arrays with varying shapes, i.e. (N, 36), where N is the number of words in a sentence. So I have 20,000 sentences for training and 100 for testing, and all the sentences are labelled with a (1, 36) one-hot encoding.

    I have x_train, x_test, y_train and y_test

    x_test and y_test are of dimension (1,36)

    Can anyone please advise how I should do this?

    I did some coding, shown below:

    model = Sequential()
    model.add(Dense(512, input_shape=(??????)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    

    Any help would be much appreciated.

    UPDATE and RESPONSE TO @putonspectacles

    Thank you very much for the time and effort you put into the detailed response. I tried your code with some minor modifications, which I believe are needed for the code to work. Please find it below.

    num_classes = 5 
    max_words = 20
    sentences = ["The cat is in the house","The green boy","computer programs are not alive while the children are"]
    labels = np.random.randint(0, num_classes, 3)
    y = to_categorical(labels, num_classes=num_classes)
    words = set(w for sent in sentences for w in sent.split())
    word_map = {w : i+1 for (i, w) in enumerate(words)}
    # - Changed the below line: the inner for loop now iterates over sent.split()
    sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
    vocab_size = len(words)
    print(vocab_size)
    # - Changed the below line: the outer for loop now iterates over sent_ints
    X = np.array([to_categorical(pad_sequences((sent,), max_words),vocab_size+1)  for sent in sent_ints])
    print(X)
    print(y)
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
    model.add(LSTM(128))
    model.add(Dense(5, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X,y)
    

    Without these changes the code doesn't work. When I run the above code, the one-hot encodings print properly, like below

    [[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
    [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
    [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
    
    
    [[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]]
    
    
     [[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
     [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
     [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]]
    
    
    
    [[0. 0. 0. 0. 1.]
    [1. 0. 0. 0. 0.]
    [0. 1. 0. 0. 0.]]
    

    But the error I am getting is "Error when checking input: expected dense_44_input to have 3 dimensions, but got array with shape (3, 1, 20, 16)"

    When I change the input shape to model.add(Dense(512, input_shape=(None,max_words, vocab_size + 1)))

    I get the error "Input 0 is incompatible with layer lstm_27: expected ndim=3, found ndim=4"

    I am working on resolving this issue. If you can give me a direction, that would be great.

    I have accepted the answer because it answers the objective of embedding the words. Thanks again.

  • Timothy Rajan (about 6 years ago)
    Thanks a lot for your code and detailed answer; it solved my issue. But I am not able to compile the model; working on it. I have edited the code you provided. Hope the edit is correct.
  • parsethis (about 6 years ago)
    Hmmm, X should be a 3D array after padding and encoding. I'll check my code.
  • Timothy Rajan (about 6 years ago)
    Still getting the same error: Input 0 is incompatible with layer lstm_28: expected ndim=3, found ndim=4
  • parsethis (about 6 years ago)
    Does your sent_ints look like this: [[12, 5, 4, 1, 7, 13], [12, 8, 4, 6], [3, 11, 15, 9, 14, 10, 7, 2, 15]]
  • Timothy Rajan (about 6 years ago)
    yes: [[3, 15, 4, 13, 12, 6], [3, 2, 7], [8, 5, 1, 10, 9, 11, 12, 14, 1]]
  • Timothy Rajan (about 6 years ago)
    As I mentioned, when the model layer is "model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))", the error is "Error when checking input: expected dense_52_input to have 3 dimensions, but got array with shape (3, 1, 20, 16)"
  • Timothy Rajan (about 6 years ago)
    When the model layer is "model.add(Dense(None, 512, input_shape=(max_words, vocab_size + 1)))", the error is "Input 0 is incompatible with layer lstm_28: expected ndim=3, found ndim=4"
  • parsethis (about 6 years ago)
    shoot me an email orson[dot]network[at]gmail[dot]com
  • Timothy Rajan (about 6 years ago)
    The below line solved the issue mentioned in the Edited question.
  • Timothy Rajan (about 6 years ago)
    X = np.array([to_categorical((sequence.pad_sequences((sent,), max_words)).reshape(20,),vocab_size + 1) for sent in sent_ints])
  • matt (almost 5 years ago)
    I've updated this line, now it works: sent_ints = [[word_map[w] for w in sent.split(" ")] for sent in sentences]