How to convert a list of strings into a tensor in PyTorch?

44,271

Solution 1

Unfortunately, you can't right now: PyTorch tensors only support numeric (and boolean) dtypes. I also don't think string support would be a good idea, since it would make PyTorch clumsy. A popular workaround is to convert the strings into numeric labels using sklearn.

Here is a short example:

from sklearn import preprocessing
import torch

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = le.fit_transform(labels)
# targets: array([0, 1, 3, 2, 4])  (classes are numbered in sorted order)

targets = torch.as_tensor(targets)
# targets: tensor([0, 1, 3, 2, 4])

Since you may need to convert between the original labels and the encoded ones later, it is a good idea to keep the variable le around.
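
As a minimal sketch of that round trip, the snippet below decodes class indices back into strings with the fitted encoder. The `predicted` tensor is a made-up stand-in for model output, not something from the original answer.

```python
from sklearn import preprocessing
import torch

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = torch.as_tensor(le.fit_transform(labels))

# le.classes_ holds the sorted unique labels; position i is the string for code i
classes = list(le.classes_)

# Hypothetical class indices, standing in for a model's predictions
predicted = torch.tensor([3, 0, 4])

# inverse_transform expects an array-like of integer codes, so go through NumPy
decoded = le.inverse_transform(predicted.numpy()).tolist()
print(decoded)  # ['mouse', 'cat', 'pandas']
```

If the encoder has to survive between training and inference, it would need to be saved alongside the model, for example by pickling it.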

Solution 2

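The trick is to first find the maximum byte length of a word in the list, and then, in a second loop, copy each word into a zero-initialized tensor so that shorter words are padded with zeros. Note that UTF-8 is a variable-width encoding: ASCII characters take one byte each, while the Hebrew letters in the example take two bytes each, so lengths here are measured in bytes, not characters.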

In[]
import torch

words = ['שלום', 'beautiful', 'world']
max_l = 0
ts_list = []
for w in words:
    ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
    max_l = max(ts_list[-1].size()[0], max_l)

w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
for i, ts in enumerate(ts_list):
    w_t[i, 0:ts.size()[0]] = ts
w_t

Out[]
tensor([[215, 169, 215, 156, 215, 149, 215, 157,   0],
        [ 98, 101,  97, 117, 116, 105, 102, 117, 108],
        [119, 111, 114, 108, 100,   0,   0,   0,   0]], dtype=torch.uint8)
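
Going the other way, the padded rows can be turned back into strings by stripping the trailing zeros and decoding the remaining bytes. This is a rough sketch, assuming the original strings never contain a NUL byte themselves; `decode_rows` is a helper name introduced here for illustration.

```python
import torch

words = ['שלום', 'beautiful', 'world']

# Rebuild the zero-padded uint8 tensor in the same way as the answer above
encoded = [torch.tensor(list(w.encode('utf8')), dtype=torch.uint8) for w in words]
max_l = max(t.size(0) for t in encoded)
w_t = torch.zeros((len(encoded), max_l), dtype=torch.uint8)
for i, t in enumerate(encoded):
    w_t[i, :t.size(0)] = t


def decode_rows(padded):
    """Turn each zero-padded row of UTF-8 byte values back into a str."""
    out = []
    for row in padded.tolist():
        raw = bytes(row).rstrip(b'\x00')  # drop the zero padding at the end
        out.append(raw.decode('utf8'))
    return out


decoded = decode_rows(w_t)
print(decoded == words)  # True
```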

Solution 3

If you don't want to use sklearn, another solution is to keep your original list and create an extra tensor of indices, which you can use to refer back to your original values afterwards. I needed exactly this when I had to keep track of my original strings while batching the tokenized versions.

Example below:

import torch
from torch.utils.data import DataLoader, TensorDataset

labels = ['cat', 'dog', 'mouse']
# one integer index per label: tensor([0, 1, 2])
torch_idx = torch.arange(len(labels))
# do whatever you like with it in torch, e.g. pass it to a DataLoader
dataset = TensorDataset(torch_idx)
loader = DataLoader(dataset, batch_size=1, shuffle=True)
for batch in loader:
    print(batch[0])
    # map the index back to the original string
    print(labels[batch[0].item()])

# example output (the order varies between runs because shuffle=True):
# tensor([0])
# cat
# tensor([1])
# dog
# tensor([2])
# mouse

For my specific use case, the code looked like this:

input_ids, attention_masks, labels = tokenize_sentences(tokenizer, sentences, labels, max_length)

# create an index tensor to keep track of the original sentence index
torch_idx = torch.arange(len(sentences))
dataset = TensorDataset(input_ids, attention_masks, labels, torch_idx)
loader = DataLoader(dataset, batch_size=1, shuffle=True)

for batch in loader:
    _, logit = model(batch[0],
                     token_type_ids=None,
                     attention_mask=batch[1],
                     labels=batch[2])

    # predicted class per example (batch_size=1, so a single value)
    pred_flat = logit.detach().argmax(dim=1).flatten()
    print(pred_flat)
    print(batch[2])
    if pred_flat == batch[2]:
        print("\nThe following sentence was predicted correctly:")
        print(sentences[batch[3].item()])

Author: deepayan das

Updated on October 30, 2020

Comments

  • deepayan das
    deepayan das over 3 years

    I am working on a classification problem in which I have a list of strings as class labels, and I want to convert them into a tensor. So far I have tried converting the list of strings into a numpy array using the np.array function provided by the numpy module.

    truth = torch.from_numpy(np.array(truths))

    but I am getting the following error.

    RuntimeError: can't convert a given np.ndarray to a tensor - it has an invalid type. The only supported types are: double, float, int64, int32, and uint8.

    Can anybody suggest an alternative approach? Thanks

    • gokul_uf
      gokul_uf almost 7 years
      taking the ASCII value / UNICODE value of the characters could be a workaround (ASCII would fit in uint8)
    • deepayan das
      deepayan das almost 7 years
      okay thanks will try that
    • BiBi
      BiBi almost 7 years
      What about simply converting your string labels into digits or one-hot vectors?
    • cleros
      cleros almost 7 years
      I agree with @BiBi - it sounds like you want a one-hot encoding.
    • deepayan das
      deepayan das almost 7 years
      Yeah, I am using one hot encoding, Thanks @BiBi
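
The one-hot approach suggested in the comments can be sketched without sklearn as follows. The label list and the `class_to_idx` mapping are illustrative assumptions, not taken from the question.

```python
import torch
import torch.nn.functional as F

labels = ['cat', 'dog', 'mouse', 'dog']

# Assign each distinct label an integer id, in sorted order
classes = sorted(set(labels))
class_to_idx = {name: i for i, name in enumerate(classes)}

# Integer class indices (int64), which is the dtype F.one_hot requires
indices = torch.tensor([class_to_idx[name] for name in labels])

# One row per label, one column per class
one_hot = F.one_hot(indices, num_classes=len(classes))
print(one_hot.tolist())  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

Depending on the loss function, the integer `indices` tensor may be all that is needed: `torch.nn.CrossEntropyLoss`, for instance, accepts class indices directly as targets.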