How to iterate over two dataloaders simultaneously using pytorch?


Solution 1

I see you are struggling to write the right dataloader function. I would do:

class Siamese(Dataset):

    def __init__(self, transform=None):
        # init data here

    def __len__(self):
        return  # length of the data

    def __getitem__(self, idx):
        # get images and labels here
        # returned images must be tensors
        # labels should be ints
        return img1, img2, label1, label2
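
A minimal sketch of how that skeleton could be filled in, assuming the images live on disk as a list of file paths with integer labels and the second image of each pair is drawn at random (the names image_paths and labels are illustrative, not from the original answer):

import random
from PIL import Image
from torch.utils.data import Dataset

class Siamese(Dataset):

    def __init__(self, image_paths, labels, transform=None):
        # image_paths: list of file paths, labels: list of ints (same length)
        self.image_paths = image_paths
        self.labels = labels
        # transform should end with transforms.ToTensor() so images come back as tensors
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # pair the image at idx with a randomly chosen partner
        idx2 = random.randrange(len(self.image_paths))
        img1 = Image.open(self.image_paths[idx]).convert("RGB")
        img2 = Image.open(self.image_paths[idx2]).convert("RGB")
        if self.transform is not None:
            img1 = self.transform(img1)
            img2 = self.transform(img2)
        return img1, img2, self.labels[idx], self.labels[idx2]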

Solution 2

Further to what is already mentioned, cycle() together with zip() can create a memory leak problem, especially when using image datasets! To avoid that, instead of iterating like this:

dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10

for epoch in range(num_epochs):
    for i, (data1, data2) in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        do_cool_things()

you could use:

dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10

for epoch in range(num_epochs):
    dataloader_iterator = iter(dataloaders1)
    
    for i, data1 in enumerate(dataloaders2):

        try:
            data2 = next(dataloader_iterator)
        except StopIteration:
            dataloader_iterator = iter(dataloaders1)
            data2 = next(dataloader_iterator)

        do_cool_things()

Bear in mind that if you use labels as well, you should replace data1 in this example with (inputs1, targets1) and data2 with (inputs2, targets2), as @Sajad Norouzi said.
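
For instance, with labeled loaders the inner loop from the snippet above would look roughly like this (the variable names are only illustrative):

for epoch in range(num_epochs):
    dataloader_iterator = iter(dataloaders1)

    for i, (inputs2, targets2) in enumerate(dataloaders2):
        try:
            inputs1, targets1 = next(dataloader_iterator)
        except StopIteration:
            # restart the first loader once it is exhausted
            dataloader_iterator = iter(dataloaders1)
            inputs1, targets1 = next(dataloader_iterator)

        # train on both batches here
        do_cool_things()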

KUDOS to this one: https://github.com/pytorch/pytorch/issues/1917#issuecomment-433698337

Solution 3

To complete @ManojAcharya's answer:

The error you are getting comes neither from zip() nor from DataLoader() directly. Python is trying to tell you that it couldn't find one of the data files you are asking for (cf. the FileNotFoundError in the exception trace), probably in your Dataset.

Find below a working example using DataLoader and zip together. Note that if you want to shuffle your data, it becomes difficult to keep the correspondence between the two datasets. This justifies @ManojAcharya's solution.

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """
    Dataset of numbers in [a,b] inclusive
    """

    def __init__(self, a=0, b=100):
        super(DummyDataset, self).__init__()
        self.a = a
        self.b = b

    def __len__(self):
        return self.b - self.a + 1

    def __getitem__(self, index):
        return index, "label_{}".format(index)

dataloaders1 = DataLoader(DummyDataset(0, 9), batch_size=2, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 9), batch_size=2, shuffle=True)

for i, data in enumerate(zip(dataloaders1, dataloaders2)):
    print(data)
# ([tensor([ 4,  7]), ('label_4', 'label_7')], [tensor([ 8,  5]), ('label_8', 'label_5')])
# ([tensor([ 1,  9]), ('label_1', 'label_9')], [tensor([ 6,  9]), ('label_6', 'label_9')])
# ([tensor([ 6,  5]), ('label_6', 'label_5')], [tensor([ 0,  4]), ('label_0', 'label_4')])
# ([tensor([ 8,  2]), ('label_8', 'label_2')], [tensor([ 2,  7]), ('label_2', 'label_7')])
# ([tensor([ 0,  3]), ('label_0', 'label_3')], [tensor([ 3,  1]), ('label_3', 'label_1')])

Solution 4

If you want to iterate over two datasets simultaneously, there is no need to define your own Dataset class; just use TensorDataset as below (note that TensorDataset expects tensors with the same size in their first dimension, so dataset1 and dataset2 here must be tensors):

import torch
from torch.utils.data import DataLoader

dataset = torch.utils.data.TensorDataset(dataset1, dataset2)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for index, (xb1, xb2) in enumerate(dataloader):
    ...

If you want labels, or to iterate over more than two datasets, just feed them as additional arguments to TensorDataset after dataset2.
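
A small self-contained sketch of that pattern, using made-up random tensors x1, x2 and y in place of real data:

import torch
from torch.utils.data import DataLoader, TensorDataset

# two "datasets" that are already tensors sharing the same first dimension
x1 = torch.randn(1000, 3, 32, 32)   # e.g. images from source 1
x2 = torch.randn(1000, 3, 32, 32)   # e.g. images from source 2
y = torch.randint(0, 10, (1000,))   # labels

dataset = TensorDataset(x1, x2, y)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for index, (xb1, xb2, yb) in enumerate(dataloader):
    # xb1, xb2 and yb are aligned batches of 128 samples each
    pass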

Solution 5

Adding to @Aldream's solution, for the case where the datasets have different lengths and we want to pass through all of them in the same epoch, we can use cycle() from itertools in the Python standard library. Using @Aldream's code snippet, the updated code will look like:

from torch.utils.data import DataLoader, Dataset
from itertools import cycle

class DummyDataset(Dataset):
    """
    Dataset of numbers in [a,b] inclusive
    """

    def __init__(self, a=0, b=100):
        super(DummyDataset, self).__init__()
        self.a = a
        self.b = b

    def __len__(self):
        return self.b - self.a + 1

    def __getitem__(self, index):
        return index

dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10

for epoch in range(num_epochs):
    for i, data in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        print(data)

With zip() alone, the iteration stops once the smaller dataset (here, of length 100) is exhausted. With cycle(), the smaller dataset is repeated, so the iterator keeps going until it has seen every sample from the larger dataset (here, of length 200).

P.S. One can always argue that this approach may not be required to achieve convergence as long as one samples randomly, but with this approach the evaluation might be easier.


Comments

  • aa1
    aa1 over 2 years

    I am trying to implement a Siamese network that takes in two images. I load these images and create two separate dataloaders.

    In my loop I want to go through both dataloaders simultaneously so that I can train the network on both images.

    for i, data in enumerate(zip(dataloaders1, dataloaders2)):
    
        # get the inputs
        inputs1 = data[0][0].cuda(async=True);
        labels1 = data[0][1].cuda(async=True);
    
        inputs2 = data[1][0].cuda(async=True);
        labels2 = data[1][1].cuda(async=True);
    
        labels1 = labels1.view(batchSize,1)
        labels2 = labels2.view(batchSize,1)
    
        # zero the parameter gradients
        optimizer.zero_grad()
    
        # forward + backward + optimize
        outputs1 = alexnet(inputs1)
        outputs2 = alexnet(inputs2)
    

    The return value of the dataloader is a tuple. However, when I try to use zip to iterate over them, I get the following error:

    OSError: [Errno 24] Too many open files
    Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f2d3c00c190>> ignored                           
    

    Shouldn't zip work on all iterable items? But it seems like here I can't use it on dataloaders.

    Is there any other way to pursue this? Or am I approaching the implementation of a Siamese network incorrectly?

  • aa1
    aa1 almost 6 years
Since this class's __getitem__ member gets called with the argument idx, with two images would it need to be called with two indices, (self, idx1, idx2)? Or would I simply return two random images? For example, img1 = os.path.join(self.root_dir, imgNames[idx]) and img2 = os.path.join(self.root_dir, imgNames[idx + 1])?
  • macharya
    macharya almost 6 years
I guess you have an equal number of images (in the two arrays), so using one index you could load the images from both arrays. Or you could make another array of just the indices of the paired images and load those images using the idx of __getitem__ in the dataloader.
  • Priyanka Chaudhary
    Priyanka Chaudhary over 4 years
I am using the same approach, but then how do you calculate the dataset length for dataloaders1 to calculate the loss? Thanks!
  • Chaitanya Patel
    Chaitanya Patel over 4 years
Correct me if I am wrong, but I think using zip would be very inefficient in many cases. Some datasets have a huge number of images which are loaded at runtime. Zip will iterate over the entire dataset, causing it to load the images at once.
  • Subtain Malik
    Subtain Malik over 4 years
Thanks, it is the most relevant and easiest-to-follow answer
  • Alter
    Alter about 3 years
    Great answer for when you need different batch sizes from each dataset
  • nour
    nour over 2 years
@afroditi If I am using image and labels from the first dataloader and mask and labels from the second dataloader, how can I process them?
  • afroditi
    afroditi about 2 years
@nour It would be hard to do that during the training process with the shuffle=True option. You can pre-process the data accordingly to create a dataloader giving (image, label, mask) simultaneously, given that the labels are used for mapping. Else, if the labels from dataset 1 are different from dataset 2, you could again create a single dataloader providing batches of (image, mask, label1, label2).