How to get mini-batches in pytorch in a clean and efficient way?


Solution 1

If I'm understanding your code correctly, your get_batch2 function draws random mini-batches from your dataset without tracking which indices have already been used in the current epoch. The issue with this implementation is that it likely will not make use of all of your data in each epoch.

The way I usually do batching is to create a random permutation of all the possible indices with torch.randperm(N) and then loop through them in batches. For example:

n_epochs = 100 # or whatever
batch_size = 128 # or whatever

for epoch in range(n_epochs):

    # X is a torch Variable
    permutation = torch.randperm(X.size()[0])

    for i in range(0, X.size()[0], batch_size):
        optimizer.zero_grad()

        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X[indices], Y[indices]

        # in case you wanted a semi-full example
        outputs = model(batch_x)
        loss = lossfunction(outputs, batch_y)

        loss.backward()
        optimizer.step()

If you'd like to copy and paste, make sure you define your optimizer, model, and loss function somewhere before the start of the epoch loop.
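
For completeness, here is a minimal sketch of those definitions (a toy linear model purely for illustration; adapt the shapes and learning rate to your own problem):

model = torch.nn.Linear(X.size()[1], 1)                    # toy model: a single linear layer
lossfunction = torch.nn.MSELoss()                          # mean squared error, as an example
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # any optimizer works here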

With regards to your error, try using torch.from_numpy(np.random.randint(0,N,size=M)).long() instead of torch.LongTensor(np.random.randint(0,N,size=M)). I'm not sure if this will solve the error you are getting, but it will solve a future error.
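
Applied inside your get_batch2, that looks roughly like this (a small sketch; N is the dataset size and M the mini-batch size, as in your code):

batch_indices = torch.from_numpy(np.random.randint(0, N, size=M)).long()
batch_xs = torch.index_select(X, 0, batch_indices)
batch_ys = torch.index_select(Y, 0, batch_indices)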

Solution 2

Use data loaders.

Data Set

First you define a dataset. You can use the ready-made datasets in torchvision.datasets, or the ImageFolder dataset class, which follows the directory structure of ImageNet.

trainset=torchvision.datasets.ImageFolder(root='/path/to/your/data/trn', transform=generic_transform)
testset=torchvision.datasets.ImageFolder(root='/path/to/your/data/val', transform=generic_transform)

Transforms

Transforms are very useful for preprocessing loaded data on the fly. If you are using images, you have to use the ToTensor() transform to convert loaded PIL images to torch.Tensor. Multiple transforms can be packed into a composite transform as follows.

generic_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.ToPILImage(),
    #transforms.CenterCrop(size=128),
    transforms.Lambda(lambda x: myimresize(x, (128, 128))),
    transforms.ToTensor(),
    transforms.Normalize((0., 0., 0.), (6, 6, 6))
])
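
If you do not need a custom resize helper like myimresize, a more typical pipeline in recent torchvision versions would look like this (a sketch, not the author's exact setup; the normalization values are placeholders you should replace with statistics for your own data):

generic_transform = transforms.Compose([
    transforms.Resize((128, 128)),                            # resize the PIL image
    transforms.ToTensor(),                                    # PIL image -> float tensor in [0, 1], channel-first
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # placeholder per-channel mean/std
])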

Data Loader

Then you define a data loader which prepares the next batch while training. You can set the number of worker processes used for data loading via num_workers.

trainloader=torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8)
testloader=torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False, num_workers=8)

For training, you just enumerate over the data loader.

for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())
    # continue training...

NumPy Stuff

Yes. You have to convert a torch.Tensor to a NumPy array with the .numpy() method to work on it with NumPy. If you are using CUDA you have to move the data from GPU to CPU first with the .cpu() method before calling .numpy(). Personally, coming from a MATLAB background, I prefer to do most of the work with torch tensors and convert data to NumPy only for visualisation. Also bear in mind that torch stores data in channel-first order, while NumPy and PIL work with channel-last. This means you need to use np.rollaxis (or np.transpose) to move the channel axis to the last position. Sample code is below.

np.rollaxis(make_grid(mynet.ftrextractor(inputs).data, nrow=8, padding=1).cpu().numpy(), 0, 3)
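
For a single image tensor, the same conversion can be written step by step (a small sketch; img is assumed to be a (C, H, W) tensor living on the GPU):

img_np = img.cpu().numpy()                 # download from GPU and convert to NumPy
img_np = np.transpose(img_np, (1, 2, 0))   # channel-first (C, H, W) -> channel-last (H, W, C)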

Logging

The best method I have found to visualise the feature maps is using TensorBoard. Code is available at yunjey/pytorch-tutorial.
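
In recent PyTorch versions you can also log scalars and image grids directly with torch.utils.tensorboard (a minimal sketch with dummy values, assuming the tensorboard package is installed; the tags, log directory, and feature maps are placeholders):

import torch
from torch.utils.tensorboard import SummaryWriter
from torchvision.utils import make_grid

writer = SummaryWriter('runs/experiment1')              # placeholder log directory
fmaps = torch.rand(16, 3, 32, 32)                       # dummy feature maps, just to illustrate the calls
writer.add_scalar('train/loss', 0.123, 0)               # tag, scalar value, global step
writer.add_image('feature_maps', make_grid(fmaps, nrow=8), 0)
writer.close()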

Solution 3

Not sure what you were trying to do. With regard to batching, you wouldn't have to convert to numpy. You could just use index_select(), e.g.:

for epoch in range(500):
    k = 0
    loss = 0
    while k < X_mdl.size(0):

        # draw M random row indices (with replacement) and select the corresponding rows
        random_batch = [np.random.choice(N) for _ in range(M)]
        random_batch = torch.LongTensor(random_batch)
        batch_xs = X_mdl.index_select(0, random_batch)
        batch_ys = y.index_select(0, random_batch)

        # Forward pass: compute predicted y using operations on Variables
        y_pred = batch_xs.mm(W)
        # etc..

        k += M

The rest of the code would have to be changed as well though.


My guess is that you would like to create a get_batch function that concatenates your X tensors and Y tensors. Something like:

def make_batch(list_of_tensors):
    X, y = list_of_tensors[0]
    # may need to unsqueeze X and y to get right dimensions
    for i, (sample, label) in enumerate(list_of_tensors[1:]):
        X = torch.cat((X, sample), dim=0)
        y = torch.cat((y, label), dim=0)
    return X, y
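
A more concise alternative, assuming every sample and label in list_of_tensors has the same shape and does not already carry a batch dimension, is torch.stack:

X = torch.stack([sample for sample, _ in list_of_tensors], dim=0)
y = torch.stack([label for _, label in list_of_tensors], dim=0)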

Then, during training, you select mini-batches of up to max_batch_size = 32 examples through slicing.

for epoch in range(n_epochs):
    X, y = make_batch(list_of_tensors)
    X = Variable(X, requires_grad=False)
    y = Variable(y, requires_grad=False)

    k = 0
    while k < X.size(0):
        inputs = X[k:k+max_batch_size,:]
        labels = y[k:k+max_batch_size,:]
        # some computation
        k += max_batch_size

Solution 4

You can use torch.utils.data.

Assuming you have loaded the data from the directory into train and test NumPy arrays, you can inherit from the torch.utils.data.Dataset class to create your dataset object:

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, x, y):
        super(MyDataset, self).__init__()
        assert x.shape[0] == y.shape[0]  # assuming shape[0] = dataset size
        self.x = x
        self.y = y

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

Then, create your dataset object

traindata = MyDataset(train_x, train_y)

Finally, use DataLoader to create your mini-batches

trainloader = torch.utils.data.DataLoader(traindata, batch_size=64, shuffle=True)
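
To consume the mini-batches, you then iterate over the loader (a minimal sketch; model, criterion, and optimizer are placeholders you define elsewhere, and the .float() cast assumes the NumPy arrays were float64):

for batch_x, batch_y in trainloader:
    optimizer.zero_grad()
    outputs = model(batch_x.float())
    loss = criterion(outputs, batch_y)
    loss.backward()
    optimizer.step()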

Solution 5

Create a class that is a subclass of torch.utils.data.Dataset and pass it to a torch.utils.data.DataLoader. Below is an example from my project.

import numpy as np
import torch
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

class CandidateDataset(Dataset):
    def __init__(self, x, y):
        self.len = x.shape[0]
        if torch.cuda.is_available():
            device = 'cuda'
        else:
            device = 'cpu'
        self.x_data = torch.as_tensor(x, device=device, dtype=torch.float)
        self.y_data = torch.as_tensor(y, device=device, dtype=torch.long)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

def fit(self, candidate_count):
    feature_matrix = np.empty(shape=(candidate_count, 600))
    target_matrix = np.empty(shape=(candidate_count, 1))
    fill_matrices(feature_matrix, target_matrix)  # author's own helper that populates the arrays
    candidate_ds = CandidateDataset(feature_matrix, target_matrix)
    train_loader = DataLoader(dataset=candidate_ds, batch_size=self.BATCH_SIZE, shuffle=True)
    for epoch in range(self.N_EPOCHS):
        print('starting epoch ' + str(epoch))
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            print('starting batch ' + str(batch_idx) + ' epoch ' + str(epoch))
            inputs, labels = Variable(inputs), Variable(labels)
            self.optimizer.zero_grad()
            inputs = inputs.view(1, inputs.size()[0], 600)
            # init hidden with number of rows in input
            y_pred = self.model(inputs, self.model.initHidden(inputs.size()[1]))
            labels.squeeze_()
            # labels should be a tensor with batch_size rows; each entry is the index of the class (0 or 1)
            loss = self.loss_f(y_pred, labels)
            loss.backward()
            self.optimizer.step()
            print('done batch ' + str(batch_idx) + ' epoch ' + str(epoch))

Comments

  • Charlie Parker
    Charlie Parker almost 2 years

    I was trying to do a simple thing which was train a linear model with Stochastic Gradient Descent (SGD) using torch:

    import numpy as np
    
    import torch
    from torch.autograd import Variable
    
    import pdb
    
    def get_batch2(X,Y,M,dtype):
        X,Y = X.data.numpy(), Y.data.numpy()
        N = len(Y)
        valid_indices = np.array( range(N) )
        batch_indices = np.random.choice(valid_indices,size=M,replace=False)
        batch_xs = torch.FloatTensor(X[batch_indices,:]).type(dtype)
        batch_ys = torch.FloatTensor(Y[batch_indices]).type(dtype)
        return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)
    
    def poly_kernel_matrix( x,D ):
        N = len(x)
        Kern = np.zeros( (N,D+1) )
        for n in range(N):
            for d in range(D+1):
                Kern[n,d] = x[n]**d;
        return Kern
    
    ## data params
    N=5 # data set size
    Degree=4 # number dimensions/features
    D_sgd = Degree+1
    ##
    x_true = np.linspace(0,1,N) # the real data points
    y = np.sin(2*np.pi*x_true)
    y.shape = (N,1)
    ## TORCH
    dtype = torch.FloatTensor
    # dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
    X_mdl = poly_kernel_matrix( x_true,Degree )
    X_mdl = Variable(torch.FloatTensor(X_mdl).type(dtype), requires_grad=False)
    y = Variable(torch.FloatTensor(y).type(dtype), requires_grad=False)
    ## SGD mdl
    w_init = torch.zeros(D_sgd,1).type(dtype)
    W = Variable(w_init, requires_grad=True)
    M = 5 # mini-batch size
    eta = 0.1 # step size
    for i in range(500):
        batch_xs, batch_ys = get_batch2(X_mdl,y,M,dtype)
        # Forward pass: compute predicted y using operations on Variables
        y_pred = batch_xs.mm(W)
        # Compute and print loss using operations on Variables. Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape (1,); loss.data[0] is a scalar value holding the loss.
        loss = (1/N)*(y_pred - batch_ys).pow(2).sum()
        # Use autograd to compute the backward pass. Now w will have gradients
        loss.backward()
        # Update weights using gradient descent; w1.data are Tensors,
        # w.grad are Variables and w.grad.data are Tensors.
        W.data -= eta * W.grad.data
        # Manually zero the gradients after updating weights
        W.grad.data.zero_()
    
    #
    c_sgd = W.data.numpy()
    X_mdl = X_mdl.data.numpy()
    y = y.data.numpy()
    #
    Xc_pinv = np.dot(X_mdl,c_sgd)
    print('J(c_sgd) = ', (1/N)*(np.linalg.norm(y-Xc_pinv)**2) )
    print('loss = ',loss.data[0])
    

    The code runs fine, although my get_batch2 method seems really dumb/naive; that's probably because I am new to pytorch, but I have not found a good place where they discuss how to retrieve data batches. I went through their tutorials (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html) and through the data loading tutorial (http://pytorch.org/tutorials/beginner/data_loading_tutorial.html) with no luck. The tutorials all seem to assume that one already has the batch and batch-size at the beginning and then proceeds to train with that data without changing it (specifically look at http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-variables-and-autograd).

    So my question is: do I really need to turn my data back into numpy so that I can fetch some random sample of it and then turn it back into a pytorch Variable to be able to train in memory? Is there no way to get mini-batches with torch?

    I looked at a few functions torch provides but with no luck:

    #pdb.set_trace()
    #valid_indices = torch.arange(0,N).numpy()
    #valid_indices = np.array( range(N) )
    #batch_indices = np.random.choice(valid_indices,size=M,replace=False)
    #indices = torch.LongTensor(batch_indices)
    #batch_xs, batch_ys = torch.index_select(X_mdl, 0, indices), torch.index_select(y, 0, indices)
    

    Even though the code I provided works fine, I am worried that it's not an efficient implementation, and that if I were to use GPUs there would be a considerable further slowdown (my guess is that pulling things into numpy memory and then fetching them back to put them on the GPU like that is silly).


    I implemented a new version based on the answer that suggested using torch.index_select():

    def get_batch2(X,Y,M):
        '''
        get batch for pytorch model
        '''
        # TODO fix and make it nicer, there is pytorch forum question
        #X,Y = X.data.numpy(), Y.data.numpy()
        X,Y = X, Y
        N = X.size()[0]
        batch_indices = torch.LongTensor( np.random.randint(0,N+1,size=M) )
        pdb.set_trace()
        batch_xs = torch.index_select(X,0,batch_indices)
        batch_ys = torch.index_select(Y,0,batch_indices)
        return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)
    

    However, this seems to have issues because it does not work if X, Y are NOT Variables... which is really odd. I added this to the pytorch forum: https://discuss.pytorch.org/t/how-to-get-mini-batches-in-pytorch-in-a-clean-and-efficient-way/10322

    Right now what I am struggling with is making this work for GPU. My most recent version:

    def get_batch2(X,Y,M,dtype):
        '''
        get batch for pytorch model
        '''
        # TODO fix and make it nicer, there is pytorch forum question
        #X,Y = X.data.numpy(), Y.data.numpy()
        X,Y = X, Y
        N = X.size()[0]
        if dtype ==  torch.cuda.FloatTensor:
            batch_indices = torch.cuda.LongTensor( np.random.randint(0,N,size=M) )# without replacement
        else:
            batch_indices = torch.LongTensor( np.random.randint(0,N,size=M) ).type(dtype)  # without replacement
        pdb.set_trace()
        batch_xs = torch.index_select(X,0,batch_indices)
        batch_ys = torch.index_select(Y,0,batch_indices)
        return Variable(batch_xs, requires_grad=False), Variable(batch_ys, requires_grad=False)
    

    the error:

    RuntimeError: tried to construct a tensor from a int sequence, but found an item of type numpy.int64 at index (0)
    

    I don't get it, do I really have to do:

    ints = [random.randint(0, N) for i in range(M)]
    

    to get the integers?

    It would also be ideal if the data could be a Variable. It seems that torch.index_select does not work for Variable type data.

    This list-of-integers approach still doesn't work:

    TypeError: torch.addmm received an invalid combination of arguments - got (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor), but expected one of:
     * (torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
     * (torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
     * (float beta, torch.cuda.FloatTensor source, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
     * (torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
     * (float beta, torch.cuda.FloatTensor source, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
     * (torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
     * (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
          didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)
     * (float beta, torch.cuda.FloatTensor source, float alpha, torch.cuda.sparse.FloatTensor mat1, torch.cuda.FloatTensor mat2, *, torch.cuda.FloatTensor out)
          didn't match because some of the arguments have invalid types: (int, torch.cuda.FloatTensor, int, torch.cuda.FloatTensor, torch.FloatTensor, out=torch.cuda.FloatTensor)
    
  • jdhao
    jdhao over 6 years
    The tensor board link is broken and also you can use np.transpose() to convert from channel first to channel last representation.
  • Charlie Parker
    Charlie Parker over 6 years
    It's really annoying that index_select() requires the data to not be a Variable... why?
  • Charlie Parker
    Charlie Parker over 6 years
    if my data set is just a numpy array, how do I use your solution? I am confused sorry for the noob question.
  • Forcetti
    Forcetti over 6 years
    Probably because changing only parts of the data inside a Variable doesn't enable gradient calculation.
  • Charlie Parker
    Charlie Parker over 6 years
    but the data always has its requires_grad=False...how does that matter?
  • Charlie Parker
    Charlie Parker over 6 years
    how does torch.randperm(N) help?
  • saetch_g
    saetch_g over 6 years
    It helps in two ways. The first is that it ensures each data point in X is sampled in a single epoch. It is usually good to use all of your data to help your model generalize. The second way it helps is that it is relatively simple to implement. You don't have to write an entire function like get_batch2().
  • Forcetti
    Forcetti over 6 years
    You're right, requires_grad is only a boolean that indicates whether the Variable has been created by a subgraph. The data Variable shouldn't require grad, because you will overwrite the original content anyway. Apparently, you can index_select a Variable with a Variable: discuss.pytorch.org/t/indexing-a-variable-with-a-variable/2111
  • Charlie Parker
    Charlie Parker over 6 years
    I'm confused about one thing: what's the difference between index_select() and just indexing directly with X[k1:k2,:]? When would we use one vs the other?
  • Charlie Parker
    Charlie Parker over 6 years
    I wasn't aware people actually kept track of the indices they'd seen; is this standard practice? I thought just getting data without replacement was the common practice, at least in neural nets, no?
  • saetch_g
    saetch_g over 6 years
    Yeah, the important parts are ensuring that data is not repeated in an epoch and all the data is used in each epoch. Otherwise the model might overfit to some particular data and could be worse at generalizing to unseen testing data. Tracking the indices is just a simple way to achieve this goal. Another approach would be to shuffle the data at the beginning of each epoch. Whatever works. It just looked like your example code was potentially reusing some data and neglecting other data within an epoch. Sorry if I misunderstood your code.
  • saetch_g
    saetch_g over 6 years
    One benefit of using index permutations is that you can use it no matter which framework you're using. Numpy has np.random.permutation() so it's easy to do if you're using tensorflow.
  • Charlie Parker
    Charlie Parker over 6 years
    My code is just sampling without replacement and has the repetition issue you pointed out. However, that's what I thought was the standard. I understand what disadvantages it has (as you said), but I thought that's what was used regardless, because with massive data sets today it's just too expensive to keep track of the indices and such... no?
  • saetch_g
    saetch_g over 6 years
    Good point, I didn't think of that. If you have a big enough dataset it probably doesn't matter too much. I suppose that's a tradeoff every dev has to make for themselves.
  • information_interchange
    information_interchange over 5 years
    I just want to point out that this will discard any examples in the modulus of the batch_size. So if you have 10 examples, and set the batch size to 3, you won't utilize the last 1 example. Just something to be aware of
  • Sulphur
    Sulphur almost 4 years
    Sorry to be that person, but what is X in your sample code (as in X.size()[0])?