PyTorch Datasets: Converting entire Dataset to NumPy


Solution 1

If I understand you correctly, you want the whole MNIST training set (60,000 images in total, each a 1x28x28 array, where 1 is the color channel) as a single NumPy array of shape (60000, 1, 28, 28)?

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Transform to normalized Tensors 
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = datasets.MNIST('./MNIST/', train=True, transform=transform, download=True)
# test_dataset = datasets.MNIST('./MNIST/', train=False, transform=transform, download=True)


train_loader = DataLoader(train_dataset, batch_size=len(train_dataset))  # one batch = the whole dataset
# test_loader = DataLoader(test_dataset, batch_size=len(test_dataset))

# Each batch is an (images, labels) pair; [0] selects the images
train_dataset_array = next(iter(train_loader))[0].numpy()
# test_dataset_array = next(iter(test_loader))[0].numpy()

This is the result (truncated; the entries shown are background pixels, which all map to the same normalized value):

>>> train_dataset_array

array([[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
          -0.42421296, -0.42421296],
         [-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
          -0.42421296, -0.42421296],
         ...,
         [-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
          -0.42421296, -0.42421296]]],


       ...,


       [[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
          -0.42421296, -0.42421296],
         [-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
          -0.42421296, -0.42421296],
         ...,
         [-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
          -0.42421296, -0.42421296]]]], dtype=float32)
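
As the comments below confirm, the batch returned by the loader also carries the labels. A minimal sketch, reusing the train_loader defined above:

images, labels = next(iter(train_loader))
train_images = images.numpy()   # shape (60000, 1, 28, 28), dtype float32
train_labels = labels.numpy()   # shape (60000,), dtype int64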

Solution 2

There is no need to use torch.utils.data.DataLoader for this task.

from torchvision import datasets

train_set = datasets.MNIST('./data', train=True, download=True)
test_set = datasets.MNIST('./data', train=False, download=True)

# .data is the raw uint8 image tensor stored on the dataset object
train_set_array = train_set.data.numpy()
test_set_array = test_set.data.numpy()

Note that in this case the targets are excluded and no transforms are applied: the arrays hold the raw uint8 pixel values, with shape (N, 28, 28).
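
A short sketch of how one might recover the targets and reproduce Solution 1's preprocessing from these raw arrays (the mean/std constants are the ones used in Solution 1; this mirrors what ToTensor and Normalize do):

import numpy as np

# Targets are available on the dataset object as well
train_targets_array = train_set.targets.numpy()   # shape (60000,), dtype int64

# Reproduce Solution 1's pipeline: scale uint8 [0, 255] to [0, 1],
# normalize with the MNIST mean/std, and add a channel dimension
train_normalized = (train_set_array.astype(np.float32) / 255.0 - 0.1307) / 0.3081
train_normalized = train_normalized[:, None, :, :]   # -> (60000, 1, 28, 28)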


Comments

  • Andrew Fan (almost 2 years)

    I'm trying to convert the Torchvision MNIST train and test datasets into NumPy arrays but can't find documentation to actually perform the conversion.

    My goal would be to take an entire dataset and convert it into a single NumPy array, preferably without iterating through the entire dataset.

    I've looked at "How do I turn a Pytorch Dataloader into a numpy array to display image data with matplotlib?", but it doesn't address my issue.

    So my question is, utilizing torch.utils.data.DataLoader, how would I go about converting the datasets (train/test) into two NumPy arrays such that all of the examples are present?

    Note: I've left the batch size as the default of 1 for now; I could set it to 60,000 for train and 10,000 for test, but I'd prefer to not use magic numbers of that sort.

    Thank you.

  • Andrew Fan (about 5 years)
    I assume that I can use next(iter(train_loader))[1].numpy() to get the labels?
  • Andreas K. (about 5 years)
    Yes, it gives the corresponding labels.
  • kyriakosSt (over 4 years)
    That's very helpful, but does it give you the labels as well?
  • damavrom (over 4 years)
    There is no practical reason to include the labels in the same matrix as the data. You can get a separate NumPy array of the targets by running train_set_array_targets = train_set.targets.numpy()
  • kyriakosSt (over 4 years)
    That is very much true. I was just missing the data and targets attributes of Dataset objects, which I could not seem to find in the documentation for some reason.
  • damavrom (over 4 years)
    This is because data and targets are not attributes of the Dataset class but of the MNIST class, which subclasses Dataset. I would direct you to the documentation of the MNIST class in the torchvision package, but I don't think it contains anything that answers the question asked here.
  • liang (almost 4 years)
    This does not apply the transform, though, whereas the accepted answer does.