PyTorch Datasets: Converting entire Dataset to NumPy
Solution 1
If I understand you correctly, you want to get the whole train dataset of MNIST images (in total 60000 images, each image of size 1x28x28 array with 1 for color channel) as a numpy array of size (60000, 1, 28, 28)?
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# Transform to normalized Tensors
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST('./MNIST/', train=True, transform=transform, download=True)
# test_dataset = datasets.MNIST('./MNIST/', train=False, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=len(train_dataset))
# test_loader = DataLoader(test_dataset, batch_size=len(test_dataset))
train_dataset_array = next(iter(train_loader))[0].numpy()
# test_dataset_array = next(iter(test_loader))[0].numpy()
This is the result:
>>> train_dataset_array
array([[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
...,
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296]]],
[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
...,
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296]]],
[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
...,
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296]]],
...,
[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
...,
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296]]],
[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
...,
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296]]],
[[[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
...,
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296],
[-0.42421296, -0.42421296, -0.42421296, ..., -0.42421296,
-0.42421296, -0.42421296]]]], dtype=float32)
Solution 2
There is no need to use torch.utils.data.DataLoader
for this task.
from torchvision import datasets, transforms
train_set = datasets.MNIST('./data', train=True, download=True)
test_set = datasets.MNIST('./data', train=False, download=True)
train_set_array = train_set.data.numpy()
test_set_array = test_set.data.numpy()
Note that in this case targets are excluded.
Andrew Fan
I am an aspiring software and game developer. Primary skills involve managing and organizing smaller scale project and teams, and loving JavaScript except when I learn to hate it. I also do graphic design and have an interest in public transportation infrastructure.
Updated on June 23, 2022Comments
-
Andrew Fan almost 2 years
I'm trying to convert the Torchvision MNIST train and test datasets into NumPy arrays but can't find documentation to actually perform the conversion.
My goal would be to take an entire dataset and convert it into a single NumPy array, preferably without iterating through the entire dataset.
I've looked at How do I turn a Pytorch Dataloader into a numpy array to display image data with matplotlib? but it doesn't address my issue.
So my question is, utilizing
torch.utils.data.DataLoader
, how would I go about converting the datasets (train/test) into two NumPy arrays such that all of the examples are present?Note: I've left the batch size as the default of 1 for now; I could set it to 60,000 for train and 10,000 for test, but I'd prefer to not use magic numbers of that sort.
Thank you.
-
Andrew Fan about 5 yearsI assume that I can use
next(iter(train_loader))[1].numpy()
to get the labels? -
Andreas K. about 5 yearsYes, it gives the corresponding labels.
-
kyriakosSt over 4 yearsthat's very helpful, but does it give you the labels as well?
-
damavrom over 4 yearsThere is no practical reason to include labels in the same matrix of the data. You can acquire another NumPy matrix of the targets running
train_set_array_targets = train_set.targets.numpy()
-
kyriakosSt over 4 yearsThat is very much true. I was just missing the
data
andtargets
attributes ofDataset
objects - which I could not seem to find in the documentation for some reason. -
damavrom over 4 yearsThis is because
data
andtargets
are not attributes of theDataset
class but of theMNIST
class that subclasses fromDataset
. I would direct to the documentation ofMNIST
oftorchvision
package but I don't think there is anything that answers the question asked here. -
liang almost 4 yearsthis does not include the transform though, where the accepted answer does.