How do I load in the MNIST digits and label data in MATLAB?

I am the original author of Method #1 that you spoke of. Reading in the images and the labels is quite simple. As for the images, the code that you showed reads the pixel data correctly and stores it in a cell array. However, it never reads the number of images, rows and columns stored in the file header. The MNIST image file is laid out as follows, where the left column is the byte offset from the beginning of the file:

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

The first four bytes are a magic number, 2051, which lets you check that you're reading the file properly. The next four bytes give the total number of images, the four bytes after that give the number of rows and the final four header bytes give the number of columns. For the training set there should be 60000 images of size 28 rows by 28 columns. After the header, the pixels are stored in row-major order, one image after another, so you have to loop over successive blocks of 28 x 28 pixels and store them. In the original code, I stored them in a cell array where each element is one digit. The test data uses the same format, but there are 10000 images instead.
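As a quick sanity check on this layout, the expected file size follows directly from the header: 16 header bytes plus one byte per pixel. The snippet below is only a sketch; the file name is an assumption and may differ on your machine.

```matlab
% Expected sizes implied by the image file layout (4 header fields of 4 bytes each)
headerBytes = 4 * 4;
expectedTrainBytes = headerBytes + 60000 * 28 * 28;   % 47040016
expectedTestBytes  = headerBytes + 10000 * 28 * 28;   % 7840016

% Assumed file name; adjust it to whatever the extracted file is called
info = dir('train-images-idx3-ubyte');
if isscalar(info) && info.bytes ~= expectedTrainBytes
    warning('Unexpected size: %d bytes (expected %d)', info.bytes, expectedTrainBytes);
end
```

If the size does not match, the file is probably still compressed or was not extracted completely.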

As for the actual labels, it's roughly the same format but there are some slight differences:

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The first four bytes are a magic number: 2049, then the second set of four bytes tells you how many labels there are and finally there is exactly 1 byte for each corresponding digit in the dataset. The test data is also the same format but there are 10000 labels. As such, once you read in the necessary data in the label set, you just need one fread call and ensure that the data is unsigned 8-bit integer to read in the rest of the labels.
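To make the label layout concrete, here is a small sketch that writes a made-up three-label file in this format to a temporary location and reads it back. The data and the temporary file name are invented for illustration, and the swapbytes calls assume a little-endian machine.

```matlab
% Write a tiny label file in the layout above (made-up data: 3 labels)
fname = [tempname '-labels-idx1-ubyte'];
fid = fopen(fname, 'w');
fwrite(fid, [2049 3], 'uint32', 0, 'ieee-be');  % magic number and item count, MSB first
fwrite(fid, [7 2 1], 'uint8');                  % one byte per label
fclose(fid);

% Read it back the same way the real label file is read
fid = fopen(fname, 'r');
magicNumber = swapbytes(uint32(fread(fid, 1, 'uint32')));  % 2049 on a little-endian machine
numLabels   = swapbytes(uint32(fread(fid, 1, 'uint32')));
labels      = fread(fid, double(numLabels), 'uint8');      % a single call for all labels
fclose(fid);
delete(fname);
```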

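Now the reason why you have to use swapbytes is that the MNIST files store their 32-bit integers in big-endian format (most significant byte first), while fread by default uses the byte ordering of your machine. On most systems that is little-endian, meaning that the least significant byte of a multi-byte value is expected first. You use swapbytes to reverse the byte order once the value has been read in.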
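If you prefer, an alternative to swapbytes is to tell fread the byte order directly through its machine format argument. The sketch below writes the four magic-number bytes to a temporary file (made up for illustration) and reads them back both ways.

```matlab
% The image magic number 0x00000803 as it sits on disk, MSB first
fname = tempname;
fid = fopen(fname, 'w');
fwrite(fid, [0 0 8 3], 'uint8');
fclose(fid);

fid = fopen(fname, 'r');
nativeValue = fread(fid, 1, 'uint32');                   % uses the machine's own byte order
frewind(fid);
bigEndianValue = fread(fid, 1, 'uint32', 0, 'ieee-be');  % 2051, no swapbytes needed
fclose(fid);
delete(fname);

fprintf('native: %d, big-endian: %d\n', nativeValue, bigEndianValue);
```

On a little-endian machine the two printed values differ, which is exactly the discrepancy swapbytes corrects.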

As such, I have modified this code for you so that it's an actual function that takes in two strings: the full path to the image file of digits and the full path to the labels file. I have also changed the code so that the images are stored in a 3D numeric matrix as opposed to a cell array for faster processing. Take note that when you start reading in the actual image data, each pixel is an unsigned 8-bit integer, so there's no need to do any swapping of bytes. That was only required when reading in multiple bytes in one fread call:

function [images, labels] = mnist_parse(path_to_digits, path_to_labels)

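% Open the image file and stop early if it could not be found
fid1 = fopen(path_to_digits, 'r');
if fid1 == -1
    error('Could not open image file: %s', path_to_digits);
end

% The labels file
fid2 = fopen(path_to_labels, 'r');
if fid2 == -1
    fclose(fid1);
    error('Could not open labels file: %s', path_to_labels);
end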

% Read in magic numbers for both files
A = fread(fid1, 1, 'uint32');
magicNumber1 = swapbytes(uint32(A)); % Should be 2051
fprintf('Magic Number - Images: %d\n', magicNumber1);

A = fread(fid2, 1, 'uint32');
magicNumber2 = swapbytes(uint32(A)); % Should be 2049
fprintf('Magic Number - Labels: %d\n', magicNumber2);

% Read in total number of images
% Ensure that this number matches with the labels file
A = fread(fid1, 1, 'uint32');
totalImages = swapbytes(uint32(A));
A = fread(fid2, 1, 'uint32');
if totalImages ~= swapbytes(uint32(A))
    error('Total number of images read from images and labels files are not the same');
end
fprintf('Total number of images: %d\n', totalImages);

% Read in number of rows
A = fread(fid1, 1, 'uint32');
numRows = swapbytes(uint32(A));

% Read in number of columns
A = fread(fid1, 1, 'uint32');
numCols = swapbytes(uint32(A));

fprintf('Dimensions of each digit: %d x %d\n', numRows, numCols);

% For each image, store into an individual slice
images = zeros(numRows, numCols, totalImages, 'uint8');
for k = 1 : totalImages
    % Read in numRows*numCols pixels at a time
    A = fread(fid1, numRows*numCols, 'uint8');

    % Reshape so that it becomes a matrix
    % We are actually reading this in column major format
    % so we need to transpose this at the end
    images(:,:,k) = reshape(uint8(A), numCols, numRows).';
end

% Read in the labels
labels = fread(fid2, totalImages, 'uint8');

% Close the files
fclose(fid1);
fclose(fid2);

end
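If the per-image loop turns out to be slow, one possible variation is to read every pixel in a single fread call and rearrange afterwards. This is a sketch on a made-up two-image file rather than part of the function above. It reads the header with the 'ieee-be' machine format instead of swapbytes, reshapes with the columns first because the file is row-major, and then uses permute to swap rows and columns for every slice at once.

```matlab
% Build a tiny image file in the same layout: 2 images of 2 rows x 3 columns
fname = tempname;
fid = fopen(fname, 'w');
fwrite(fid, [2051 2 2 3], 'uint32', 0, 'ieee-be');  % magic, count, rows, columns
fwrite(fid, 1:12, 'uint8');                         % pixels in row-major order
fclose(fid);

fid = fopen(fname, 'r');
header = fread(fid, 4, 'uint32', 0, 'ieee-be');     % big-endian read, so no swapbytes
totalImages = header(2);
numRows = header(3);
numCols = header(4);
raw = fread(fid, inf, 'uint8=>uint8');              % all remaining pixels at once
fclose(fid);
delete(fname);

% Columns vary fastest in the file, so reshape with numCols first, then swap
images = permute(reshape(raw, numCols, numRows, totalImages), [2 1 3]);
disp(images(:,:,1));   % [1 2 3; 4 5 6]
```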

To call this function, simply specify the path to both the image file and the labels file. Assuming you are running this file in the same directory where the files are located, you would do one of the following for the training images:

[images, labels] = mnist_parse('train-images-idx3-ubyte', 'train-labels-idx1-ubyte');

Also, you would do the following for the test images:

[images, labels] = mnist_parse('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte');
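fopen returns -1 when it cannot find a file, and nothing can be read from that identifier. File names can differ depending on how the archives were extracted (for example a dot instead of a hyphen before idx3), so it can help to check what is actually on disk first. A small sketch, assuming the files sit in the current directory:

```matlab
% List anything in the current directory that looks like an MNIST file
listing = dir('*ubyte*');
for i = 1:numel(listing)
    fprintf('%s (%d bytes)\n', listing(i).name, listing(i).bytes);
end

% Check one specific name before passing it to mnist_parse
candidate = 'train-images-idx3-ubyte';   % assumed name; use what the listing shows
found = (exist(candidate, 'file') == 2);
if ~found
    fprintf('%s was not found in %s\n', candidate, pwd);
end
```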

To access the kth digit, you would simply do:

digit = images(:,:,k);

The corresponding label for the kth digit would be:

lbl = labels(k);
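Because the labels line up with the third dimension of images, logical indexing pulls out every example of one digit. A sketch on made-up stand-in data (the real arrays would come from mnist_parse):

```matlab
% Stand-in data: 4 tiny "images" and their labels (made up for illustration)
images = zeros(2, 2, 4, 'uint8');
for k = 1:4
    images(:,:,k) = k;          % fill slice k with the value k
end
labels = [3; 1; 3; 0];

% All slices whose label is 3
mask = (labels == 3);
threes = images(:,:,mask);
fprintf('Found %d examples of digit 3\n', size(threes, 3));   % Found 2 examples of digit 3
```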

Finally, the code that I have seen on Github assumes that the rows correspond to training examples and the columns correspond to features. To get the data into that format, simply reshape it so that the pixels of each image are spread out over the columns of a single row.

Therefore, just do this:

[traindata, traingnd] = mnist_parse('train-images-idx3-ubyte', 'train-labels-idx1-ubyte');
traindata = double(reshape(traindata, size(traindata,1)*size(traindata,2), []).');
traingnd = double(traingnd);

[testdata, testgnd] = mnist_parse('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte');
testdata = double(reshape(testdata, size(testdata,1)*size(testdata,2), []).');
testgnd = double(testgnd);

The above uses the same variable names as the script, so you should be able to plug this in and it should work. The second line reshapes the matrix so that each digit is in a column, and the transpose then puts each digit in a row. We also need to cast to double as that is what the Github code is doing. The same logic is applied to the test data. Also take note that I've explicitly cast the training and test labels to double to ensure maximum compatibility with whatever algorithms you decide to use on this data.
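To see what the reshape and transpose do, here is a sketch on a made-up array of two 2 x 3 "images". Note that reshape walks each image column by column, so each row of the result is the column-major flattening of one digit.

```matlab
% Two made-up 2 x 3 images stacked along the third dimension
images = cat(3, uint8([1 2 3; 4 5 6]), uint8([7 8 9; 10 11 12]));

% Same pattern as above: pixels down the columns first, then transpose
X = double(reshape(images, size(images,1)*size(images,2), []).');

disp(size(X));    % 2 6  (one row per image, one column per pixel)
disp(X(1,:));     % 1 4 2 5 3 6

% A row can be turned back into its image with the original dimensions
firstImage = reshape(X(1,:), size(images,1), size(images,2));
```

This ordering is fine as long as the training and test sets are flattened the same way, which is the case here.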


Happy digit hacking!

Author by

SKM

Updated on June 04, 2022

Comments

  • SKM
    SKM almost 2 years

    I am trying to run the code given in the link

    https://github.com/bd622/DiscretHashing

    Discrete Hashing is a method for dimensionality reduction that is used for approximate nearest neighbor search. I want to run the implementation on the MNIST database, which is available at http://yann.lecun.com/exdb/mnist/. I have extracted the files from their compressed gz format.

    PROBLEM 1 :

    Using the solution to read MNIST database provided in Reading MNIST Image Database binary file in MATLAB

    I am getting the following error:

    Error using fread
    Invalid file identifier.  Use fopen to generate a valid file identifier.
    
    Error in Reading (line 7)
    A = fread(fid, 1, 'uint32');
    

    Here is the code:

    clear all;
    close all;
    
    %//Open file
    fid = fopen('t10k-images-idx3-ubyte', 'r');
    
    A = fread(fid, 1, 'uint32');
    magicNumber = swapbytes(uint32(A));
    
    %//For each image, store into an individual cell
    imageCellArray = cell(1, totalImages);
    for k = 1 : totalImages
        %//Read in numRows*numCols pixels at a time
        A = fread(fid, numRows*numCols, 'uint8');
        %//Reshape so that it becomes a matrix
        %//We are actually reading this in column major format
        %//so we need to transpose this at the end
        imageCellArray{k} = reshape(uint8(A), numCols, numRows)';
    end
    
    %//Close the file
    fclose(fid);
    

    UPDATE : Problem 1 solved and the revised code is

    clear all;
    close all;
    
    %//Open file
    fid = fopen('t10k-images.idx3-ubyte', 'r');
    
    A = fread(fid, 1, 'uint32');
    magicNumber = swapbytes(uint32(A));
    
    %//Read in total number of images
    %//A = fread(fid, 4, 'uint8');
    %//totalImages = sum(bitshift(A', [24 16 8 0]));
    
    %//OR
    A = fread(fid, 1, 'uint32');
    totalImages = swapbytes(uint32(A));
    
    %//Read in number of rows
    %//A = fread(fid, 4, 'uint8');
    %//numRows = sum(bitshift(A', [24 16 8 0]));
    
    %//OR
    A = fread(fid, 1, 'uint32');
    numRows = swapbytes(uint32(A));
    
    %//Read in number of columns
    %//A = fread(fid, 4, 'uint8');
    %//numCols = sum(bitshift(A', [24 16 8 0]));
    
    %// OR
    A = fread(fid, 1, 'uint32');
    numCols = swapbytes(uint32(A));
    
    for k = 1 : totalImages
        %//Read in numRows*numCols pixels at a time
        A = fread(fid, numRows*numCols, 'uint8');
        %//Reshape so that it becomes a matrix
        %//We are actually reading this in column major format
        %//so we need to transpose this at the end
        imageCellArray{k} = reshape(uint8(A), numCols, numRows)';
    end
    
    %//Close the file
    fclose(fid);
    

    PROBLEM 2:

    I cannot understand how to apply the 4 files of MNIST in the code. The code contains variables

    traindata = double(traindata);
    testdata = double(testdata);
    

    How do I prepare the MNIST database so that I can apply to the implementation?

    UPDATE : I implemented the solution but I keep getting this error

    Error using fread
    Invalid file identifier.  Use fopen to generate a valid file identifier.
    
    Error in mnist_parse (line 11)
    A = fread(fid1, 1, 'uint32');
    

    These are the files

    demo.m % this is the main file that calls the function to read in the MNIST data

    clear all
    clc
    [Trainimages, Trainlabels] = mnist_parse('C:\Users\Desktop\MNIST\train-images-idx3-ubyte', 'C:\Users\Desktop\MNIST\train-labels-idx1-ubyte');
    
    [Testimages, Testlabels] = mnist_parse('t10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte');
    
    k=5;
    digit = images(:,:,k);
    lbl = label(k);
    

     function [images, labels] = mnist_parse(path_to_digits, path_to_labels)
    
    % Open files
    fid1 = fopen(path_to_digits, 'r');
    
    % The labels file
    fid2 = fopen(path_to_labels, 'r');
    
    % Read in magic numbers for both files
    A = fread(fid1, 1, 'uint32');
    magicNumber1 = swapbytes(uint32(A)); % Should be 2051
    fprintf('Magic Number - Images: %d\n', magicNumber1);
    
    A = fread(fid2, 1, 'uint32');
    magicNumber2 = swapbytes(uint32(A)); % Should be 2049
    fprintf('Magic Number - Labels: %d\n', magicNumber2);
    
    % Read in total number of images
    % Ensure that this number matches with the labels file
    A = fread(fid1, 1, 'uint32');
    totalImages = swapbytes(uint32(A));
    A = fread(fid2, 1, 'uint32');
    if totalImages ~= swapbytes(uint32(A))
        error('Total number of images read from images and labels files are not the same');
    end
    fprintf('Total number of images: %d\n', totalImages);
    
    % Read in number of rows
    A = fread(fid1, 1, 'uint32');
    numRows = swapbytes(uint32(A));
    
    % Read in number of columns
    A = fread(fid1, 1, 'uint32');
    numCols = swapbytes(uint32(A));
    
    fprintf('Dimensions of each digit: %d x %d\n', numRows, numCols);
    
    % For each image, store into an individual slice
    images = zeros(numRows, numCols, totalImages, 'uint8');
    for k = 1 : totalImages
        % Read in numRows*numCols pixels at a time
        A = fread(fid1, numRows*numCols, 'uint8');
    
        % Reshape so that it becomes a matrix
        % We are actually reading this in column major format
        % so we need to transpose this at the end
        images(:,:,k) = reshape(uint8(A), numCols, numRows).';
    end
    
    % Read in the labels
    labels = fread(fid2, totalImages, 'uint8');
    
    % Close the files
    fclose(fid1);
    fclose(fid2);
    
    end
    
  • SKM
    SKM over 7 years
    Thank you so much for your detailed explanation. I could not log into my Stack Overflow account due to some glitch, which is why I could not check your answer sooner. I ran your code step by step, but MATLAB throws the error: Error using fread Invalid file identifier. Use fopen to generate a valid file identifier. Error in mnist_parse (line 11) A = fread(fid1, 1, 'uint32'); Error in demo (line 3) [Trainimages, Trainlabels] = mnist_parse('C:\Users\Desktop\MNIST\train-images-idx3-ubyte', 'C:\Users\Desktop\MNIST\train-labels-idx
  • SKM
    SKM over 7 years
    Based on your observation that I missed reading in the number of totalImages, rows and columns, I have corrected that part and updated my Question. However, I am unable to get rid of this new error that appears when I implement your solution. I have not used the last part of the code, which I should apply to the program from GitHub. Could you please let me know what I should do so that the error goes away?
  • rayryeng
    rayryeng over 7 years
    It's not working because your path is incorrect. In between Users and Desktop in your path should be your username. There is no such directory. fopen works when a valid path for a file has been provided and you haven't done that. Please make sure the path to your file is absolutely correct... Or place the script in the same directory as the MNIST data and use local paths.
  • SKM
    SKM over 7 years
    The file names after unzipping were different from the ones that you had in your answer. :) After carefully going through the file names, I saw train-images.idx3-ubyte instead. Same thing for the others. So, I am no longer getting that error. However, there is a new problem: the GitHub code uses cateTrainTest, which is available with the cifar_10_gist database. It is used in the line [Pre, Rec] = evaluate_macro(cateTrainTest, Ret).
  • SKM
    SKM over 7 years
    Basically, its elements indicate similarity: 0 means that two data points are similar and 1 means that they are dissimilar, if I am not wrong. Do you know where I can find this data file for the MNIST database? Or could you help with some other hack so that the MNIST database can be used?
  • rayryeng
    rayryeng over 7 years
    I don't know what that line is doing. MNIST only contains digits and expected labels. That's all. BTW thanks for the accept!