how to read Mat v7.3 files in python ?

11,121

Solution 1

Writing in Matlab:

test = {'Hello', 'world!'; 'Good', 'morning'; 'See', 'you!'};
save('data.mat', 'test', '-v7.3') % v7.3 so that it is readable by h5py

enter image description here

Reading in Python (works for any number or rows or columns, but assumes that each cell is a string):

import h5py
import numpy as np

data = []
with h5py.File("data.mat") as f:
    for column in f['test']:
        row_data = []
        for row_number in range(len(column)):            
            row_data.append(''.join(map(unichr, f[column[row_number]][:])))   
        data.append(row_data)

print data
print np.transpose(data)

Output:

[[u'Hello', u'Good', u'See'], [u'world!', u'morning', u'you!']]

[[u'Hello' u'world!']
 [u'Good' u'morning']
 [u'See' u'you!']]

Solution 2

import numpy as np
import cPickle as pickle
import h5py

f = h5py.File('train/digitStruct.mat')

metadata= {}
metadata['height'] = []
metadata['label'] = []
metadata['left'] = []
metadata['top'] = []
metadata['width'] = []

def print_attrs(name, obj):
    vals = []
        if obj.shape[0] == 1:
            vals.append(int(obj[0][0]))
        else:
            for k in range(obj.shape[0]):
                vals.append(int(f[obj[k][0]][0][0]))
        metadata[name].append(vals)

for item in f['/digitStruct/bbox']:
    f[item[0]].visititems(print_attrs)

with open('train_metadata.pickle','wb') as pf:
  pickle.dump(metadata, pf, pickle.HIGHEST_PROTOCOL)    

I modified it from https://discussions.udacity.com/t/how-to-deal-with-mat-files/160657/3. Honestly, I can't figure out exactly what's going on with visititmes(). The HDF5 file is just too hierarchy and too abstract.

This metadata is a dictionary. The content of each key is an embedded array. The array has 33402 items, which correspond to the png file with the name in order. Each item is an array with length from 1~6. I count the numbers of different digits, which is 5137, 18130, 8691, 1434, 9,1.

What surprises me is that the pickle file is only 9 MB, more than 20 times smaller than the mat file. I guess the HDS file sacrifice storing space for hierarchy structure.

Note: I have converted the values to integers, for the sake of slicing images. Now the train_metadata.pickle file has a size of only 2 MB, 100 times than the mat file.

Share:
11,121
user824624
Author by

user824624

Updated on June 13, 2022

Comments

  • user824624
    user824624 almost 2 years

    I am trying to read the mat file given in the following website, ufldl.stanford.edu/housenumbers, in the file train.tar.gz, there is a mat file named digitStruct.mat.

    when i used scipy.io to read the mat file, it alerts me with the message ' please use hdf reader for matlab v7.3 files '.

    the original matlab file is provided as below

    load digitStruct.mat
    for i = 1:length(digitStruct)
        im = imread([digitStruct(i).name]);
        for j = 1:length(digitStruct(i).bbox)
            [height, width] = size(im);
            aa = max(digitStruct(i).bbox(j).top+1,1);
            bb = min(digitStruct(i).bbox(j).top+digitStruct(i).bbox(j).height, height);
            cc = max(digitStruct(i).bbox(j).left+1,1);
            dd = min(digitStruct(i).bbox(j).left+digitStruct(i).bbox(j).width, width);
    
            imshow(im(aa:bb, cc:dd, :));
            fprintf('%d\n',digitStruct(i).bbox(j).label );
            pause;
        end
    end
    

    as shown above, the mat file has the key 'digitStruct', and within 'digitStruct', key 'name' and 'bbox' can be found, I used h5py API to read the file.

    import h5py
    f = h5py.File('train.mat')
    print len( f['digitStruct']['name'] ), len(f['digitStruct']['bbox']   )
    

    I can read the array, however when I loop though the array, how can I read each item?

    for i in f['digitStruct']['name']:
        print i # only print out the HDF5 ref