Can't load 'mnist-original' dataset using sklearn
Solution 1
I just faced the same issue, and it took me some time to find the problem. One possible cause is that the data was corrupted during the first download. To fix it, remove the cached data. Find the scikit-learn data home directory as follows:
from sklearn.datasets import get_data_home
print(get_data_home())
Clean the directory and redownload the dataset. This solution works for me. For reference: https://github.com/ageron/handson-ml/issues/143
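As a minimal sketch of the cleanup step, the snippet below deletes the cached mldata folder so that the next fetch starts from scratch. It uses only the standard library. The helper names default_data_home and clear_cached_mldata are hypothetical, and the default location assumed here ($SCIKIT_LEARN_DATA if set, otherwise ~/scikit_learn_data) is meant to mirror what get_data_home() normally reports. Compare it with the path printed by get_data_home() before deleting anything.

```python
import os
import shutil
from pathlib import Path


def default_data_home() -> Path:
    """Assumed default: $SCIKIT_LEARN_DATA if set, otherwise ~/scikit_learn_data."""
    raw = os.environ.get("SCIKIT_LEARN_DATA", "~/scikit_learn_data")
    return Path(raw).expanduser()


def clear_cached_mldata(data_home) -> bool:
    """Delete the 'mldata' subfolder of data_home.

    Returns True if a folder was removed, False if there was nothing to remove.
    """
    cache_dir = Path(data_home) / "mldata"
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Calling clear_cached_mldata(default_data_home()) and then fetching the dataset again should force a fresh download.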
This is also related to the following question: How to use datasets.fetch_mldata() in sklearn?
Solution 2
Unfortunately, fetch_mldata() has been replaced by fetch_openml() in recent versions of sklearn.
So, instead of using:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
You must use:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
x = mnist.data
y = mnist.target
The shape of x will be (70000, 784).
The shape of y will be (70000,).
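If the same script has to run on both old and new scikit-learn installations, one option is a small wrapper that tries fetch_openml first and falls back to fetch_mldata. This is only a sketch: load_mnist is a hypothetical helper name, it assumes fetch_openml exists from scikit-learn 0.20 onward, and both branches still need network access or an already cached copy of the data.

```python
def load_mnist(data_home=None):
    """Return (X, y) for MNIST using whichever fetcher this scikit-learn offers."""
    try:
        # Available in scikit-learn 0.20 and later.
        from sklearn.datasets import fetch_openml
    except ImportError:
        # Older releases only; relies on mldata.org or a cached .mat file.
        from sklearn.datasets import fetch_mldata
        mnist = fetch_mldata("MNIST original", data_home=data_home)
    else:
        mnist = fetch_openml("mnist_784", data_home=data_home)
    return mnist.data, mnist.target
```

With fetch_openml the labels may come back as strings such as '5', so a cast like y.astype(int) can be needed before training.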
Solution 3
A quick update for the question here:
mldata.org still seems to be down, and scikit-learn is going to remove fetch_mldata.
Solution for the moment: since the lines above create an empty folder at the data_home location, download a copy of the data from https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat and place it in the empty mldata folder under data_home (by default ~/scikit_learn_data/mldata/).
It worked for me.
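The manual placement can be scripted as well. The sketch below copies an already downloaded mnist-original.mat into the mldata folder under a given data home, creating the folder if it is missing. The function name install_mnist_mat and both of its parameters are hypothetical; the target file name is the one mentioned in this thread.

```python
import shutil
from pathlib import Path


def install_mnist_mat(downloaded_file, data_home):
    """Copy a manually downloaded .mat file into <data_home>/mldata/.

    Returns the path of the installed copy.
    """
    target_dir = Path(data_home) / "mldata"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = target_dir / "mnist-original.mat"
    shutil.copyfile(downloaded_file, target)
    return target
```

For example, install_mnist_mat("mnist-original.mat", Path.home() / "scikit_learn_data") would place the file where an older fetch_mldata('MNIST original') call is expected to look first, assuming the default data home is in use.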
Solution 4
Instead of:
from sklearn.datasets.mldata import fetch_mldata
use:
from sklearn.datasets import fetch_mldata
And then:
mnist = fetch_mldata('MNIST original')
X = mnist.data.astype('float64')
y = mnist.target
Solution 5
For people having the same issue: it was a connection problem. If you get a similar error, check that you have the entire mnist-original.mat file, as suggested by @vivek-kumar. Current file size: 55.4 MB.
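A quick way to run that check is to compare the size on disk with the expected figure. The sketch below is one possible version: the 55.4 MB default is the size quoted in this thread, the 0.5 MB tolerance is an arbitrary assumption, and the helper names are hypothetical. It also reports the size in both decimal megabytes and mebibytes, since 55.4 × 10^6 bytes is about 52.8 MiB, which would explain why a download dialog can show a figure close to 52.9 for the same file.

```python
import os

BYTES_PER_MB = 10**6   # decimal megabyte
BYTES_PER_MIB = 2**20  # binary mebibyte


def file_size_report(path):
    """Return the file size in bytes, decimal MB, and MiB."""
    size = os.path.getsize(path)
    return {"bytes": size, "MB": size / BYTES_PER_MB, "MiB": size / BYTES_PER_MIB}


def looks_complete(path, expected_mb=55.4, tolerance_mb=0.5):
    """True if the size is within tolerance_mb of expected_mb (decimal MB)."""
    actual_mb = os.path.getsize(path) / BYTES_PER_MB
    return abs(actual_mb - expected_mb) <= tolerance_mb
```

If looks_complete returns False for the cached mnist-original.mat, the download was most likely truncated and the file should be deleted and fetched again.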
Comments
-
albus_c about 2 years
This question is similar to what was asked here and here. Unfortunately, in my case the suggested solution didn't fix the problem.
I need to work with the MNIST dataset but I can't fetch it, even if I specify the address of the scikit_learn_data/mldata/ folder (see below). How can I fix this? In case it might help, I'm using Anaconda.
Code:
from sklearn.datasets.mldata import fetch_mldata
dataset = fetch_mldata('mnist-original', data_home='/Users/michelangelo/scikit_learn_data/mldata/')
mnist = fetch_mldata('MNIST original')
Error:
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-5-dc4d45bc928e> in <module>()
----> 1 mnist = fetch_mldata('MNIST original')

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/sklearn/datasets/mldata.pyc in fetch_mldata(dataname, target_name, data_name, transpose_data, data_home)
    168     # load dataset matlab file
    169     with open(filename, 'rb') as matlab_file:
--> 170         matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
    171
    172     # -- extract data from matlab_dict

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio.pyc in loadmat(file_name, mdict, appendmat, **kwargs)
    134     variable_names = kwargs.pop('variable_names', None)
    135     MR = mat_reader_factory(file_name, appendmat, **kwargs)
--> 136     matfile_dict = MR.get_variables(variable_names)
    137     if mdict is not None:
    138         mdict.update(matfile_dict)

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio5.pyc in get_variables(self, variable_names)
    290                 continue
    291             try:
--> 292                 res = self.read_var_array(hdr, process)
    293             except MatReadError as err:
    294                 warnings.warn(

/Users/michelangelo/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio5.pyc in read_var_array(self, header, process)
    250            `process`.
    251         '''
--> 252         return self._matrix_reader.array_from_header(header, process)
    253
    254     def get_variables(self, variable_names=None):

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.array_from_header()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.array_from_header()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.read_numeric()

mio5_utils.pyx in scipy.io.matlab.mio5_utils.VarReader5.read_element()

streams.pyx in scipy.io.matlab.streams.FileStream.read_string()

IOError: could not read bytes
-
seralouk over 6 years
If you type this: from sklearn.datasets import fetch_mldata, mnist = fetch_mldata('MNIST original') does it work?
-
albus_c over 6 years
Nope, and I get
SyntaxError: invalid syntax
-
seralouk over 6 years
Copy and paste each command separately: 1) from sklearn.datasets import fetch_mldata 2) mnist = fetch_mldata('MNIST original')
-
albus_c over 6 years
Unfortunately that was not the problem.
-
seralouk over 6 years
What is your sklearn version? Use import sklearn and then sklearn.__version__ to print the version.
-
albus_c over 6 years
Version number: '0.19.1'
-
albus_c over 6 years
Thanks for the reply Vivek! I still get
IOError: could not read bytes
-
Vivek Kumar over 6 years
@albus_c Possibly the download is corrupt. Please check the size of the downloaded file in scikit_learn_data/mldata. It should be at least 52 MB. If not, delete it and try again.
-
Vivek Kumar over 6 years
@albus_c Precisely 52.9 MB. If that still does not work, please download the file from this link in a browser and replace the file in that folder.
-
albus_c over 6 years
Most likely that was the problem. I will now try with the direct link.
-
Vivek Kumar over 6 years
Yes, on the system it shows as 55.4 MB, but during the download it is shown as 52.9 MB. Please consider upvoting and accepting the answer if it helped.
-
Eric over 5 years
fetch_mldata is now deprecated :(
-
Nathan over 4 years
Hi Puneet! Could you add an explanation as to why this is the correct answer?
-
csaladenes about 4 years
This now throws
urlopen error [Errno -3] Temporary failure in name resolution