Why am I getting a data conversion warning?


scikit-learn expects y to be a 1-D array, but your labels variable is 2-D: labels.shape is (N, 1). The warning tells you to use labels.ravel(), which turns labels into a 1-D array with shape (N,).
Reshaping also works: labels = labels.reshape((N,))
Come to think of it, so does calling squeeze: labels = labels.squeeze()
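
To make the three options concrete, here is a minimal sketch on a small stand-in array. The values and N = 4 are made up for illustration and are not taken from MNIST; only the (N, 1) shape mirrors the labels array in the question.

```python
import numpy as np

# Stand-in for the (N, 1) labels array from the question; values are made up.
N = 4
labels = np.zeros((N, 1), dtype=np.int8)
labels[:, 0] = [3, 1, 4, 1]

flat_ravel = labels.ravel()          # flattens to shape (N,)
flat_reshape = labels.reshape((N,))  # explicit target shape (N,)
flat_squeeze = labels.squeeze()      # drops the length-1 axis, giving (N,)

print(labels.shape, flat_ravel.shape, flat_reshape.shape, flat_squeeze.shape)
```

One difference worth knowing: squeeze removes every length-1 axis, so for an array of shape (1, 1) it returns a 0-d array rather than shape (1,). ravel() does not have that edge case, which is probably why the warning suggests it.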

I guess the gotcha here is that in numpy, a 1-D array is different from a 2-D array with one of its dimensions equal to 1.
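
A short sketch of why the distinction matters in practice, using made-up arrays: the two shapes hold the same numbers, but numpy broadcasting treats them differently, so mixing them can silently produce a 2-D result.

```python
import numpy as np

col = np.array([[1], [2], [3]])  # 2-D column vector, shape (3, 1)
flat = np.array([1, 2, 3])       # 1-D array, shape (3,)

print(col.shape, col.ndim)
print(flat.shape, flat.ndim)

# Broadcasting (3, 1) against (3,) expands both operands to (3, 3).
diff = col - flat
print(diff.shape)

# Flattening the column first gives the elementwise result of shape (3,).
same = col.ravel() - flat
print(same.shape)
```

Here diff is a 3x3 matrix of pairwise differences, while same is the elementwise difference most people would expect.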

Author by

shmibloo

Updated on June 12, 2022

Comments

  • shmibloo
    shmibloo almost 2 years

    I am a relative newbie in this area, so I would appreciate your help. I am playing around with the MNIST dataset. I took the code from http://g.sweyla.com/blog/2012/mnist-numpy/ but changed "images" to be two-dimensional so that every image is a single feature vector. Then I ran PCA on the data, trained an SVM on the result, and checked the score. Everything seems to work fine, but I am getting the following warning and I am not sure why.

    "DataConversionWarning: A column-vector y was passed when a 1d array was expected.
    Please change the shape of y to (n_samples, ), for example using ravel()."
    

    I have tried several things but can't seem to get rid of this warning. Any suggestions? Here is the full code (ignore the missing indentations, seems like they got a little messed up copying the code here):

    import os, struct
    from array import array as pyarray
    from numpy import append, array, int8, uint8, zeros, arange
    from sklearn import svm, decomposition
    #from pylab import *
    #from matplotlib import pyplot as plt
    
    def load_mnist(dataset="training", digits=arange(10), path="."):
        """
        Loads MNIST files into numpy arrays: images with shape
        (N, rows*cols) and labels with shape (N, 1).

        Adapted from: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py
        """
    
        if dataset == "training":
            fname_img = os.path.join(path, 'train-images.idx3-ubyte')
            fname_lbl = os.path.join(path, 'train-labels.idx1-ubyte')
        elif dataset == "testing":
            fname_img = os.path.join(path, 't10k-images.idx3-ubyte')
            fname_lbl = os.path.join(path, 't10k-labels.idx1-ubyte')
        else:
            raise ValueError("dataset must be 'testing' or 'training'")
    
        flbl = open(fname_lbl, 'rb')
        magic_nr, size = struct.unpack(">II", flbl.read(8))
        lbl = pyarray("b", flbl.read())
        flbl.close()
    
        fimg = open(fname_img, 'rb')
        magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
        img = pyarray("B", fimg.read())
        fimg.close()
    
        ind = [ k for k in range(size) if lbl[k] in digits ]
        N = len(ind)
    
        images = zeros((N, rows*cols), dtype=uint8)
        labels = zeros((N, 1), dtype=int8)
        for i in range(len(ind)):
            images[i] = array(img[ ind[i]*rows*cols : (ind[i]+1)*rows*cols ])
            labels[i] = lbl[ind[i]]
    
        return images, labels
    
    if __name__ == "__main__":
        images, labels = load_mnist('training', arange(10),"path...")
        # Fit PCA once, keeping the first 200 components
        pca = decomposition.PCA(n_components=200)
        images_reduced = pca.fit_transform(images)
        lin_classifier = svm.LinearSVC()
        lin_classifier.fit(images_reduced, labels)
        images2, labels2 = load_mnist('testing', arange(10),"path...")
        images2_reduced = pca.transform(images2)
        score = lin_classifier.score(images2_reduced,labels2)
        print(score)
    

    Thanks for the help!