sklearn issue: Found arrays with inconsistent numbers of samples when doing regression


In fit(X, y), the input parameter X is expected to be a 2-D array of shape (n_samples, n_features). If X in your data is one-dimensional, you can reshape it into a 2-D array like this: regr.fit(chntrain_X.reshape(len(chntrain_X), 1), chntrain_Y)
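As a minimal sketch of that reshape, the snippet below fits a LinearRegression on synthetic data. The two series are made-up stand-ins for the GDP columns, not the asker's CSV, and the names chntrain_X and chntrain_Y follow this answer (they correspond to chntrain and austrain in the question). Note that predict() and score() expect the same 2-D layout as fit(), so the test input is reshaped as well. reshape(-1, 1) is equivalent to reshape(len(x), 1).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: 108 yearly values for each country.
# The relation is exactly linear so the fitted result is easy to check.
rng = np.random.default_rng(0)
chn = rng.normal(size=108)
aus = 0.5 * chn + 1.0

# Same split as in the question: last observation held out for testing.
chntrain_X, chntest_X = chn[:-1], chn[-1:]
chntrain_Y, chntest_Y = aus[:-1], aus[-1:]

# 1-D arrays of shape (107,) and (1,) become 2-D columns: (107, 1) and (1, 1).
X_train = chntrain_X.reshape(-1, 1)
X_test = chntest_X.reshape(-1, 1)

regr = LinearRegression()
regr.fit(X_train, chntrain_Y)   # y may stay 1-D
pred = regr.predict(X_test)     # predict needs a 2-D input too

print("X_train shape:", X_train.shape)
print("Coefficients:", regr.coef_)
print("Prediction:", pred)
```

With a single feature, each row of X is one sample and the single column is that feature, which is why the array has shape (n_samples, 1) rather than (1, n_samples).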

Author by

pyman

Updated on July 20, 2022

Comments

  • pyman, almost 2 years ago

    This question seems to have been asked before, but I can't comment on the accepted answer to ask for clarification, and I couldn't figure out the solution it provided.

    I am trying to learn how to use sklearn with my own data. I have the annual % change in GDP for two countries over the past 100 years, and for now I am working with a single variable. What I want to do is use sklearn to predict the GDP % change for country A given the % change in country B's GDP.

    The problem is that I receive an error saying:

    ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]

    Here is my code:

    import sklearn.linear_model as lm
    import numpy as np
    import scipy.stats as st
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates
    
    
    def bytespdate2num(fmt, encoding='utf-8'):#function to convert bytes to string for the dates.
        strconverter = mdates.strpdate2num(fmt)
        def bytesconverter(b):
            s = b.decode(encoding)
            return strconverter(s)
        return bytesconverter
    
    dataCSV = open('combined_data.csv')
    
    comb_data = []
    
    for line in dataCSV:
        comb_data.append(line)
    
    date, chngdpchange, ausgdpchange = np.loadtxt(comb_data, delimiter=',', unpack=True, converters={0: bytespdate2num('%d/%m/%Y')})
    
    
    chntrain = chngdpchange[:-1]
    chntest = chngdpchange[-1:]
    
    austrain = ausgdpchange[:-1]
    austest = ausgdpchange[-1:]
    
    regr = lm.LinearRegression()
    regr.fit(chntrain, austrain)
    
    print('Coefficients: \n', regr.coef_)
    
    print("Residual sum of squares: %.2f"
          % np.mean((regr.predict(chntest) - austest) ** 2))
    
    print('Variance score: %.2f' % regr.score(chntest, austest))
    
    plt.scatter(chntest, austest,  color='black')
    plt.plot(chntest, regr.predict(chntest), color='blue')
    
    plt.xticks(())
    plt.yticks(())
    
    plt.show()
    

    What am I doing wrong? I tried to apply the sklearn tutorial (which uses a diabetes data set) to my own simple data. My data contains only the date, country A's % change in GDP for that year, and country B's % change in GDP for the same year.

    I tried the solutions here and here (the second was an attempt to find out more about the solution in the first link), but I get exactly the same error.

    Here is the full traceback in case you want to see it:

    Traceback (most recent call last):
      File "D:\My Stuff\Dropbox\Python\Python projects\test regression\tester.py", line 34, in <module>
        regr.fit(chntrain, austrain)
      File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\linear_model\base.py", line 376, in fit
        y_numeric=True, multi_output=True)
      File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 454, in check_X_y
        check_consistent_length(X, y)
      File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 174, in check_consistent_length
        "%s" % str(uniques))
    ValueError: Found arrays with inconsistent numbers of samples: [  1 107]