sklearn issue: Found arrays with inconsistent numbers of samples when doing regression

python arrays numpy machine-learning scikit-learn

24,139

In fit(X, y), the input parameter X is supposed to be a 2-D array of shape (n_samples, n_features). If X in your data is only one-dimensional, you can reshape it into a 2-D array with a single column like this: regr.fit(chntrain_X.reshape(len(chntrain_X), 1), chntrain_Y)
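Here is a minimal sketch of that reshape in a self-contained script. The data is a made-up stand-in (random numbers, not the GDP figures from the question), and the names chntrain_X and chntrain_Y simply follow the line above. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: 107 yearly % changes for two countries.
rng = np.random.default_rng(0)
chntrain_X = rng.normal(size=107)  # 1-D array, shape (107,)
chntrain_Y = 0.5 * chntrain_X + rng.normal(scale=0.1, size=107)

# One row per sample, one column per feature: shape (107, 1).
# chntrain_X.reshape(-1, 1) is an equivalent spelling.
X_2d = chntrain_X.reshape(len(chntrain_X), 1)

regr = LinearRegression()
regr.fit(X_2d, chntrain_Y)

# predict() expects the same 2-D layout, even for a single sample.
pred = regr.predict(np.array([[1.0]]))
print(X_2d.shape, regr.coef_.shape, pred.shape)
```

The same reshape would be needed for chntest before passing it to regr.predict() or regr.score() in the question's code.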


Author by

pyman

Updated on July 20, 2022

Comments

pyman almost 2 years

This question seems to have been asked before, but I can't comment on the accepted answer to ask for clarification, and I couldn't work out the solution it provided.

I am trying to learn how to use sklearn with my own data. I have the annual % change in GDP for 2 different countries over the past 100 years, and for now I am working with a single variable. What I want to do is use sklearn to predict the GDP % change for country A given the % change in country B's GDP.

The problem is that I receive an error saying:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 107]

Here is my code:

import sklearn.linear_model as lm
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import matplotlib.dates as mdates


def bytespdate2num(fmt, encoding='utf-8'):  # convert bytes to str before parsing the dates
    strconverter = mdates.strpdate2num(fmt)
    def bytesconverter(b):
        s = b.decode(encoding)
        return strconverter(s)
    return bytesconverter

dataCSV = open('combined_data.csv')

comb_data = []

for line in dataCSV:
    comb_data.append(line)

date, chngdpchange, ausgdpchange = np.loadtxt(comb_data, delimiter=',', unpack=True, converters={0: bytespdate2num('%d/%m/%Y')})


chntrain = chngdpchange[:-1]
chntest = chngdpchange[-1:]

austrain = ausgdpchange[:-1]
austest = ausgdpchange[-1:]

regr = lm.LinearRegression()
regr.fit(chntrain, austrain)

print('Coefficients: \n', regr.coef_)

print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(chntest) - austest) ** 2))

print('Variance score: %.2f' % regr.score(chntest, austest))

plt.scatter(chntest, austest,  color='black')
plt.plot(chntest, regr.predict(chntest), color='blue')

plt.xticks(())
plt.yticks(())

plt.show()

What am I doing wrong? I essentially tried to apply the sklearn tutorial (which uses a diabetes data set) to my own simple data. My data contains just the date, country A's % change in GDP for that year, and country B's % change in GDP for the same year.

I tried the solutions here and here (basically trying to find out more about the solution in the first link), but I still get exactly the same error.

Here is the full traceback in case you want to see it:

Traceback (most recent call last):
  File "D:\My Stuff\Dropbox\Python\Python projects\test regression\tester.py", line 34, in <module>
    regr.fit(chntrain, austrain)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\linear_model\base.py", line 376, in fit
    y_numeric=True, multi_output=True)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 454, in check_X_y
    check_consistent_length(X, y)
  File "D:\Programs\Installed\Python34\lib\site-packages\sklearn\utils\validation.py", line 174, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [  1 107]
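A rough sketch of where the [1 107] in that message may come from. It assumes that this older scikit-learn release promoted a 1-D X to a single row, roughly like np.atleast_2d, which is my understanding of those versions but is not confirmed by the traceback itself. The arrays are zero-filled placeholders of the same length as the training data.

```python
import numpy as np

# Placeholders with the same length as chntrain / austrain.
x = np.zeros(107)
y = np.zeros(107)

# Assumed behaviour: a 1-D X is treated as ONE sample with 107 features.
as_row = np.atleast_2d(x)  # shape (1, 107)

# What the regression actually needs: 107 samples with ONE feature each.
as_column = x.reshape(-1, 1)  # shape (107, 1)

# Compare the number of samples (first dimension) in X and y.
sample_counts_bad = sorted({as_row.shape[0], y.shape[0]})
sample_counts_ok = sorted({as_column.shape[0], y.shape[0]})
print(sample_counts_bad, sample_counts_ok)
```

Under that assumption, X is seen as 1 sample while y has 107, which matches the two numbers in the error.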