ValueError: negative dimensions are not allowed

python numpy machine-learning scikit-learn

40,037

The problem is caused by a size mismatch.

train_labels actually contains the classes of all the data. The sizes of train and train_labels should match.
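
One way to verify this before fitting is a small row-count check. The sketch below is only an illustration: the helper name check_same_n_samples is hypothetical (it is not part of scikit-learn), and it assumes each input either exposes a .shape attribute, as NumPy arrays and SciPy sparse matrices do, or supports len().

```python
def n_rows(data):
    # Sparse matrices do not support len(), so prefer .shape when present.
    if hasattr(data, "shape"):
        return data.shape[0]
    return len(data)


def check_same_n_samples(X, y):
    # Hypothetical helper: raise early if features and labels disagree
    # on the number of samples, instead of failing deep inside fit().
    n_X, n_y = n_rows(X), n_rows(y)
    if n_X != n_y:
        raise ValueError(
            "X has %d samples but y has %d samples" % (n_X, n_y)
        )
    return n_X


# Toy data standing in for the real matrices.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels = [0, 1, 0]
print(check_same_n_samples(features, labels))
```

In the question's script this could be called as check_same_n_samples(train, train_labels) just before ridge_classifier.fit(train, train_labels).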

Author by

Learner

Updated on July 23, 2022

Comments

Learner almost 2 years

I am playing around with some data from a Kaggle competition on text_analysis, and I keep getting the rather weird error described in the title whenever I try to fit my algorithm. I looked it up, and it had something to do with my matrix being too densely populated with nonzero elements while being represented as a sparse matrix. I reckon the problem lies with my train_labels in the code below: the labels consist of 24 columns, which isn't very common to begin with, and they are floats between 0 and 1 (inclusive). Despite having some idea of what the problem is, I have no idea how to tackle it properly, and my previous attempts haven't worked out so well. Do you have any suggestions on how I could solve this?

Code:

import numpy as np
import pandas as p
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import os
from sklearn.linear_model  import RidgeCV

dir = "C:/Users/Anonymous/Desktop/KAGA FOLDER/Hashtags"

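def clean_the_text(data):
    # Tokenize the text, strip trailing newlines from each token,
    # and join the tokens back into one space-separated string.
    tokens = nltk.word_tokenize(data)
    return " ".join(token.rstrip('\n') for token in tokens)


def loop_data(data):
    # Clean every document in the list in place and return the list.
    for i in range(len(data)):
        data[i] = clean_the_text(data[i])
    return data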

if __name__ == "__main__":
    print("loading data")
    train_text = loop_data(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1]))
    test_set = loop_data(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1]))
    train_labels  = np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,4:]



    #Vectorizing
    vectorizer = TfidfVectorizer(max_features = 10000,strip_accents = "unicode",analyzer = "word")
    ridge_classifier = RidgeCV(alphas = [0.001,0.01,0.1,1,10])
    all_data = train_text + test_set
    train_length  = len(train_text)

    print("fitting Vectorizer")
    vectorizer.fit(all_data)
    print("transforming text")
    all_data = vectorizer.transform(all_data)
    train = all_data[:train_length]
    test = all_data[train_length:]

    print("fitting and selecting models") 
    ridge_classifier.fit(train,train_labels)
    print("predicting")
    pred = ridge_classifier.predict(test)


    np.savetxt(os.path.join(dir, "submission.csv"), pred, fmt = "%d", delimiter = ",")
    print("submission_file created")

Traceback:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 56, in <module>
    ridge_classifier.fit(train,train_labels)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\ridge.py", line 817, in fit
    estimator.fit(X, y, sample_weight=sample_weight)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\ridge.py", line 724, in fit
    v, Q, QT_y = _pre_compute(X, y)
  File "C:\Python27\lib\site-packages\sklearn\linear_model\ridge.py", line 609, in _pre_compute
    K = safe_sparse_dot(X, X.T, dense_output=True)
  File "C:\Python27\lib\site-packages\sklearn\utils\extmath.py", line 78, in safe_sparse_dot
    ret = a * b
  File "C:\Python27\lib\site-packages\scipy\sparse\base.py", line 303, in __mul__
    return self._mul_sparse_matrix(other)
  File "C:\Python27\lib\site-packages\scipy\sparse\compressed.py", line 520, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=np.intc)
ValueError: negative dimensions are not allowed
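
A note on the last frame: np.empty(nnz, dtype=np.intc) fails because nnz is negative. One plausible reading, assuming this SciPy version stores the nonzero count of the sparse product X * X.T in a signed 32-bit integer, is that the count overflowed and wrapped around. The sketch below only illustrates that arithmetic. The value 3000000000 is a made-up count, not one measured from this data, and wrap_int32 is a hypothetical helper.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def wrap_int32(n):
    # Interpret n as a signed 32-bit integer (two's complement wraparound).
    return (n + 2**31) % 2**32 - 2**31


n_samples = 77946  # number of rows of train reported in this thread

# X * X.T is n_samples x n_samples, so this is an upper bound on its
# number of stored entries (reached only if the product is fully dense).
max_entries = n_samples * n_samples
print(max_entries)              # 6075578916
print(max_entries > INT32_MAX)  # True

# Any count between 2**31 and 2**32 - 1 turns negative once wrapped.
# 3000000000 is a hypothetical example of such a count.
print(wrap_int32(3000000000))   # -1294967296
```

If that reading is right, the size of the sample-by-sample product would be what triggers the error, independent of what the labels contain.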

I suspect that my labels are the problem, so here are the labels:

In [12]:
import pandas as pd
import numpy as np
import os
dir = "C:\Users\Anonymous\Desktop\KAGA FOLDER\Hashtags"
labels = np.array(pd.read_csv(os.path.join(dir,"train.csv")))[:,4:]
labels


Out[12]:
array([[0.0, 0.0, 1.0, ..., 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
       ..., 
       [0.0, 0.0, 0.0, ..., 1.0, 0.0, 0.0],
       [0.0, 0.385, 0.41, ..., 0.0, 0.0, 0.0],
       [0.0, 0.20199999999999999, 0.395, ..., 0.0, 0.0, 0.0]], dtype=object)
In [13]:
labels.shape
Out[13]:
(77946L, 24L)

Learner over 10 years

I'm sorry, but I'm a bit confused. Since I'm getting the labels out of train.csv, shouldn't their size match that of train?
shan.B over 10 years

You are reading from train.csv, but you then do some processing on the training data. You append the training and test data into all_data and take a part of it as train. How you obtain train_length seems confusing to me. Please try printing the size of train; this would verify or disprove my theory.
Learner over 10 years

The shape of train is: (77946, 10000).
shan.B over 10 years

Do you really have 10000 attributes for a classification? That's big.
Learner over 10 years

No, there I was only using one attribute, while the whole dataset has 3 attributes. So I'm definitely screwing up here. Thanks for pointing this out!
Learner over 10 years

OK, never mind, I just made a mistake while initializing a variable. I checked my data reading and that seems fine too.