Multi-Class Logistic Regression in SciKit Learn


You seem to be confusing the terms multiclass and multilabel (see http://scikit-learn.org/stable/modules/multiclass.html). In short:

  • Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.

Thus data is [n_samples, n_features] and labels are [n_samples]

  • Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.

Thus data is [n_samples, n_features] and labels are [n_samples, n_labels]
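
For illustration, here is a minimal sketch of the two label shapes (the class and topic encodings below are made up):

    import numpy as np

    # Multiclass: exactly one label per sample -> a 1-D array of shape (n_samples,)
    y_multiclass = np.array([0, 2, 1, 0])        # e.g. 0=orange, 1=apple, 2=pear

    # Multilabel: a binary indicator matrix of shape (n_samples, n_labels)
    y_multilabel = np.array([[1, 0, 1, 0],       # religion + finance
                             [0, 0, 0, 0],       # none of the topics
                             [0, 1, 0, 1]])      # politics + education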

And you seem to be looking for multilabel (since for multiclass the labels should be 1-dimensional). Currently, the only sklearn methods supporting multilabel are: Decision Trees, Random Forests, Nearest Neighbors, and Ridge Regression.

If you want to learn a multilabel problem with a different model, simply use OneVsRestClassifier as a multilabel wrapper around your LogisticRegression:

http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier
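
For example, a minimal sketch with toy data -- the feature matrix and label sets below are made up, and MultiLabelBinarizer is just one convenient way to build the indicator matrix:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy feature matrix, shape (n_samples, n_features)
    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])

    # Per-sample label sets -> binary indicator matrix, shape (n_samples, n_labels)
    Y = MultiLabelBinarizer().fit_transform(
        [{"religion"}, {"politics", "finance"}, {"finance"}, set()])

    # OneVsRestClassifier fits one binary LogisticRegression per label column
    clf = OneVsRestClassifier(LogisticRegression())
    clf.fit(X, Y)
    print(clf.predict(X))  # (n_samples, n_labels) indicator predictions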


Comments

  • Renée
    Renée about 4 years

    I am having trouble with the proper call of Scikit's Logistic Regression for the multi-class case. I am using the lbfgs solver, and I do have the multi_class parameter set to multinomial.

    It is unclear to me how to pass the true class labels when fitting the model. I had assumed it was the same as for the multi-class random forest classifier, where you pass an [n_samples, m_classes] dataframe. However, in doing this, I get an error that the data is of a bad shape: ValueError: bad input shape (20, 5) -- in this tiny example, there were 5 classes and 20 samples.

    On inspection, the documentation for the fit method says that the truth values are passed as [n_samples, ] -- which matches the error I'm getting -- but then I have no idea how to train the model with multiple classes. So this is my question: how do I pass the full set of class labels to the fit function?

    I've been unable to find sample code on the Internet to model this on, nor this question on StackOverflow, but I feel certain someone must know how to do it!

    In the code below, train_features = [n_samples, n_features] and truth_train = [n_samples, m_classes]:

    from sklearn.linear_model import LogisticRegressionCV

    clf = LogisticRegressionCV(class_weight='balanced', multi_class='multinomial', solver='lbfgs')
    clf.fit(train_features, truth_train)
    pred = clf.predict(test_features)
    
  • Renée
    Renée about 8 years
    Thank you for your response. I am actually looking for multi-class, that is, each sample is of only one class. BUT, what I had done with the forest was convert the class assignments into boolean arrays; that's how I ended up with the n x m array. So, if I understand you correctly, I should convert my class labels to integers and create a single n_samples-long array, where the values it can take map to the different class labels. Is that correct? Thanks for your help.
  • lejlot
    lejlot about 8 years
    Yes, and you should do the same with the trees; otherwise you fit a multilabel model instead (a minimal sketch follows these comments).
  • blue-sky
    blue-sky over 6 years
    @lejlot I've used scikit logistic regression for multi prediction of scalars. The answer by dukebody in stackoverflow.com/questions/36760000/… also works for me. Maybe I've misunderstood your response?
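
Putting lejlot's suggestion into code, here is a minimal sketch of the multiclass setup the question is after -- the one-hot truth array is collapsed to a 1-D vector of class indices before fitting (the toy data below is made up; the shapes mirror those in the question):

    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV

    # Toy data: 50 samples, 4 features, 5 classes (10 samples per class)
    rng = np.random.RandomState(0)
    train_features = rng.rand(50, 4)                        # (n_samples, n_features)
    truth_onehot = np.eye(5)[np.repeat(np.arange(5), 10)]   # (n_samples, m_classes)

    # Collapse the one-hot array to a 1-D vector of class indices, shape (n_samples,)
    truth_train = truth_onehot.argmax(axis=1)

    clf = LogisticRegressionCV(class_weight='balanced', multi_class='multinomial', solver='lbfgs')
    clf.fit(train_features, truth_train)
    pred = clf.predict(train_features)                      # 1-D array of class indices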