Confusion matrix for Clustering in scikit-learn

Solution 1

I wrote the code myself.

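# Compute a confusion (contingency) matrix for clustering:
# rows are actual labels, columns are cluster ids.
def confusion_matrix(act_labels, pred_labels):
    unique_labels = sorted(set(act_labels))
    clusters = sorted(set(pred_labels))
    # Map each label / cluster id to its row / column index, so cluster ids
    # do not have to be the integers 0..k-1.
    row_of = {label: i for i, label in enumerate(unique_labels)}
    col_of = {cluster: j for j, cluster in enumerate(clusters)}
    cm = [[0] * len(clusters) for _ in unique_labels]
    for act_label, pred_label in zip(act_labels, pred_labels):
        cm[row_of[act_label]][col_of[pred_label]] += 1
    return cm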

# Example
labels=['a','b','c',
        'a','b','c',
        'a','b','c',
        'a','b','c']
pred=[  1,1,2,
        0,1,2,
        1,1,1,
        0,1,2]
cnf_matrix = confusion_matrix(labels, pred)
print('\n'.join([''.join(['{:4}'.format(item) for item in row])
      for row in cnf_matrix]))

Edit: (Dayyyuumm) I just found that I could do this easily with pandas crosstab :-/.

import pandas as pd

labels=['a','b','c',
        'a','b','c',
        'a','b','c',
        'a','b','c']
pred=[  1,1,2,
        0,1,2,
        1,1,1,
        0,1,2]   

# Create a DataFrame with labels and clusters as columns: df
df = pd.DataFrame({'Labels': labels, 'Clusters': pred})

# Create crosstab: ct
ct = pd.crosstab(df['Labels'], df['Clusters'])

# Display ct
print(ct)

Solution 2

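You can easily compute a pairwise intersection matrix: entry (i, j) counts the points that belong to both actual class i and cluster j.

You may need to do this yourself, though, since sklearn's confusion_matrix is built for the classification use case, where rows and columns share the same set of labels.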
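
As a rough illustration of the pairwise intersection idea, here is a minimal pure-Python sketch. The function name pairwise_intersection_matrix is made up for this example and is not part of sklearn; it reuses the labels and pred lists from Solution 1 and sorts the labels and cluster ids only to get a stable row and column order.

```python
def pairwise_intersection_matrix(true_labels, cluster_ids):
    """Hypothetical helper: matrix[i][j] is the number of points that are
    in actual class i and in cluster j (the size of their intersection)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Collect the point indices that belong to each class and each cluster.
    class_members = {c: set() for c in classes}
    cluster_members = {k: set() for k in clusters}
    for idx, (c, k) in enumerate(zip(true_labels, cluster_ids)):
        class_members[c].add(idx)
        cluster_members[k].add(idx)
    # Intersect every class with every cluster and count the shared points.
    matrix = [[len(class_members[c] & cluster_members[k]) for k in clusters]
              for c in classes]
    return classes, clusters, matrix


labels = ['a', 'b', 'c'] * 4
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]
classes, clusters, matrix = pairwise_intersection_matrix(labels, pred)
for c, row in zip(classes, matrix):
    print(c, row)
# a [2, 2, 0]
# b [0, 4, 0]
# c [0, 1, 3]
```

Because the class names and the cluster ids are indexed separately, they can be of different types (strings and integers here). scikit-learn also provides sklearn.metrics.cluster.contingency_matrix, which builds the same kind of table, so it is worth checking before rolling your own.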

Author by

Bee

Updated on June 04, 2022

Comments

  • Bee
    Bee almost 2 years

    I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.

    I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.

    However, it can't be used for clustering as it expected both columns and rows to have the same set of labels, which makes sense for a classification problem. But for a clustering problem what I expect is something like this.

    Rows - Actual labels

    Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)

    Is there a way to do this?

    Edit: Here are more details.

    In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.

    That's why it gives a matrix which has the same labels for both rows and columns like this.

    [image: confusion matrix plot]

    But in my case (KMeans clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).

    Therefore, if I call confusion_matrix(y_true, y_pred), it gives the error below.

    ValueError: Mix of label input types (string and number)
    

    This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.

    With this, I understand that I'm trying to use a tool meant for classification problems on a clustering problem. So my question is: is there a way I can get such a matrix for my clustered data?

    Hope the question is now clearer. Please let me know if it isn't.