Using cosine distance with scikit-learn's KNeighborsClassifier


Solution 1

The cosine similarity is generally defined as x^T y / (||x|| ||y||); it outputs 1 if the two vectors point in the same direction and goes to -1 if they point in completely opposite directions. This is not technically a metric, so you can't use accelerating structures like ball trees and k-d trees with it. If you force scikit-learn to use the brute-force approach, you should be able to use it as a distance by passing your own custom distance callable, as in the sketch below. There are also methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
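
A minimal sketch of that brute-force route (the dataset here is made up for illustration): with algorithm="brute", scikit-learn accepts a callable metric that takes two 1-D arrays and returns a scalar distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 1 - cosine similarity; not a true metric, hence the brute-force search.
def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
y = np.array([0, 1, 1, 0])

clf = KNeighborsClassifier(n_neighbors=3, algorithm="brute",
                           metric=cosine_distance)
clf.fit(X, y)
print(clf.predict([[0.9, 0.1]]))  # classified by its 3 nearest cosine neighbours
```

A callable metric is evaluated pair by pair in Python, so expect it to be much slower than the built-in metrics on large datasets.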

Notice, though, that x^T y / (||x|| ||y||) = (x/||x||)^T (y/||y||), and that the Euclidean distance can be equivalently written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every data point before giving it to the KNeighborsClassifier, then x^T x = 1 for all x, so the Euclidean distance reduces to sqrt(2 - 2 x^T y). For identical inputs we get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*(-1)) = 2. Since sqrt(2 - 2s) is a monotonically decreasing function of the similarity s, you get the same neighbour ordering as the cosine distance by normalizing your data and then using the Euclidean distance. As long as you use the uniform-weights option, the results will be identical to having used a true cosine distance.
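
A minimal sketch of that normalization trick, on synthetic data: normalize (from sklearn.preprocessing) divides each row by its L2 norm, after which plain Euclidean KNN reproduces the cosine ordering.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Divide every row by its L2 norm; Euclidean distance on the result
# equals sqrt(2 - 2*cos(x, y)), a monotone transform of cosine distance.
X_unit = normalize(X, norm="l2")

clf = KNeighborsClassifier(n_neighbors=5, weights="uniform")
clf.fit(X_unit, y)

# Queries must be normalized the same way before predicting.
X_test = normalize(rng.normal(size=(10, 5)), norm="l2")
print(clf.predict(X_test))
```

Note that distance-weighted voting (weights="distance") would no longer match exactly, since sqrt(2 - 2s) rescales the distances nonlinearly; the equivalence holds for the neighbour ordering and uniform weights.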

Solution 2

The KNN family of class constructors has a parameter called metric, which lets you switch between the different distance metrics used by the nearest-neighbour model. A list of the available distance metrics can be found in the DistanceMetric documentation; a short sketch of this follows.
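
In recent scikit-learn versions, metric="cosine" is itself accepted when the brute-force algorithm is used, since it is then computed by the pairwise-distance routines rather than a tree index. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [2.0, 4.0], [-1.0, 0.5], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# "cosine" works with brute force only; ball trees and k-d trees
# require a true metric and will reject it.
clf = KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute")
clf.fit(X, y)
print(clf.predict([[2.0, 3.0]]))
```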

If you want to use a cosine metric for a ranking or classification problem, you can use the Euclidean (L2) distance on L2-normalized feature vectors, which gives you the same ranking/classification results (for predictions made by argmax or argmin operations).



Author: Novice

Updated on March 15, 2021

Comments

  • Novice
    Novice about 3 years

    Is it possible to use something like 1 - cosine similarity with scikit-learn's KNeighborsClassifier?

    This answer says no, but the documentation for KNeighborsClassifier says that the metrics listed in DistanceMetric are available. Those metrics don't include an explicit cosine distance, probably because it's not really a distance, but supposedly it's possible to pass a function as the metric. I tried passing scikit-learn's linear kernel into KNeighborsClassifier, but it gives me an error saying the function needs two arrays as arguments. Has anyone else tried this?

    • Artem Sobolev
      Artem Sobolev over 8 years
      The KNN classifier is likely to use optimizations which rely on the triangle inequality. The cosine metric doesn't obey it, so correct behaviour of KNN is not guaranteed.
    • Raff.Edward
      Raff.Edward over 8 years
      @Barmaley.exe the cosine metric can be made to obey the triangle inequality, and the same ordering can be achieved without an explicit implementation anyway. See my answer for details
  • Itachi
    Itachi almost 5 years
    It was too simple. Thanks Raff for this elegant solution :)
  • oulenz
    oulenz about 4 years
    As already observed in the question, that list does not mention the cosine distance.
  • Eoin Ó Coinnigh
    Eoin Ó Coinnigh about 2 years
    Will this work if the values are positive and negative? Like do you have to change the normalisation?
  • Raff.Edward
    Raff.Edward about 2 years
    This will work for any values, positive or negative. You just need to divide each vector by its 2-norm.