Faster kNN algorithm in Python

Solution 1

Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and it also uses nested for-loops within Python generator expressions, which add significant computational overhead compared to optimized code.
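
To see the gap concretely, one rough way is to time scikit-learn's NearestNeighbors with algorithm='brute' against algorithm='kd_tree' on the same data (a sketch with arbitrary data sizes; absolute timings depend on your machine, n, d and k):

import numpy as np
from time import time
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(20000, 2)

for algorithm in ('brute', 'kd_tree'):
    nn = NearestNeighbors(n_neighbors=3, algorithm=algorithm).fit(X)
    start = time()
    nn.kneighbors(X)  # query every training point against the fitted index
    print(algorithm, time() - start)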

If you'd like to compute weighted k-neighbors classification using a fast O[N log(N)] implementation, you can use sklearn.neighbors.KNeighborsClassifier with the weighted Minkowski metric ('wminkowski'), setting p=2 (for Euclidean distance) and setting w to your desired weights. For example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
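
Note that newer scikit-learn releases deprecated and eventually removed the 'wminkowski' name; the documented replacement is metric='minkowski' with the weight vector passed as w in metric_params. A version-independent alternative is to fold the weights into the data itself, since sum(w * (x - y)**2) == sum((sqrt(w)*x - sqrt(w)*y)**2). A minimal sketch reusing the variable names above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

scale = np.sqrt(weights)            # fold the feature weights into the data
model = KNeighborsClassifier()      # default metric is minkowski with p=2, i.e. Euclidean
model.fit(X_train * scale, y_train)
y_predicted = model.predict(X_test * scale)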

Solution 2

You can take a look at this great article introducing faiss:
Make kNN 300 times faster than Scikit-learn’s in 20 lines!
It is developed in C++ behind the scenes and can also run on GPU.

import numpy as np
import faiss


class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        # IndexFlatL2 does an exact search on (squared) L2 distance,
        # but with heavily optimized, batched distance computations.
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))  # faiss expects float32 vectors
        self.y = y

    def predict(self, X):
        # Indices of the k nearest training vectors for every query row.
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        # Majority vote over the neighbours' labels
        # (labels must be non-negative integers for np.bincount).
        votes = self.y[indices]
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
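
A minimal usage sketch (X_train, y_train, X_test and weights are assumed from the question; labels must be non-negative integers for the bincount vote). Since IndexFlatL2 measures plain, unweighted L2 distance, the feature weights can be folded into the data by scaling with sqrt(weights), exactly as with the scikit-learn approach above:

scale = np.sqrt(weights).astype(np.float32)

model = FaissKNeighbors(k=3)
model.fit(X_train * scale, y_train)
y_pred = model.predict(X_test * scale)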

Solution 3

Modifying your class to use a BallTree data structure (build time O(n (log n)^2); see https://arxiv.org/ftp/arxiv/papers/1210/1210.6122.pdf) together with a custom DistanceMetric (callable functions in the metric parameter are NOT supported for KDTree, as noted in https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html), you can use the following code, which also removes the explicit loop over predictions:

import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors import DistanceMetric
from scipy.stats import mode

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.tree = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights
        self.tree = BallTree(X_train,
                             metric=DistanceMetric.get_metric('wminkowski', p=2, w=weights))

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """
        indexes = self.tree.query(testing_data, self.k, return_distance=False)
        y_answers = self.y_train[indexes]
        self.predictions = np.apply_along_axis(lambda x: mode(x)[0], 1, y_answers)
        return self.predictions

Training:

from time import time
n, d = 10000, 2
begin = time()
cls = GlobalWeightedKNN()
X_train = np.random.rand(n,d)
y_train = np.random.choice(2,n, replace=True)
cls.fit(X_train, y_train, k=3, weights=np.random.rand(d))
end = time()
print('time taken to train {} instances = {} s'.format(n, end - begin))
# time taken to train 10000 instances = 0.01998615264892578 s

Testing / prediction:

begin = time()
X_test = np.random.rand(n,d)
cls.predict(X_test)
end = time()
print('time taken to predict {} instances  = {} s'.format(n, end - begin))
#time taken to predict 10000 instances  = 3.732935905456543 s
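
Depending on the data, a large share of that prediction time can be spent in the per-row scipy.stats.mode call inside np.apply_along_axis rather than in the BallTree query itself. If that turns out to be the bottleneck for you, one possible replacement is a bincount-style majority vote over the (n_queries, k) label array y_answers (a sketch, not part of the original class; assumes integer class labels 0..n_classes-1):

import numpy as np

def majority_vote(y_answers, n_classes):
    # counts[i, c] = how many of the k neighbours of query i belong to class c
    counts = np.zeros((len(y_answers), n_classes), dtype=np.int64)
    np.add.at(counts, (np.arange(len(y_answers))[:, None], y_answers), 1)
    return counts.argmax(axis=1)

# inside predict(), instead of the apply_along_axis line:
# self.predictions = majority_vote(y_answers, n_classes=len(np.unique(self.y_train)))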

Comments

  • Eoin Ó Coinnigh almost 2 years

    I want to code my own kNN algorithm from scratch; the reason is that I need to weight the features. The problem is that my program is still really slow despite removing for loops and using built-in NumPy functionality.

    Can anyone suggest a way to speed this up? I don't use np.sqrt for the L2 distance because it's unnecessary and actually slows everything down quite a bit.

    import numpy as np

    class GlobalWeightedKNN:
        """
        A k-NN classifier with feature weights
    
        Returns: predictions of k-NN.
        """
    
        def __init__(self):
            self.X_train = None
            self.y_train = None
            self.k = None
            self.weights = None
            self.predictions = list()
    
        def fit(self, X_train, y_train, k, weights):        
            self.X_train = X_train
            self.y_train = y_train
            self.k = k
            self.weights = weights
    
        def predict(self, testing_data):
            """
            Takes a 2d array of query cases.
    
            Returns a list of predictions for k-NN classifier
            """
    
            np.fromiter((self.__helper(qc) for qc in testing_data), float)  # consumed for its side effect of filling self.predictions
            return self.predictions
    
    
        def __helper(self, qc):
            neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
            neighbours = np.array([neighbours]).T 
            indexes = np.array([range(len(self.X_train))]).T
            neighbours = np.append(indexes, neighbours, axis=1)
    
            # Sort by second column - distances
            neighbours = neighbours[neighbours[:,1].argsort()]  
            k_cases = neighbours[ :self.k]
            indexes = [x[0] for x in k_cases]
    
            y_answers = [self.y_train[int(x)] for x in indexes]
            answer = max(set(y_answers), key=y_answers.count)  # get most common value
            self.predictions.append(answer)
            return answer  # return the label so np.fromiter() in predict() gets a float-convertible value
    
    
        def __weighted_euclidean(self, qc, other):
            """
            Custom weighted euclidean distance
    
            returns: floating point number
            """
    
            return np.sum( ((qc - other)**2) * self.weights )
    
  • Eoin Ó Coinnigh almost 6 years
    Thanks a lot, I had no idea that there was this option in sklearn. I just tested it with the "brute" algorithm (my approach) and interestingly it actually takes 4782sec instead of 2079sec with mine. However the KD Tree is incredibly fast, for sure when I'm using extremely big datasets I will default to this option rather than using my implementation, even though it's not guaranteed to find the nearest neighbours, it's damn close. Thanks!
  • jakevdp almost 6 years
    The KD-tree/Ball tree is an exact algorithm, so it is guaranteed to find the nearest neighbors.
  • Eoin Ó Coinnigh almost 6 years
    Forgive me but I don't think that's true is it? Otherwise what's even the point of having the option of a 'brute' algorithm in sklearn? Everywhere I read tells me that kd_tree and ball_tree are not guaranteed to find the nearest neighbours, but pretty close ones.
  • jakevdp over 3 years
    Late reply with more information: the kdtree and ball_tree are indeed guaranteed to find the exact nearest neighbors. The reason "brute" exists is for two reasons: (1) brute force is faster for small datasets, and (2) it's a simpler algorithm and therefore useful for testing. You can confirm that the algorithms are directly compared to each other in the sklearn unit tests.
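
A rough sketch to check this directly (with continuous random data, exact distance ties are essentially impossible, so the two algorithms should return identical neighbour indices):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(5000, 3)

idx_brute = NearestNeighbors(n_neighbors=5, algorithm='brute').fit(X).kneighbors(X, return_distance=False)
idx_tree = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X).kneighbors(X, return_distance=False)

print(np.array_equal(idx_brute, idx_tree))  # expected: True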