How to best implement K-nearest neighbours in C# for a large number of dimensions?


Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending its time, method-by-method or even line-by-line.

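Even before reaching for a full profiler, a Stopwatch around a suspect section gives a quick first read on where the time goes. A minimal sketch (`TimeMilliseconds` is an illustrative helper of my own, not a library API):

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Times an arbitrary block of code; a crude stand-in for a real
    // profiler, but enough to confirm which phase dominates the runtime.
    public static long TimeMilliseconds(Action work)
    {
        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Usage would look like `var ms = TimingSketch.TimeMilliseconds(() => RunKnn());` for each candidate hotspot.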
That being said, a few things come to mind in reading your implementation:

  1. You are parallelizing the inner loop, so the "Parallel.For" delegate is invoked once per training sample for every test sample. There is a small but nonzero cost associated with each delegate call (see here or here), and at that granularity it adds up. Could you parallelize the outer loop over test samples instead?

  2. Similarly, there is a small performance penalty for indexing through an array via its IList interface, since each element access goes through an interface call. You might consider declaring the array arguments to "GetDistance()" explicitly as double[].

  3. How large is K compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by keeping a SortedSet of the K best candidates and removing the largest element whenever the set size exceeds K.
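To illustrate points 1 and 2, here is one way the loops could be restructured: parallelize over test samples, and pass concrete double[] arrays so no element access goes through IList. This is a sketch rather than a drop-in replacement, and the `KnnSketch`, `Classify`, and `MajorityOfNearest` names are my own invention:

```csharp
using System;
using System.Threading.Tasks;

static class KnnSketch
{
    // Squared Euclidean distance over concrete double[] arguments,
    // avoiding the IList<double> interface-dispatch overhead.
    static double GetDistance(double[] a, double[] b)
    {
        var sum = 0.0;
        for (var i = 0; i < a.Length; i++)
        {
            var d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Parallelize over test samples (the outer loop), so each delegate
    // invocation does a full scan of the training set rather than a
    // single distance computation.
    public static int[] Classify(double[][] train, double[][] test, int[] labels, int k)
    {
        var results = new int[test.Length];
        Parallel.For(0, test.Length, tst =>
        {
            // Each iteration uses only local state, so no locking is needed.
            var distances = new double[train.Length];
            for (var trn = 0; trn < train.Length; trn++)
                distances[trn] = GetDistance(test[tst], train[trn]);
            results[tst] = MajorityOfNearest(distances, labels, k);
        });
        return results;
    }

    // Naive full sort of the top K for illustration; a partial
    // selection (point 3) would avoid sorting the whole array.
    static int MajorityOfNearest(double[] distances, int[] labels, int k)
    {
        var idx = new int[distances.Length];
        for (var i = 0; i < idx.Length; i++) idx[i] = i;
        // Sort the index array by distance (keys are a clone, so the
        // caller's distances array is left untouched).
        Array.Sort((double[])distances.Clone(), idx);
        var yea = 0;
        for (var i = 0; i < k; i++)
            if (labels[idx[i]] == 1) yea++;
        return yea * 2 > k ? 1 : 0;  // majority vote, same as yea > nay
    }
}
```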
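The selection idea in point 3 might look something like the following. Note that SortedSet rejects duplicates, so the comparer breaks distance ties by index to keep equidistant points; `TopKSelector` and `NearestK` are illustrative names of my own:

```csharp
using System;
using System.Collections.Generic;

static class TopKSelector
{
    // Keeps only the K smallest (distance, index) pairs seen so far,
    // in O(n log K) instead of a full O(n log n) sort.
    public static List<int> NearestK(double[] distances, int k)
    {
        // Tie-break on index, because SortedSet silently discards
        // elements that compare equal.
        var best = new SortedSet<(double Dist, int Index)>(
            Comparer<(double Dist, int Index)>.Create((a, b) =>
                a.Dist != b.Dist ? a.Dist.CompareTo(b.Dist)
                                 : a.Index.CompareTo(b.Index)));

        for (var i = 0; i < distances.Length; i++)
        {
            best.Add((distances[i], i));
            if (best.Count > k)
                best.Remove(best.Max);  // evict the current worst of the K+1
        }

        var result = new List<int>(k);
        foreach (var pair in best)      // iterates in ascending distance order
            result.Add(pair.Index);
        return result;
    }
}
```

The returned indices can then be used to look up the class labels for the majority vote.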

Author: ubuntunoob

Updated on June 05, 2022

Comments

  • ubuntunoob
    ubuntunoob almost 2 years

    I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.

    There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :

// testSamples and trainSamples consist of about 20k vectors each, with 25 dimensions
    // trainClasses contains 0 or 1, signifying the corresponding class for each sample in trainSamples
    static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
    {
        Console.WriteLine("Performing KNN with K = "+K);
    
var testResults = new int[testSamples.Count]; 
    
        var testNumber = testSamples.Count;
        var trainNumber = trainSamples.Count;
        // Declaring these here so that I don't have to 'new' them over and over again in the main loop, 
        // just to save some overhead
        var distances = new double[trainNumber][]; 
        for (var i = 0; i < trainNumber; i++)
        {
           distances[i] = new double[2]; // Will store both distance and index in here
        }
    
        // Performing KNN ...
        for (var tst = 0; tst < testNumber; tst++)
        {
            // For every test sample, calculate distance from every training sample
            Parallel.For(0, trainNumber, trn =>
            {
                var dist = GetDistance(testSamples[tst], trainSamples[trn]);
                // Storing distance as well as index 
                distances[trn][0] = dist;
                distances[trn][1] = trn;
            });
    
            // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
            var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
    
            // Do a 'majority vote' to classify test sample
            var yea = 0.0;
            var nay = 0.0;
    
            foreach (var voter in votingDistances)
            {
                if (trainClasses[(int)voter[1]] == 1)  
                   yea++;
                else
                   nay++;
            }
            if (yea > nay)
                testResults[tst] = 1;
            else
                testResults[tst] = 0;
    
        }
    
        return testResults;
    }
    
    // Calculates and returns square of Euclidean distance between two vectors
    static double GetDistance(IList<double> sample1, IList<double> sample2)
    {
        var distance = 0.0;
        // assume sample1 and sample2 are valid i.e. same length 
    
        for (var i = 0; i < sample1.Count; i++)
        {   
            var temp = sample1[i] - sample2[i];
            distance += temp * temp;
        }
        return distance;
    }
    

This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?

    I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.

I also found this Stack Overflow discussion about the same problem, but it is 3 years old, and I was hoping that someone would know of better solutions by now.

    I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.

    Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.

  • ubuntunoob
    ubuntunoob almost 10 years
    Thanks for the suggestions @dbc, I did use the Visual Studio profiler. It showed me that 61% of the runtime is being spent in the GetDistance() function. I also tried changing the Parallel.For loop to include the code of the GetDistance() function instead of a call to the function, which saved me a few seconds at the cost of readability. K is 10, which is quite small, so I will try your other suggestions also.