Clustering geo location coordinates (lat,long pairs) using KMeans algorithm with Python

Solution 1

K-means is not a good algorithm for spatial clustering, for the reason you mentioned: it relies on Euclidean distance rather than great-circle (haversine) distance. Instead, you could do this clustering job using scikit-learn's DBSCAN with the haversine metric and the ball-tree algorithm.

This tutorial demonstrates clustering latitude-longitude spatial data with DBSCAN/haversine and avoids all those Euclidean-distance problems:

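import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
df = pd.read_csv('gps.csv')
coords = df[['lat', 'lon']].to_numpy()  # replaces the removed df.as_matrix()
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))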

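Note that this specifically uses scikit-learn v0.15, as some earlier and later versions seem to require a full distance matrix to be computed. Also notice that eps is expressed in radians and that .fit() takes the coordinates in radians, ordered as (latitude, longitude), for the haversine metric. To express eps as a ground distance, divide that distance by the Earth's radius in the same unit (about 6371 km).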
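As a rough illustration of the radian conversion described above, here is a minimal sketch on a handful of made-up points. The sample coordinates, the 1.5 km radius, min_samples=3 and the Earth-radius constant are all assumptions chosen for the example, not values from the tutorial.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

# Hypothetical (lat, lon) points in degrees: two tight groups and one isolated point.
coords = np.array([
    [48.8566, 2.3522],
    [48.8570, 2.3530],
    [48.8560, 2.3510],
    [40.7128, -74.0060],
    [40.7130, -74.0070],
    [40.7125, -74.0050],
    [-33.8688, 151.2093],
])

eps_km = 1.5                     # assumed neighbourhood radius in kilometres
eps = eps_km / EARTH_RADIUS_KM   # haversine distances are angles in radians

db = DBSCAN(eps=eps, min_samples=3, algorithm='ball_tree', metric='haversine')
labels = db.fit_predict(np.radians(coords))  # input must be (lat, lon) in radians

print(labels)  # points labelled -1 are treated as noise
```

With these assumed settings the two tight groups should each form a cluster and the isolated point should be marked as noise; real data will need eps and min_samples tuned to its density.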

Solution 2

Whether k-means is acceptable depends heavily on your application:

  • Around the equator the results should be fairly accurate. Close to one of the poles the results won't be useful at all.
  • It might, however, work as a pre-processing step or for applications with low precision requirements, e.g. small, non-overlapping and very distinct clusters.
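To see why accuracy degrades away from the equator, the following sketch compares the great-circle length of one degree of longitude at several latitudes with the Euclidean distance computed directly on the degree values. The helper names and the Earth-radius constant are assumptions made for this illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def euclidean_deg(lat1, lon1, lat2, lon2):
    """Plain Euclidean distance on raw degree values, as k-means would see it."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

# One degree of longitude at increasing latitudes.
for lat in (0.0, 60.0, 85.0):
    km = haversine_km(lat, 0.0, lat, 1.0)
    deg = euclidean_deg(lat, 0.0, lat, 1.0)
    print(f"lat={lat:5.1f}  haversine={km:7.2f} km  euclidean={deg:.1f} deg")
```

The Euclidean value stays at 1.0 for every latitude, while the real distance is about 111 km at the equator and about half that at 60°, shrinking further toward the poles.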

If you really need the Haversine formula, you might want to look into this discussion. As Anony-Mousse says:

Note that Haversine distance is not appropriate for k-means or average-linkage clustering, unless you find a smart way of computing the mean that minimizes variance. Do not use the arithmetic average if you have the -180/+180 wrap-around of latitude-longitude coordinates.
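The wrap-around problem can be shown with two points on either side of the antimeridian. The sketch below contrasts the arithmetic mean of the longitudes with a mean taken over 3D unit vectors; the function name spherical_mean is hypothetical, and this is only one common approximation, not necessarily the variance-minimizing mean the quote refers to.

```python
import numpy as np

def spherical_mean(lat_deg, lon_deg):
    """Average points as 3D unit vectors, then project the result back to (lat, lon) degrees."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    mx, my, mz = x.mean(), y.mean(), z.mean()
    mean_lon = np.degrees(np.arctan2(my, mx))
    mean_lat = np.degrees(np.arctan2(mz, np.hypot(mx, my)))
    return mean_lat, mean_lon

# Two hypothetical points two degrees apart, straddling the +/-180 line.
lats = [10.0, 10.0]
lons = [179.0, -179.0]

naive_lon = float(np.mean(lons))            # lands on the opposite side of the globe
mean_lat, mean_lon = spherical_mean(lats, lons)

print("arithmetic mean longitude:", naive_lon)
print("vector mean (lat, lon):", mean_lat, mean_lon)
```

The arithmetic mean places the centre at longitude 0, half a world away from both points, whereas the vector mean stays near the antimeridian. The vector mean is undefined when the averaged vector is close to zero length, for example for points spread evenly around the globe.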

Author by

rok

Updated on July 09, 2022

Comments

  • rok almost 2 years

    Using the following code to cluster geolocation coordinates results in 3 clusters:

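        import numpy as np
        import matplotlib.pyplot as plt
        from scipy.cluster.vq import kmeans2, whiten

        # Placeholder rows: substitute real (lat, long) values.
        coordinates = np.array([
            [lat, long],
            [lat, long],
            ...
            [lat, long]
        ])
        # whiten() rescales each column to unit variance before clustering.
        centroids, labels = kmeans2(whiten(coordinates), 3, iter=20)
        plt.scatter(coordinates[:, 0], coordinates[:, 1], c=labels)
        plt.show()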
    

    Is it right to use k-means for location clustering, given that it uses Euclidean distance rather than the haversine formula as its distance function?

  • There almost 2 years
    .as_matrix() has been deprecated since pandas 0.23.0 and was removed in pandas 1.0. Use df.values or df.to_numpy() instead.