Fast Haversine Approximation (Python/Pandas)

34,047

Solution 1

Here is a vectorized numpy version of the same function:

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

The inputs are all arrays of values, and it should be able to do millions of points instantly. The requirement is that the inputs are ndarrays but the columns of your pandas table will work.

For example, with randomly generated values:

>>> import numpy as np
>>> import pandas
>>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000)
>>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
>>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Or if you want to create another column:

>>> df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Looping through arrays of data is very slow in python. Numpy provides functions that operate on entire arrays of data, which lets you avoid looping and drastically improve performance.

This is an example of vectorization.

Solution 2

Purely for the sake of an illustrative example, I took the numpy version in the answer from @ballsdotballs and also made a companion C implementation to be called via ctypes. Since numpy is such a highly optimized tool, there is little chance that my C code will be as efficient, but it should be somewhat close. The big advantage here is that by running through an example with C types, it can help you see how you can connect up your own personal C functions to Python without too much overhead. This is especially nice when you just want to optimize a small piece of a bigger computation by writing that small piece in some C source rather than Python. Simply using numpy will solve the problem most of the time, but for those cases when you don't really need all of numpy and you don't want to add the coupling to require use of numpy data types throughout some code, it's very handy to know how to drop down to the built-in ctypes library and do it yourself.

First let's create our C source file, called haversine.c:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int haversine(size_t n, 
              double *lon1, 
              double *lat1, 
              double *lon2, 
              double *lat2,
              double *kms){

    if (   lon1 == NULL 
        || lon2 == NULL 
        || lat1 == NULL 
        || lat2 == NULL
        || kms == NULL){
        return -1;
    }

    double km, dlon, dlat;
    double iter_lon1, iter_lon2, iter_lat1, iter_lat2;

    double km_conversion = 2.0 * 6367.0; 
    double degrees2radians = 3.14159/180.0;

    int i;
    for(i=0; i < n; i++){
        iter_lon1 = lon1[i] * degrees2radians;
        iter_lat1 = lat1[i] * degrees2radians;
        iter_lon2 = lon2[i] * degrees2radians;
        iter_lat2 = lat2[i] * degrees2radians;

        dlon = iter_lon2 - iter_lon1;
        dlat = iter_lat2 - iter_lat1;

        km = pow(sin(dlat/2.0), 2.0) 
           + cos(iter_lat1) * cos(iter_lat2) * pow(sin(dlon/2.0), 2.0);

        kms[i] = km_conversion * asin(sqrt(km));
    }

    return 0;
}

// main function for testing
int main(void) {
    double lat1[2] = {16.8, 27.4};
    double lon1[2] = {8.44, 1.23};
    double lat2[2] = {33.5, 20.07};
    double lon2[2] = {14.88, 3.05};
    double kms[2]  = {0.0, 0.0};
    size_t arr_size = 2;

    int res;
    res = haversine(arr_size, lon1, lat1, lon2, lat2, kms);
    printf("%d\n", res);

    int i;
    for (i=0; i < arr_size; i++){
        printf("%3.3f, ", kms[i]);
    }
    printf("\n");
}

Note that we're trying to keep with C conventions. Explicitly passing data arguments by reference, using size_t for a size variable, and expecting our haversine function to work by mutating one of the passed inputs such that it will contain the expected data on exit. The function actually returns an integer, which is a success/failure flag that could be used by other C-level consumers of the function.

We're going to need to find a way to handle all of these little C-specific issues inside of Python.

Next let's put our numpy version of the function along with some imports and some test data into a file called haversine.py:

import time
import ctypes
import numpy as np
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (np.sin(dlat/2)**2 
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

if __name__ == "__main__":
    lat1 = 50.0 * np.random.rand(1000000)
    lon1 = 50.0 * np.random.rand(1000000)
    lat2 = 50.0 * np.random.rand(1000000)
    lon2 = 50.0 * np.random.rand(1000000)

    t0 = time.time()
    r1 = haversine(lon1, lat1, lon2, lat2)
    t1 = time.time()
    print t1-t0, r1

I chose to make lats and lons (in degrees) that are randomly chosen between 0 and 50, but it doesn't matter too much for this explanation.

The next thing we need to do is to compile our C module in such a way that it can be dynamically loaded by Python. I'm using a Linux system (you can find examples for other systems very easily on Google), so my goal is to compile haversine.c into a shared object, like so:

gcc -shared -o haversine.so -fPIC haversine.c -lm

We can also compile to an executable and run it to see what the C program's main function displays:

> gcc haversine.c -o haversine -lm
> ./haversine
0
1964.322, 835.278, 

Now that we have compiled the shared object haversine.so, we can use ctypes to load it in Python and we need to supply the path to the file to do so:

lib_path = "/path/to/haversine.so" # Obviously use your real path here.
haversine_lib = ctypes.CDLL(lib_path)

Now haversine_lib.haversine acts pretty much just like a Python function, except that we might need to do some manual type marshaling to make sure the inputs and outputs are interpreted correctly.

numpy actually provides some nice tools for this and the one I'll use here is numpy.ctypeslib. We're going to build a pointer type that will allow us to pass around numpy.ndarrays to these ctypes-loaded functions as through they were pointers. Here's the code:

arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, 
                                       ndim=1, 
                                       flags='CONTIGUOUS')

haversine_lib.haversine.restype = ctypes.c_int
haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                    arr_1d_double, 
                                    arr_1d_double,
                                    arr_1d_double,
                                    arr_1d_double,
                                    arr_1d_double] 

Notice that we tell the haversine_lib.haversine function proxy to interpret its arguments according to the types we want.

Now, to test it out from Python what remains is to just make a size variable, and an array that will be mutated (just like in the C code) to contain the result data, then we can call it:

size = len(lat1)
output = np.empty(size, dtype=np.double)
print "====="
print output
t2 = time.time()
res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
t3 = time.time()
print t3 - t2, res
print type(output), output

Putting it all together in the __main__ block of haversine.py, the whole file now looks like this:

import time
import ctypes
import numpy as np
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (np.sin(dlat/2)**2 
         + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
    c = 2 * np.arcsin(np.sqrt(a)) 
    km = 6367 * c
    return km

if __name__ == "__main__":
    lat1 = 50.0 * np.random.rand(1000000)
    lon1 = 50.0 * np.random.rand(1000000)
    lat2 = 50.0 * np.random.rand(1000000)
    lon2 = 50.0 * np.random.rand(1000000)

    t0 = time.time()
    r1 = haversine(lon1, lat1, lon2, lat2)
    t1 = time.time()
    print t1-t0, r1

    lib_path = "/home/ely/programming/python/numpy_ctypes/haversine.so"
    haversine_lib = ctypes.CDLL(lib_path)
    arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, 
                                           ndim=1, 
                                           flags='CONTIGUOUS')

    haversine_lib.haversine.restype = ctypes.c_int
    haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                        arr_1d_double, 
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double]

    size = len(lat1)
    output = np.empty(size, dtype=np.double)
    print "====="
    print output
    t2 = time.time()
    res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
    t3 = time.time()
    print t3 - t2, res
    print type(output), output

To run it, which will run and time the Python and ctypes versions separately and print some results, we can just do

python haversine.py

which displays:

0.111340045929 [  231.53695005  3042.84915093   169.5158946  ...,  1359.2656769
  2686.87895954  3728.54788207]
=====
[  6.92017600e-310   2.97780954e-316   2.97780954e-316 ...,
   3.20676686e-001   1.31978329e-001   5.15819721e-001]
0.148446083069 0
<type 'numpy.ndarray'> [  231.53675618  3042.84723579   169.51575588 ...,  1359.26453029
  2686.87709456  3728.54493339]

As expected, the numpy version is slightly faster (0.11 seconds for vectors with length of 1 million) but our quick and dirty ctypes version is no slouch: a respectable 0.148 seconds on the same data.

Let's compare this to a naive for-loop solution in Python:

from math import radians, cos, sin, asin, sqrt

def slow_haversine(lon1, lat1, lon2, lat2):
    n = len(lon1)
    kms = np.empty(n, dtype=np.double)
    for i in range(n):
       lon1_v, lat1_v, lon2_v, lat2_v = map(
           radians, 
           [lon1[i], lat1[i], lon2[i], lat2[i]]
       )

       dlon = lon2_v - lon1_v 
       dlat = lat2_v - lat1_v 
       a = (sin(dlat/2)**2 
            + cos(lat1_v) * cos(lat2_v) * sin(dlon/2)**2)
       c = 2 * asin(sqrt(a)) 
       kms[i] = 6367 * c
    return kms

When I put this into the same Python file as the others and time it on the same million-element data, I consistently see a time of around 2.65 seconds on my machine.

So by quickly switching to ctypes we improve the speed by a factor of about 18. For many calculations that can benefit from access to bare, contiguous data, you often see gains much higher even than this.

Just to be super clear, I am not at all endorsing this as a better option than just using numpy. This is precisely the problem that numpy was built to solve, and so homebrewing your own ctypes code whenever it both (a) makes sense to incorporate numpy data types in your application and (b) there exists an easy way to map your code into a numpy equivalent, is not very efficient.

But it's still very helpful to know how to do this for those occasions when you prefer to write something in C yet call it in Python, or situations where a dependence on numpy is not practical (in an embedded system where numpy cannot be installed, for example).

Solution 3

In case that using scikit-learn is allowed, I would give the following a chance:

from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('haversine')

# example data
lat1, lon1 = 36.4256345, -5.1510261
lat2, lon2 = 40.4165, -3.7026
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

X = [[lat1, lon1],
     [lat2, lon2]]
kms = 6367
print(kms * dist.pairwise(X))

Solution 4

A trivial extension to @derricw's vectorised solution, you can use numba to improve performance by ~2x with virtually no change to your code. For pure numerical calculations, this should probably be used for benchmarking / testing versus possibly more efficient solutions.

from numba import njit

@njit
def haversine_nb(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians(lon1), np.radians(lat1), np.radians(lon2), np.radians(lat2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

Benchmarking versus the Pandas function:

%timeit haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
# 1 loop, best of 3: 1.81 s per loop

%timeit haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)
# 1 loop, best of 3: 921 ms per loop

Full benchmarking code:

import pandas as pd, numpy as np
from numba import njit

def haversine_pd(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

@njit
def haversine_nb(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians(lon1), np.radians(lat1), np.radians(lon2), np.radians(lat2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

np.random.seed(0)
lon1, lon2, lat1, lat2 = np.random.randn(4, 10**7)
df = pd.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
km = haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
km_nb = haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)

assert np.isclose(km.values, km_nb).all()

%timeit haversine_pd(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
# 1 loop, best of 3: 1.81 s per loop

%timeit haversine_nb(df['lon1'].values, df['lat1'].values, df['lon2'].values, df['lat2'].values)
# 1 loop, best of 3: 921 ms per loop

Solution 5

The vectorized function specifies that "All args must be of equal length". By extending the bounds of the "larger" dataset, according to this, one can efficiently find the distance all i,j pairs of elements.

from random import uniform
import numpy as np

def new_haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1[:,None]

    dlat = lat2 - lat1[:,None]

    a = np.sin(dlat/2.0)**2 + np.cos(lat1[:,None]) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

lon1 = [uniform(-180,180) for n in range(6)]
lat1 = [uniform(-90, 90) for n in range(6)]
lon2 = [uniform(-180,180) for n in range(4)]
lat2 = [uniform(-90, 90) for n in range(4)]

new = new_haversine_np(lon1, lat1, lon2, lat2)

for i in range(6):
    for j in range(4):
        print(i,j,round(new[i,j],2))
Share:
34,047

Related videos on Youtube

Nyxynyx
Author by

Nyxynyx

Hello :) I have no formal education in programming :( And I need your help! :D These days its web development: Node.js Meteor.js Python PHP Laravel Javascript / jQuery d3.js MySQL PostgreSQL MongoDB PostGIS

Updated on May 02, 2022

Comments

  • Nyxynyx
    Nyxynyx about 2 years

    Each row in a Pandas dataframe contains lat/lng coordinates of 2 points. Using the Python code below, calculating the distances between these 2 points for many (millions) of rows takes a very long time!

    Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?

    from math import radians, cos, sin, asin, sqrt
    def haversine(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a)) 
        km = 6367 * c
        return km
    
    
    for index, row in df.iterrows():
        df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])
    
    • ely
      ely about 9 years
      A better approach than approximation would be to profile the function to get a sense of exactly why it takes too long, followed by using ctypes/Cython/numba to translate the function as-is into a C function that runs without as much overhead. You may need to modify your calling convention to use the numpy array values of data underlying each pandas Series column of data, and you can also checkout numpy.ctypeslib for easy conversion from a numpy array to a ctypes-compatible array. It seems like a lot, but really it's a pretty easy way to access C functions in Python.
    • Steven Rumbalski
      Steven Rumbalski about 9 years
      It may be possible to avoid doing the calculation for a majority of candidates. Calculate the min and max longitutes and latitudes 50 miles from your starting point. Then use those those mins and maxes to weed out most of the candidates.
    • ely
      ely about 9 years
      You could also consider constructing a k-d tree out of the data, rather than storing it in a relational structure like a DataFrame. Then it would be cheap to get neighbors of a given point, and perhaps you could only calculate distances on demand. Does the application always need every pair? Yet another option could be to cluster the points and use each cluster's centroid/mean as a proxy. Then the distance between any two points would be approximated by the distance between only the cluster centers. It's speculative whether anything fancy like this is really better than brute force though.
    • ely
      ely about 9 years
      In the clustering case, you might also want to calculate true distances for all pairs in the same cluster (for each cluster) so that those points don't have effective distance of zero. This is also amenable to simple parallelization: give each process a copy of the data (or have it load a copy) along with a list of indices that it is responsible for. Then that process computes pairwise distance of those indices against all other indices, and writes them somewhere. With only millions of rows, this should be reasonable on modest hardware.
    • Steven Rumbalski
      Steven Rumbalski about 9 years
      There is a haversine module on pypi that is implemented in C.
    • Nyxynyx
      Nyxynyx about 9 years
      Wow the suggestions in the comments are beyond me! I was wondering whether I can do just sqrt((lat2 - lat1)^^2 + (lon2 - lon1)^^2) will that even work? All the points lie within new york city.
    • derricw
      derricw about 9 years
      @Nyxynyx The function you provided in your question gives the great circle distance. The calculation in your comment gives the euclidean distance. Because the radius of the earth is so large, you can absolutely approximate with the euclidean version for small distances.
    • Nyxynyx
      Nyxynyx about 9 years
      @ballsdotballs Seems like the euclidean distance will work for me! How should the result of sqrt((lat2 - lat1)^^2 + (lon2 - lon1)^^2) be converted into miles?
    • reptilicus
      reptilicus about 9 years
      Yeah the euclidian approximation will work fine for small enough distances. You shouldn't even need to do an apply for that, can directly just use the columns in the dataframe.
    • dustin
      dustin about 8 years
      Are your lat1, long1, lat1, long1 that are calculated all on the same row?
    • esperluette
      esperluette about 5 years
  • Divakar
    Divakar about 9 years
    Good to know about that term array programming, didn't come across it with MATLAB.
  • Dennis Golomazov
    Dennis Golomazov over 7 years
    Thank you very much for this. Small suggestion: add a real-world example usage with actual coordinates, instead of random values, to clarify the input format.
  • Dennis Golomazov
    Dennis Golomazov over 7 years
    Note that this also works when one pair of arguments is a Series and another is a tuple: haversine_np(pd.Series([-74.00594, -122.41942]), pd.Series([40.71278, 37.77493]), -87.65005, 41.85003) calculates distance between (New York, San Francisco) and Chicago.
  • Dennis Golomazov
    Dennis Golomazov over 7 years
    Another minor suggestion: you might want to exchange the order of function arguments to lat, lon. In many sources latitude goes first, e.g. in en.wikipedia.org/wiki/Horizontal_position_representation.
  • LetsPlayYahtzee
    LetsPlayYahtzee almost 6 years
    u know it actually works even with different sized arrays ;) through dimension projection
  • Leandro Lima
    Leandro Lima almost 6 years
    @DennisGolomazov afaik GIS libraries and software usually order it as lon, lat. Latitude first seems to be mainly a Google Maps choice. I'd personally find it odd as lat, lon.
  • MCMZL
    MCMZL over 5 years
    Just watch out that the arguments order shall be lat, lon unlike in many GIS libraries
  • OverflowingTheGlass
    OverflowingTheGlass over 5 years
    How would I tie the resulting array back to the original df?
  • DGenchev
    DGenchev about 4 years
    How can it be applied to different sized arrays, i.e. to find the distance between each point in the first array and every point in the second array?
  • Zach
    Zach almost 4 years
    I made a feature request to sklearn to add your code: github.com/scikit-learn/scikit-learn/issues/17212
  • NFern
    NFern almost 4 years
    This is awesome!
  • Cninroh
    Cninroh over 3 years
    It should also be noted that numpy implementation in this specific case will be 12 times slower than implementation using from math import radians, cos, sin, asin, sqrt and doing it a bit more manually. numpy benefits will show face when you apply it to a whole dataframe at once.