How to (smartly) loop over all points in a GeoDataframe and look at nearest neighbours

python pandas geopandas shapely

10,046

I would use spatial index for that. You can use the capability of libpysal, which uses KDTree under the hood. For 2000 random points, following code runs for 3.5s compared to your, which runs for ages (I've lost patience after the first minute). Saving values to list and then transforming the list into the column of DF saves you some time as well.

import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp
import libpysal

points=[
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values=[9,8,7,6,5,4,3,2,1]

df=pd.DataFrame({'points':points,'values':values})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})

knn3 = libpysal.weights.KNN.from_dataframe(gdf, k=3)

means = []
for index, row in gdf.iterrows(): # Looping over all points
    knn_neighbors = knn3.neighbors[index]
    knnsubset = gdf.iloc[knn_neighbors]
    neighbors = []
    for ix, r in knnsubset.iterrows():
        if r.geometry.distance(row.geometry) < 3: # max distance here
            neighbors.append(ix)

    subset = gdf.iloc[list(neighbors)]
    means.append(np.mean(subset['values']))
gdf['mean'] = means

This is the result:

          points  values       geometry      mean
0  POINT (1 1.1)       9  POINT (1 1.1)  6.333333
1  POINT (1 1.9)       8  POINT (1 1.9)  7.000000
2  POINT (1 3.1)       7  POINT (1 3.1)  5.666667
3    POINT (2 1)       6    POINT (2 1)  5.666667
4  POINT (2 2.1)       5  POINT (2 2.1)  4.666667
5  POINT (2 2.9)       4  POINT (2 2.9)  4.333333
6  POINT (3 0.8)       3  POINT (3 0.8)  4.333333
7    POINT (3 2)       2    POINT (3 2)  3.000000
8    POINT (3 3)       1    POINT (3 3)  3.666667

10,046

Rasmus Mackeprang

Recovering physicist working in finance. My screwdriver of choice has evovled from F77 over C++ to Python. I've also had to pick up a bit of SAS and SQL along the way.

Updated on June 04, 2022

Comments

Rasmus Mackeprang almost 2 years

I have a large (O(10^6) rows) dataset (points with values) where I need to do the following for all points:

Find the 3 nearest points within a predefined radius.
Calculate the mean of an associated value to these three points.
Save that mean value to the point I am looking at

The "non-vectorised" approach would be to simply loop over all points... for all points and then apply the logic. That scales poorly, however.

I have included a toy example that does what I want. Of ideas I have already considered are:

using shapely.ops.nearest_points: That, however, only appears to return the one nearest point.
buffering around each individual point and making an sjoin with the original GeoDataframe: That seems like it would scale even poorer than the naive approach.

Here's a toy example of the logic I want to implement:

import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp

points=[
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values=[9,8,7,6,5,4,3,2,1]

df=pd.DataFrame({'points':points,'values':values})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})

for index,row in gdf.iterrows(): # Looping over all points
    gdf['dist'] = np.nan
    for index2,row2 in gdf.iterrows(): # Looping over all the other points
        if index==index2: continue
        d=row['geometry'].distance(row2['geometry']) # Calculate distance
        if d<3: gdf.at[index2,'dist']=d # If within cutoff: Store
        else: gdf.at[index2,'dist']=np.nan # Otherwise, be paranoid and leave NAN
    # Calculating mean of values for the 3 nearest points and storing 
    gdf.at[index,'mean']=np.mean(gdf.sort_values('dist').head(3)['values'].tolist())

print(gdf)

The resulting GeoDataframe is here:

          points  values       geometry      dist      mean
0  POINT (1 1.1)       9  POINT (1 1.1)  2.758623  6.333333
1  POINT (1 1.9)       8  POINT (1 1.9)  2.282542  7.000000
2  POINT (1 3.1)       7  POINT (1 3.1)  2.002498  5.666667
3    POINT (2 1)       6    POINT (2 1)  2.236068  5.666667
4  POINT (2 2.1)       5  POINT (2 2.1)  1.345362  4.666667
5  POINT (2 2.9)       4  POINT (2 2.9)  1.004988  4.333333
6  POINT (3 0.8)       3  POINT (3 0.8)  2.200000  4.333333
7    POINT (3 2)       2    POINT (3 2)  1.000000  3.000000
8    POINT (3 3)       1    POINT (3 3)       NaN  3.666667

You can see the state of the last iteration.

All distances have been calculated apart from the final place which was left at NAN.
The mean value of the last iteration is the mean of the values of the three nearest points: 2, 4 and 5, namely 3,666667.

How do I do this in a more scalable manner?

Erfan almost 5 years

These are quite some calculations and steps, you should add an expected output also in the form of a dataframe.
Rasmus Mackeprang almost 5 years

Added output with commentary.

Rasmus Mackeprang almost 5 years

Thanks @martinfleis. That seems to be exactly what I needed. I had for the time being resorted to something along the lines of for index,row in gdf.iterrows(): # Looping over all points df_tmp=gdf[gdf.geometry.within(row['geometry'].buffer(3))] df_tmp['dist']=df_tmp.geometry.distance(row['geometry']) but this definitely looks like it couls have benefits. Thanks.
martinfleis almost 5 years

You can use your approach, but in that case I would recommend using spatial index for larger number of points, otherwise you would be still checking each point with every other. I still think it would be slower anyway.
Rasmus Mackeprang almost 5 years

OK, now I tried it out, and it does give a few orders of magnitude in speed. Nice. I get a warning message, though: "UserWarning: The weights matrix is not fully connected. There are 159 components". What does that mean?
martinfleis almost 5 years

libpysal.weights.KNN is generating weights matrix, sort of connections between the points. You can imagine it as a network. If you can't reach every point from each of them following this network, it means it is not fully connected. For your purpose, it has no consequences, you can happily ignore it.
Rasmus Mackeprang almost 5 years

OK. I was worried that coinciding points might give this error. Thanks again.