How to (smartly) loop over all points in a GeoDataframe and look at nearest neighbours

10,046

I would use spatial index for that. You can use the capability of libpysal, which uses KDTree under the hood. For 2000 random points, following code runs for 3.5s compared to your, which runs for ages (I've lost patience after the first minute). Saving values to list and then transforming the list into the column of DF saves you some time as well.

import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp
import libpysal

points=[
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values=[9,8,7,6,5,4,3,2,1]

df=pd.DataFrame({'points':points,'values':values})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})

knn3 = libpysal.weights.KNN.from_dataframe(gdf, k=3)

means = []
for index, row in gdf.iterrows(): # Looping over all points
    knn_neighbors = knn3.neighbors[index]
    knnsubset = gdf.iloc[knn_neighbors]
    neighbors = []
    for ix, r in knnsubset.iterrows():
        if r.geometry.distance(row.geometry) < 3: # max distance here
            neighbors.append(ix)

    subset = gdf.iloc[list(neighbors)]
    means.append(np.mean(subset['values']))
gdf['mean'] = means

This is the result:

          points  values       geometry      mean
0  POINT (1 1.1)       9  POINT (1 1.1)  6.333333
1  POINT (1 1.9)       8  POINT (1 1.9)  7.000000
2  POINT (1 3.1)       7  POINT (1 3.1)  5.666667
3    POINT (2 1)       6    POINT (2 1)  5.666667
4  POINT (2 2.1)       5  POINT (2 2.1)  4.666667
5  POINT (2 2.9)       4  POINT (2 2.9)  4.333333
6  POINT (3 0.8)       3  POINT (3 0.8)  4.333333
7    POINT (3 2)       2    POINT (3 2)  3.000000
8    POINT (3 3)       1    POINT (3 3)  3.666667
Share:
10,046

Related videos on Youtube

Rasmus Mackeprang
Author by

Rasmus Mackeprang

Recovering physicist working in finance. My screwdriver of choice has evovled from F77 over C++ to Python. I've also had to pick up a bit of SAS and SQL along the way.

Updated on June 04, 2022

Comments

  • Rasmus Mackeprang
    Rasmus Mackeprang almost 2 years

    I have a large (O(10^6) rows) dataset (points with values) where I need to do the following for all points:

    • Find the 3 nearest points within a predefined radius.
    • Calculate the mean of an associated value to these three points.
    • Save that mean value to the point I am looking at

    The "non-vectorised" approach would be to simply loop over all points... for all points and then apply the logic. That scales poorly, however.

    I have included a toy example that does what I want. Of ideas I have already considered are:

    • using shapely.ops.nearest_points: That, however, only appears to return the one nearest point.
    • buffering around each individual point and making an sjoin with the original GeoDataframe: That seems like it would scale even poorer than the naive approach.

    Here's a toy example of the logic I want to implement:

    import pandas as pd
    import numpy as np
    from shapely.wkt import loads
    import geopandas as gp
    
    points=[
        'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
        'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
        'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
    ]
    values=[9,8,7,6,5,4,3,2,1]
    
    df=pd.DataFrame({'points':points,'values':values})
    gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})
    
    for index,row in gdf.iterrows(): # Looping over all points
        gdf['dist'] = np.nan
        for index2,row2 in gdf.iterrows(): # Looping over all the other points
            if index==index2: continue
            d=row['geometry'].distance(row2['geometry']) # Calculate distance
            if d<3: gdf.at[index2,'dist']=d # If within cutoff: Store
            else: gdf.at[index2,'dist']=np.nan # Otherwise, be paranoid and leave NAN
        # Calculating mean of values for the 3 nearest points and storing 
        gdf.at[index,'mean']=np.mean(gdf.sort_values('dist').head(3)['values'].tolist())
    
    print(gdf)
    

    The resulting GeoDataframe is here:

              points  values       geometry      dist      mean
    0  POINT (1 1.1)       9  POINT (1 1.1)  2.758623  6.333333
    1  POINT (1 1.9)       8  POINT (1 1.9)  2.282542  7.000000
    2  POINT (1 3.1)       7  POINT (1 3.1)  2.002498  5.666667
    3    POINT (2 1)       6    POINT (2 1)  2.236068  5.666667
    4  POINT (2 2.1)       5  POINT (2 2.1)  1.345362  4.666667
    5  POINT (2 2.9)       4  POINT (2 2.9)  1.004988  4.333333
    6  POINT (3 0.8)       3  POINT (3 0.8)  2.200000  4.333333
    7    POINT (3 2)       2    POINT (3 2)  1.000000  3.000000
    8    POINT (3 3)       1    POINT (3 3)       NaN  3.666667
    

    You can see the state of the last iteration.

    • All distances have been calculated apart from the final place which was left at NAN.
    • The mean value of the last iteration is the mean of the values of the three nearest points: 2, 4 and 5, namely 3,666667.

    How do I do this in a more scalable manner?

    • Erfan
      Erfan almost 5 years
      These are quite some calculations and steps, you should add an expected output also in the form of a dataframe.
    • Rasmus Mackeprang
      Rasmus Mackeprang almost 5 years
      Added output with commentary.
  • Rasmus Mackeprang
    Rasmus Mackeprang almost 5 years
    Thanks @martinfleis. That seems to be exactly what I needed. I had for the time being resorted to something along the lines of for index,row in gdf.iterrows(): # Looping over all points df_tmp=gdf[gdf.geometry.within(row['geometry'].buffer(3))] df_tmp['dist']=df_tmp.geometry.distance(row['geometry']) but this definitely looks like it couls have benefits. Thanks.
  • martinfleis
    martinfleis almost 5 years
    You can use your approach, but in that case I would recommend using spatial index for larger number of points, otherwise you would be still checking each point with every other. I still think it would be slower anyway.
  • Rasmus Mackeprang
    Rasmus Mackeprang almost 5 years
    OK, now I tried it out, and it does give a few orders of magnitude in speed. Nice. I get a warning message, though: "UserWarning: The weights matrix is not fully connected. There are 159 components". What does that mean?
  • martinfleis
    martinfleis almost 5 years
    libpysal.weights.KNN is generating weights matrix, sort of connections between the points. You can imagine it as a network. If you can't reach every point from each of them following this network, it means it is not fully connected. For your purpose, it has no consequences, you can happily ignore it.
  • Rasmus Mackeprang
    Rasmus Mackeprang almost 5 years
    OK. I was worried that coinciding points might give this error. Thanks again.