How to (smartly) loop over all points in a GeoDataframe and look at nearest neighbours
I would use spatial index for that. You can use the capability of libpysal
, which uses KDTree under the hood. For 2000 random points, following code runs for 3.5s compared to your, which runs for ages (I've lost patience after the first minute). Saving values to list and then transforming the list into the column of DF saves you some time as well.
import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp
import libpysal
points=[
'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values=[9,8,7,6,5,4,3,2,1]
df=pd.DataFrame({'points':points,'values':values})
gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)})
knn3 = libpysal.weights.KNN.from_dataframe(gdf, k=3)
means = []
for index, row in gdf.iterrows(): # Looping over all points
knn_neighbors = knn3.neighbors[index]
knnsubset = gdf.iloc[knn_neighbors]
neighbors = []
for ix, r in knnsubset.iterrows():
if r.geometry.distance(row.geometry) < 3: # max distance here
neighbors.append(ix)
subset = gdf.iloc[list(neighbors)]
means.append(np.mean(subset['values']))
gdf['mean'] = means
This is the result:
points values geometry mean
0 POINT (1 1.1) 9 POINT (1 1.1) 6.333333
1 POINT (1 1.9) 8 POINT (1 1.9) 7.000000
2 POINT (1 3.1) 7 POINT (1 3.1) 5.666667
3 POINT (2 1) 6 POINT (2 1) 5.666667
4 POINT (2 2.1) 5 POINT (2 2.1) 4.666667
5 POINT (2 2.9) 4 POINT (2 2.9) 4.333333
6 POINT (3 0.8) 3 POINT (3 0.8) 4.333333
7 POINT (3 2) 2 POINT (3 2) 3.000000
8 POINT (3 3) 1 POINT (3 3) 3.666667
Related videos on Youtube
Rasmus Mackeprang
Recovering physicist working in finance. My screwdriver of choice has evovled from F77 over C++ to Python. I've also had to pick up a bit of SAS and SQL along the way.
Updated on June 04, 2022Comments
-
Rasmus Mackeprang almost 2 years
I have a large (O(10^6) rows) dataset (points with values) where I need to do the following for all points:
- Find the 3 nearest points within a predefined radius.
- Calculate the mean of an associated value to these three points.
- Save that mean value to the point I am looking at
The "non-vectorised" approach would be to simply loop over all points... for all points and then apply the logic. That scales poorly, however.
I have included a toy example that does what I want. Of ideas I have already considered are:
- using shapely.ops.nearest_points: That, however, only appears to return the one nearest point.
- buffering around each individual point and making an sjoin with the original GeoDataframe: That seems like it would scale even poorer than the naive approach.
Here's a toy example of the logic I want to implement:
import pandas as pd import numpy as np from shapely.wkt import loads import geopandas as gp points=[ 'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)', 'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)', 'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)' ] values=[9,8,7,6,5,4,3,2,1] df=pd.DataFrame({'points':points,'values':values}) gdf=gp.GeoDataFrame(df,geometry=[loads(x) for x in df.points], crs={'init': 'epsg:' + str(25832)}) for index,row in gdf.iterrows(): # Looping over all points gdf['dist'] = np.nan for index2,row2 in gdf.iterrows(): # Looping over all the other points if index==index2: continue d=row['geometry'].distance(row2['geometry']) # Calculate distance if d<3: gdf.at[index2,'dist']=d # If within cutoff: Store else: gdf.at[index2,'dist']=np.nan # Otherwise, be paranoid and leave NAN # Calculating mean of values for the 3 nearest points and storing gdf.at[index,'mean']=np.mean(gdf.sort_values('dist').head(3)['values'].tolist()) print(gdf)
The resulting GeoDataframe is here:
points values geometry dist mean 0 POINT (1 1.1) 9 POINT (1 1.1) 2.758623 6.333333 1 POINT (1 1.9) 8 POINT (1 1.9) 2.282542 7.000000 2 POINT (1 3.1) 7 POINT (1 3.1) 2.002498 5.666667 3 POINT (2 1) 6 POINT (2 1) 2.236068 5.666667 4 POINT (2 2.1) 5 POINT (2 2.1) 1.345362 4.666667 5 POINT (2 2.9) 4 POINT (2 2.9) 1.004988 4.333333 6 POINT (3 0.8) 3 POINT (3 0.8) 2.200000 4.333333 7 POINT (3 2) 2 POINT (3 2) 1.000000 3.000000 8 POINT (3 3) 1 POINT (3 3) NaN 3.666667
You can see the state of the last iteration.
- All distances have been calculated apart from the final place which was left at NAN.
- The mean value of the last iteration is the mean of the values of the three nearest points: 2, 4 and 5, namely 3,666667.
How do I do this in a more scalable manner?
-
Erfan almost 5 yearsThese are quite some calculations and steps, you should add an expected output also in the form of a dataframe.
-
Rasmus Mackeprang almost 5 yearsAdded output with commentary.
-
Rasmus Mackeprang almost 5 yearsThanks @martinfleis. That seems to be exactly what I needed. I had for the time being resorted to something along the lines of for index,row in gdf.iterrows(): # Looping over all points df_tmp=gdf[gdf.geometry.within(row['geometry'].buffer(3))] df_tmp['dist']=df_tmp.geometry.distance(row['geometry']) but this definitely looks like it couls have benefits. Thanks.
-
martinfleis almost 5 yearsYou can use your approach, but in that case I would recommend using spatial index for larger number of points, otherwise you would be still checking each point with every other. I still think it would be slower anyway.
-
Rasmus Mackeprang almost 5 yearsOK, now I tried it out, and it does give a few orders of magnitude in speed. Nice. I get a warning message, though: "UserWarning: The weights matrix is not fully connected. There are 159 components". What does that mean?
-
martinfleis almost 5 years
libpysal.weights.KNN
is generating weights matrix, sort of connections between the points. You can imagine it as a network. If you can't reach every point from each of them following this network, it means it is not fully connected. For your purpose, it has no consequences, you can happily ignore it. -
Rasmus Mackeprang almost 5 yearsOK. I was worried that coinciding points might give this error. Thanks again.