K-Nearest Neighbor Query in PostGIS

Solution 1

Just a few thoughts on your problem:

st_distance and st_area cannot use indexes. Neither function can be reduced to a question like "Is a within b?" or "Do a and b overlap?". More concretely: GiST indexes operate only on the bounding boxes of the objects.

For more information, see the PostGIS manual, which gives an example with st_distance and shows how the query can be rewritten to perform better.

However, this does not solve your k-nearest-neighbour problem. Right now I do not have a good idea for improving the performance of that query. The only option I see is to assume that the k nearest neighbours always lie within x meters. Then you could use an approach similar to the one in the PostGIS manual.
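
A minimal sketch of that idea, assuming the `points` and `polygons` tables from the question, a projected coordinate system in meters, a hypothetical cut-off of 1000 meters, and k = 5 (none of these values come from the question). `ST_DWithin` can use the GiST index to discard distant pairs before the exact distance is computed, and a window function keeps the k closest polygons per point:

```sql
-- Assumptions: the_geom is in a meter-based SRS; 1000 and 5 are placeholder values.
SELECT point_gid, polygon_gid, dist
FROM (
    SELECT g1.gid AS point_gid,
           g2.gid AS polygon_gid,
           ST_Distance(g1.the_geom, g2.the_geom) AS dist,
           row_number() OVER (
               PARTITION BY g1.gid
               ORDER BY ST_Distance(g1.the_geom, g2.the_geom)
           ) AS rn
    FROM points AS g1
    JOIN polygons AS g2
      -- index-assisted pre-filter: only pairs within 1000 units survive
      ON ST_DWithin(g1.the_geom, g2.the_geom, 1000)
) AS ranked
WHERE rn <= 5
ORDER BY point_gid, dist;
```

A point with fewer than k polygons inside the radius returns fewer than k rows, so the cut-off has to be chosen generously for the data at hand.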

Your second query could be sped up a bit. Currently, the area of each object is computed once for every row it is paired with, because the data is joined first and then filtered with the function. You can reduce the number of area computations significantly by precomputing the area:

WITH polygonareas AS (
    SELECT gid, the_geom, st_area(the_geom) AS area
    FROM polygons
)
SELECT g1.gid, g2.gid
FROM polygonareas as g1 , polygonareas as g2 
WHERE g1.area > g2.area;

Your third query can be significantly optimized using bounding boxes: when the bounding boxes of two objects do not overlap, the objects cannot overlap either. This lets the query use an existing index, which gives a huge performance gain.
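
As a sketch, the bounding-box test can be written explicitly with the `&&` operator, using the `polygon` and `roads` tables from the question. The `GROUP BY` clause and `ST_Length` (the current name of the older `length` function) are assumptions about what the query is meant to do. Recent PostGIS versions already perform this bounding-box check inside `ST_Intersects`, so the explicit `&&` may be redundant there:

```sql
SELECT a.polyid,
       sum(ST_Length(b.the_geom)) AS total_length
FROM polygon AS a
JOIN roads AS b
  -- cheap, index-assisted test on bounding boxes first
  ON a.the_geom && b.the_geom
  -- exact geometric test only for the surviving pairs
 AND ST_Intersects(a.the_geom, b.the_geom)
GROUP BY a.polyid;
```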

Solution 2

Since late September 2011, PostGIS has supported indexed nearest neighbor queries via special operators usable in the ORDER BY clause:

SELECT name, gid
FROM geonames
ORDER BY geom <-> st_setsrid(st_makepoint(-90,40),4326)
LIMIT 10;

...will return the 10 objects whose geom is nearest -90,40 in a scalable way. A few more details (options and caveats) are in that announcement post, and use of the <-> and <#> operators is also now documented in the official PostGIS 2.0 reference. (The main difference between the two is that <-> compares the centroids of the shapes' bounding boxes and <#> compares the bounding boxes themselves. There is no difference for points; for other shapes, choose whichever is appropriate for your queries.)
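
For non-point geometries the index-assisted ordering in PostGIS 2.0 is box-based and therefore approximate. One common workaround, sketched here under the assumption that the true 10 nearest objects are among the first 100 candidates (the 100 is an arbitrary placeholder), is to fetch an oversized candidate set with <-> and then re-sort it by exact distance:

```sql
WITH candidates AS (
    -- index-assisted, approximate ordering
    SELECT name, gid, geom
    FROM geonames
    ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-90, 40), 4326)
    LIMIT 100
)
SELECT name, gid
FROM candidates
-- exact ordering on the small candidate set
ORDER BY ST_Distance(geom, ST_SetSRID(ST_MakePoint(-90, 40), 4326))
LIMIT 10;
```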

Solution 3

You can do it with a KNN index and a lateral join.

SELECT v.gid, v2.gid, st_distance(v.the_geom, v2.the_geom)
  FROM geonames v,
       lateral (select *
                  from geonames v2
                 where v2.gid <> v.gid
                 ORDER BY v.the_geom <-> v2.the_geom LIMIT 10) v2
 where v.gid in (...) -- or other filtering condition

Solution 4

What you may need is the KNN index which is hopefully available soon in PostGIS 2.x and PostgreSQL 9.1: See http://blog.opengeo.org/tag/knn/

Updated on June 19, 2022

Comments

  • Abhishek Sagar
    Abhishek Sagar almost 2 years

    I am using the following Nearest Neighbor Query in PostGIS :

    SELECT g1.gid, g2.gid FROM points as g1, polygons g2
    WHERE g1.gid <> g2.gid
    ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom)
    LIMIT k;
    

    Although I have created indexes on the the_geom and gid columns of both tables, this query takes much longer than other spatial queries involving spatial joins between two tables.

    Is there any better way to find K-nearest neighbors? I am using PostGIS.

    Another query that takes an unusually long time, despite the indexes on the geometry column, is:

    select g1.gid , g2.gid from polygons as g1 , polygons as g2
    where st_area(g1.the_geom) > st_area(g2.the_geom) ;
    

    I believe these queries do not benefit from GiST indexes, but why?

    Whereas this query:

    select a.polyid , sum(length(b.the_geom)) from polygon as a , roads as b
    where st_intersects(a.the_geom , b.the_geom)
    group by a.polyid;
    

    returns its result after some time, even though it involves the "roads" table, which is much bigger than the polygons or points tables, and uses more complex spatial operators.

  • John Powell
    John Powell about 10 years
    A major caveat of these two operators, as it says on the linked PostGIS reference pages, is that the spatial index will only kick in if one of the geometries is a constant, as with st_makepoint in your example. This means you can't use these operators with efficient index usage to answer the OP's question, which involves finding all geometries A near some other set of geometries B.
  • natevw
    natevw about 10 years
    Ah, good point. Thanks for raising it. So is @Stefan's answer the "correct" one then, just needing a bit more detail and updated link(s)?