Finding similar strings with PostgreSQL quickly

47,261

The way you have it, similarity between every element and every other element of the table has to be calculated (almost a cross join). If your table has 1000 rows, that's already 1,000,000 (!) similarity calculations, before those can be checked against the condition and sorted. Scales terribly.

Use SET pg_trgm.similarity_threshold and the % operator instead. Both are provided by the pg_trgm module. This way, a trigram GiST index can be used to great effect.

The configuration parameter pg_trgm.similarity_threshold replaced the functions set_limit() and show_limit() in Postgres 9.6. The deprecated functions still work (as of Postgres 13). Also, performance of GIN and GiST indexes improved in many ways since Postgres 9.1.

Try instead:

SET pg_trgm.similarity_threshold = 0.8;  -- Postgres 9.6 or later
  
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;

Faster by orders of magnitude, but still slow.

pg_trgm.similarity_threshold is a "customized" option, which can be handled like any other option. See:

You may want to restrict the number of possible pairs by adding preconditions (like matching first letters) before cross joining (and support that with a matching functional index). The performance of a cross join deteriorates with O(N²).

This does not work because you cannot refer to output columns in WHERE or HAVING clauses:

WHERE ... sim > 0.8

That's according to the SQL standard (which is handled rather loosely by certain other RDBMS). On the other hand:

ORDER BY sim DESC

Works because output columns can be used in GROUP BY and ORDER BY. See:

Test case

I ran a quick test on my old test server to verify my claims.
PostgreSQL 9.1.4. Times taken with EXPLAIN ANALYZE (best of 5).

CREATE TEMP table t AS 
SELECT some_col AS name FROM some_table LIMIT 1000;  -- real life test strings

First round of tests with GIN index:

CREATE INDEX t_gin ON t USING gin(name gin_trgm_ops);  -- round1: with GIN index

Second round of tests with GIST index:

DROP INDEX t_gin;
CREATE INDEX t_gist ON t USING gist(name gist_trgm_ops);

New query:

SELECT set_limit(0.8);

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   t n1
JOIN   t n2 ON n1.name <> n2.name
           AND n1.name % n2.name
ORDER  BY sim DESC;

GIN index used, 64 hits: total runtime: 484.022 ms
GIST index used, 64 hits: total runtime: 248.772 ms

Old query:

SELECT (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
FROM   t n1, t n2
WHERE  n1.name != n2.name
AND    similarity(n1.name, n2.name) > 0.8
ORDER  BY sim DESC;

GIN index not used, 64 hits: total runtime: 6345.833 ms
GIST index not used, 64 hits: total runtime: 6335.975 ms

Otherwise identical results. Advice is good. And this is for just 1000 rows!

GIN or GiST?

GIN often provides superior read performance:

But not in this particular case!

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

Share:
47,261

Related videos on Youtube

cdarwin
Author by

cdarwin

Updated on July 09, 2022

Comments

  • cdarwin
    cdarwin almost 2 years

    I need to create a ranking of similar strings in a table.

    I have the following table

    create table names (
    name character varying(255)
    );
    

    Currently, I'm using pg_trgm module which offers the similarity function, but I have an efficiency problem. I created an index like the Postgres manual suggests:

    CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops);
    

    and I'm executing the following query:

    select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name
    from names n1, names n2
    where n1.name != n2.name and similarity(n1.name, n2.name) > .8
    order by sim desc;
    

    The query works, but is really slow when you have hundreds of names. Moreover, maybe I forgot a bit of SQL, but I don't understand why I cannot use the condition and sim > .8 without getting a "column sim doesn't exist" error.

    I'd like any hint to make the query faster.

  • cdarwin
    cdarwin almost 12 years
    Wonderful answer, thank you. You're right, I could add a condition on the matching of the first letter, but in those "names" I have names and surnames, sometimes written as "name, surname", sometimes as "surname, name" ... My additional question wasn't related to the use of the alias in the order by, but in the where condition. I thought the similarity could be calculated once for each pair.
  • Erwin Brandstetter
    Erwin Brandstetter almost 12 years
    @cdarwin: Ah, I got your subsidiary question wrong, sorry. Amended now. The information was still good - in particular, the link I provided applies, regardless.
  • beldaz
    beldaz over 7 years
    Note set_limit() is now deprecated, in lieu of the similarity_threshold GUC variable.
  • Erwin Brandstetter
    Erwin Brandstetter over 7 years
    @beldaz: Thanks, I added pointers for the upcoming Postgres 9.6.
  • Maxim Yefremov
    Maxim Yefremov almost 4 years
    how to print my current pg_trgm.similarity_threshold ?
  • Erwin Brandstetter
    Erwin Brandstetter almost 4 years
    @MaximYefremov: SHOW pg_trgm.similarity_threshold
  • jiamo
    jiamo almost 4 years
    @ErwinBrandstetter may be need AND n1.name < n2.name
  • Константин Ван
    Константин Ван almost 3 years
    And what does that operator do to performance? Isn't it just another way of doing similarity() calls?
  • Erwin Brandstetter
    Erwin Brandstetter almost 3 years
    @КонстантинВан: Well, yes. But this way a trigram GiST index can be used to great effect. My demo for only 1000 rows should already be clear. Using the index scales close to linearly, while you get O(N²) without. This is proven good. You may have missed the part where this answer was written in 2012 (and only updated just now).
  • Константин Ван
    Константин Ван almost 3 years
  • HMarioD
    HMarioD over 2 years
    Thank you old men!
  • HMarioD
    HMarioD over 2 years
    I am getting an error when trying to set the similarity_threshold. SET pg_trgm.similarity_threshold = _threshold; "ERROR: parameter "pg_trgm.similarity_threshold" requires a numeric value CONTEXT: SQL statement "SET pg_trgm.similarity_threshold = _threshold" PL/pgSQL function questions_get_similar_ones(integer,real) line 3 at SET SQL state: 22023" The _threshold is a variable parameter: CREATE OR REPLACE FUNCTION imp.questions_get_similar_ones( _question integer, _threshold real DEFAULT 0.79)
  • Erwin Brandstetter
    Erwin Brandstetter over 2 years
    @HMarioD: EXECUTE 'SET pg_trgm.similarity_threshold = ' || _threshold; See: stackoverflow.com/a/36025963/939860 (This is safe against SQLi while the input is a numeric type.)
  • HMarioD
    HMarioD over 2 years
    Thanks Erwin great old man, works!!! :)