Does PostgreSQL support "accent insensitive" collations?


Solution 1

Use the unaccent module for that - which is completely different from what you are linking to.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

If you get an error like:

ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory

Install the contrib package on your database server as instructed in this related answer.
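A quick way to check whether the contrib package has made the extension available on the server (pg_available_extensions is a standard catalog view):

SELECT * FROM pg_available_extensions WHERE name = 'unaccent';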

Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).

SELECT *
FROM   users
WHERE  unaccent(name) = unaccent('João');

Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.
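To see the problem first-hand: trying to index the shipped unaccent() directly fails (using the users table from the question):

CREATE INDEX ON users (unaccent(name));
-- ERROR:  functions in index expression must be marked IMMUTABLE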

unaccent() is only STABLE, not IMMUTABLE

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

  1. It depends on the behavior of a dictionary.
  2. There is no hard-wired connection to this dictionary.
  3. It therefore also depends on the current search_path, which can change easily (demonstrated below).
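To illustrate: the one-parameter form looks up a dictionary named unaccent via the current search_path at call time, while the two-parameter form can name it explicitly (a sketch, assuming the extension is installed in schema public):

SELECT unaccent('João');                     -- dictionary resolved via search_path
SELECT unaccent('unaccent', 'João');         -- dictionary named, but still resolved via search_path
SELECT unaccent('public.unaccent', 'João');  -- schema-qualified, unambiguous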

Some tutorials on the web instruct you to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.

Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).

There is an ongoing debate whether to make the two-parameter variant, which declares the used dictionary explicitly, IMMUTABLE.

Another alternative is the module with an IMMUTABLE unaccent() function provided by MusicBrainz on GitHub. I haven't tested it myself. I think I have come up with a better idea:

Best for now

This approach is more efficient than other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired schema-qualified function and dictionary.

Since nesting a non-immutable function would disable function inlining, base it on a copy of the C function, also (fake) declared IMMUTABLE. Its only purpose is to be used in the SQL function wrapper; it is not meant to be used on its own.

The sophistication is needed because there is no way to hard-wire the dictionary in the declaration of the C function. (That would require hacking the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.

CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
  RETURNS text LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;

Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.

public being the schema where you installed the extension (public is the default).

The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.
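As a quick sanity check, the wrapper should strip the diacritic under the standard unaccent rules:

SELECT public.f_unaccent('João');  -- 'Joao'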

Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version, which added SET search_path = public, pg_temp to the function, until I discovered that the dictionary can be schema-qualified, too. That is still (as of Postgres 12) not too obvious from the documentation.

If you lack the necessary privileges to create C functions, you are back to the second-best implementation: an IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;

Finally, the expression index to make queries fast:

CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));

Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.
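For instance, to rebuild the expression index from above after such a change:

REINDEX INDEX users_unaccent_name_idx;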

Adapt queries to match the index (so the query planner will use it):

SELECT * FROM users
WHERE  f_unaccent(name) = f_unaccent('João');

You don't need the function on the right-hand side of the expression. There you can also supply already unaccented strings like 'Joao' directly.
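For example, with a search term that is already unaccented:

SELECT * FROM users
WHERE  f_unaccent(name) = 'Joao';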

The faster function does not translate to much faster queries using the expression index. The index operates on pre-computed values and is very fast already. But index maintenance and queries not using the index benefit.

Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name, as demonstrated, when they are used in any index.

Ligatures

In Postgres 9.5 or older, ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
E A e a S

You will love this update to unaccent in Postgres 9.6:

Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Now we get:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
OE AE oe ae ss

Pattern matching

For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GiST expression index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);
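Like unaccent, the pg_trgm module has to be installed once per database before such an index can be created:

CREATE EXTENSION pg_trgm;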

The index can be used for queries like:

SELECT * FROM users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN and GiST indexes are more expensive to maintain than a plain B-tree index.

There are simpler solutions for just left-anchored patterns.

pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
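A sketch of how they might be used together with the f_unaccent() wrapper (the % operator returns true for sufficiently similar strings; ordering by <-> is best served by a GiST trigram index):

SELECT * FROM users
WHERE  f_unaccent(name) % f_unaccent('Joao');

SELECT name FROM users
ORDER  BY f_unaccent(name) <-> f_unaccent('Joao')
LIMIT  10;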

Trigram indexes also support simple regular expressions with ~ et al. and case-insensitive pattern matching with ILIKE.
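For example (a sketch reusing the trigram expression index from above; regular-expression support in trigram indexes requires Postgres 9.3 or later):

SELECT * FROM users
WHERE  f_unaccent(name) ILIKE '%joao%';

SELECT * FROM users
WHERE  f_unaccent(name) ~* 'joao';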

Solution 2

No, PostgreSQL does not support collations in that sense

PostgreSQL does not support collations like that (accent-insensitive or not), because no comparison can return equal unless the values are binary-equal: supporting anything else would internally introduce a lot of complexity for things like hash indexes. For this reason, collations in their strictest sense only affect ordering, not equality.

Workarounds

Full-Text-Search Dictionary that Unaccents lexemes.

For FTS, you can define your own dictionary using unaccent,

CREATE EXTENSION unaccent;

CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, simple;
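A quick way to verify that the new configuration strips accents (output as I'd expect it from the mapping above):

SELECT to_tsvector('mydict', 'fóó bar');
-- 'bar':2 'foo':1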

You can then use this configuration in a functional index:

-- Just some sample data...
CREATE TABLE myTable ( myCol )
  AS VALUES ('fóó bar baz'),('qux quz');

-- An index is not required for the query below, but it speeds things up
CREATE INDEX ON myTable
  USING GIST (to_tsvector('mydict', myCol));

You can now query it very simply

SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar'

    mycol    
-------------
 fóó bar baz
(1 row)
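If the search term itself may contain accents, run it through the same configuration as well, e.g. with plainto_tsquery (a sketch):

SELECT *
FROM   myTable
WHERE  to_tsvector('mydict', myCol) @@ plainto_tsquery('mydict', 'fóó bar');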

See also

Unaccent by itself.

The unaccent module can also be used by itself, without FTS integration; for that, check out Erwin's answer (Solution 1 above).

Solution 3

I'm pretty sure PostgreSQL relies on the underlying operating system for collation. It does support creating new collations, and customizing collations. I'm not sure how much work that might be for you, though. (Could be quite a lot.)
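A minimal sketch of what creating a collation from an OS locale looks like (the locale must exist on the server; note that such collations affect ordering, not accent-insensitive equality):

CREATE COLLATION german (locale = 'de_DE.utf8');

SELECT name FROM users ORDER BY name COLLATE "german";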

Author: Daniel Serodio (updated on October 04, 2020)

Comments

  • Daniel Serodio
    Daniel Serodio over 3 years

    In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like

    SELECT * FROM users WHERE name LIKE 'João'
    

    to find a row with a Joao name.

    I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.

  • Daniel Serodio
    Daniel Serodio almost 12 years
    In your solution, are indexes used, or would I need to create an index on unaccent(name)?
  • e3matheus
    e3matheus almost 11 years
@ErwinBrandstetter In psql 9.1.4, I get "functions in index expression must be marked IMMUTABLE", because the unaccent function is STABLE instead of IMMUTABLE. What do you recommend?
  • Erwin Brandstetter
    Erwin Brandstetter almost 11 years
    @e3matheus: Feeling guilty for not having tested the previous solution I provided, I investigated and updated my answer with a new and better (IMHO) solution for the problem than what is floating around so far.
  • Med
    Med about 10 years
    Isn't the collation utf8_general_ci the answer for this kind of issues?
  • Erwin Brandstetter
    Erwin Brandstetter about 10 years
@Med: This would apply to MySQL, and the better choice would be utf8_unicode_ci there.
  • Janko
    Janko over 9 years
Thank you for such a quality answer, you really covered every possibility. Even after reading the thread, I still don't understand why they didn't make it IMMUTABLE. Unaccenting doesn't depend on a dictionary; all rules are stored in my Postgres directory under share/postgresql/tsearch_data/unaccent.rules, so I'm positive it's the same for every dictionary. And why would it change anyway? Maybe I'll have some foreign words in English.
  • Craig Ringer
    Craig Ringer over 8 years
    New collation support is currently basically limited to wrappers and aliases for operating system locales. It's very basic. There's no support for filter functions, custom comparators, or any of what you'd need for true custom collations.
  • Teejay
    Teejay about 8 years
    Can we use varchar_pattern_ops instead of gin_trgm_ops?
  • Erwin Brandstetter
    Erwin Brandstetter about 8 years
    @Teejay: It depends. Please ask your question in a new question, comments are not the place. You can always link to this one for context.
  • electrotype
    electrotype over 6 years
    Your answers are as good as Postgres documentation : phenomenal!
  • Erwin Brandstetter
    Erwin Brandstetter over 4 years
    @DanielSerodio: Consider the update. Should make it much faster, yet.
  • a_horse_with_no_name
    a_horse_with_no_name about 4 years
    I wonder if accent-insensitive is now possible with ICU collations.
  • Erwin Brandstetter
    Erwin Brandstetter about 4 years
    @a_horse_with_no_name: I didn't have time to test it out, yet, but that's an intended use case.
  • Inkling
    Inkling over 3 years
    Note that the opening paragraph here is no longer strictly true as of Postgres 12 which introduced nondeterministic collations. However they're still not supported by pattern matching operators.
  • phlegx
    phlegx almost 2 years
    @ErwinBrandstetter using pg_dump and pg_restore I get this error: pg_restore: error: could not execute query: ERROR: text search dictionary "public.unaccent" does not exist LINE 2: SELECT public.immutable_unaccent(regdictionary 'public.una... Any idea what I do wrong?
  • Erwin Brandstetter
    Erwin Brandstetter almost 2 years
    @phlegx: You probably did not install the additional module unaccent (as instructed at the top) in the target database - or installed it to a different schema.
  • phlegx
    phlegx almost 2 years
@ErwinBrandstetter my file created with pg_dump (pg_dump -Fc -b -f /tmp/dump.sql db_dev) includes CREATE EXTENSION IF NOT EXISTS unaccent WITH SCHEMA public; DROP EXTENSION unaccent; Why does pg_dump add the drop?
  • Erwin Brandstetter
    Erwin Brandstetter almost 2 years
    @phlegx: Not sure why. But if the drop comes after the create, that would explain the problem.