What does collation mean?

124,684

Solution 1

Collation can be simply thought of as sort order.

In English (and it's strange cousin, American), collation may be a pretty simple matter consisting of ordering by the ASCII code.

Once you get into those strange European languages with all their accents and other features, collation changes. For example, though the different accented forms of a may exist at disparate code points, they may all need to be sorted as if they were the same letter.

Solution 2

Besides the "accented letters are sorted differently than unaccented ones" in some Western European languages, you must take into account the groups of letters, which sometimes are sorted differently, also.

Traditionally, in Spanish, "ch" was considered a letter in its own right, same with "ll" (both of which represent a single phoneme), so a list would get sorted like this:

  • caballo
  • cinco
  • coche
  • charco
  • chocolate
  • chueco
  • dado
  • (...)
  • lámpara
  • luego
  • llanta
  • lluvia
  • madera

Notice all the words starting with single c go together, except words starting with ch which go after them, same with ll-starting words which go after all the words starting with a single l. This is the ordering you'll see in old dictionaries and encyclopedias, sometimes even today by very conservative organizations.

The Royal Academy of the Language changed this to make it easier for Spanish to be accomodated in the computing world. Nevertheless, ñ is still considered a different letter than n and goes after it, and before o. So this is a correctly ordered list:

  • Namibia
  • número
  • ñandú
  • ñú
  • obra
  • ojo

By selecting the correct collation, you get all this done for you, automatically :-)

Solution 3

Rules that tell how to compare and sort strings: letters order; whether case matters, whether diacritics matter etc.

For instance, if you want all letters to be different (say, if you store filenames in UNIX), you use UTF8_BIN collation:

SELECT  'A' COLLATE UTF8_BIN = 'a' COLLATE UTF8_BIN

---
0

If you want to ignore case and diacritics differences (say, for a search engine), you use UTF8_GENERAL_CI collation:

SELECT  'A' COLLATE UTF8_GENERAL_CI = 'ä' COLLATE UTF8_GENERAL_CI

---
1

As you can see, this collation (comparison rule) considers capital A and lowecase ä the same letter, ignoring case and diacritic differences.

Solution 4

Collation defines how you sort and compare string values

For example, it defines how to deal with

  • accents (äàa etc)
  • case (Aa)
  • the language context:
    • In a French collation, cote < côte < coté < côté.
    • In the SQL Server Latin1 default , cote < coté < côte < côté
  • ASCII sorts (a binary collation)

Solution 5

Collation means assigning some order to the characters in an Alphabet, say, ASCII or Unicode etc.

Suppose you have 3 characters in your alphabet - {A,B,C}. You can define some example collations for it by assigning integral values to the characters

  1. Example 1 = {A=1,B=2,C=3}
  2. Example 2 = {C=1,B=2,A=3}
  3. Example 3 = {B=1,C=2,A=3}

As a matter of fact, you can define n! collations on an Alphabet of size n. Given such an order, different sorting routines likes LSD/MSD string sorts make use of it for sorting strings.

Share:
124,684
LOVE_KING
Author by

LOVE_KING

*WORKING AS A DB ADMIN,AND I'M NEW TO MYSQL*

Updated on November 14, 2020

Comments

  • LOVE_KING
    LOVE_KING over 3 years

    What does collation mean in SQL, and what does it do?

  • T.J. Crowder
    T.J. Crowder over 13 years
  • Piskvor left the building
    Piskvor left the building over 13 years
    Different accents usually mean that they need to be sorted as if they were different letters - e.g. e,ë,é,ě, and è might be distinct letters for the purposes of ordering (but possibly the same letter when searching, damn those crazy Europeans ;)).
  • Quassnoi
    Quassnoi over 13 years
    @Piskvor: aren't coordinates in your info pointing to a certain country using an alphabet with 42 letters, 15 of them having diacritics?
  • Piskvor left the building
    Piskvor left the building over 13 years
    @Quassnoi: Well yes, and? Am I implying anywhere that I'm not crazy? ;o) (although thank God for Unicode, I'd be stark raving mad if I still had to work with nation-specific charsets)
  • Quassnoi
    Quassnoi over 13 years
    @Piskvor: sorry, sorry, you're perfectly crazy! :)
  • Joe Pineda
    Joe Pineda over 13 years
    This article really explains it all.
  • C.B.
    C.B. about 11 years
    OK, so a collation is quite useful for correct sorting, but is it still useful for comparisons? E.g., i'd like to have 'ñandú' and 'nandu' come up as equals, for practical reasons... Could the collation mechanism help me with this task?
  • Joe Pineda
    Joe Pineda about 11 years
    @C.B.: If you mean something like selecting all rows where a column's value equals 'Aéreo' and SQL S. returns rows with 'aereo', 'aereó', 'AerEO' and the like - yes, setting a specific collation for just a query can do that. You'll have performance issues, though, if this collation's very different from the database's native one. And if you use temporary tables, you must also keep track of tempdb's collation... But you do can do it.
  • C.B.
    C.B. about 11 years
    In that case, could you please explain a little bit more? Particularly, would the straightforward query "SELECT word FROM test WHERE word LIKE 'nandu'" be able to do that? And which collation should i use to have it work? (Note that i'm concerned about diacritic marks, and not just accents...)
  • Dustin Kendall
    Dustin Kendall over 10 years
    @C.B. take a look at the 'AI' or 'AS' part of the collation (Accent Insensitive and Accent Sensitive).
  • cfwschmidt
    cfwschmidt about 9 years
    @Joe Pineda. Maybe it does really explain it all, but does the reader want to know it all? Or is it possible the reader simply wants a concise, reliable answer that covers the most fundamental and commonly used elements and functions of collation so he/she can read it quickly and resume working on whatever task, assignment or project gave rise to the question to begin with?
  • Admin
    Admin over 7 years
    @T.J.Crowder I appreciate the irony in this
  • Royi Namir
    Royi Namir almost 5 years
    Is it only for sort or also for where st= 'aaa' ?
  • Ruben9922
    Ruben9922 over 4 years
    What I found confusing is that you could sort so that the different accented forms of a are adjacent, but still treated as separate letters. For example: aa, ac, åb, åd, bc, xz (a and å are adjacent but treated as separate letters) vs. aa, åb, ac, åd, bc, xz (a and å treated as same letter).
  • Sonny D
    Sonny D over 3 years
    just for note: The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8. dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html