What effects does using a binary collation have?


Solution 1

A binary collation compares strings exactly as strcmp() in C would: two strings are unequal if any of their characters differ, even if the only difference is case or a diacritic. The downside is that the resulting sort order is not natural.

An example of an unnatural ("binary") sort order is A, B, a, b. A natural sort order would in this case be, for example, A, a, B, b (the lowercase and uppercase variants of the same letter are sorted next to each other).
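The difference can be reproduced outside the database. The following is a minimal Python sketch, not MySQL's actual implementation: sorting by encoded bytes stands in for a binary collation, and sorting by case-folded text stands in for a "natural" one. The tie-break that puts the uppercase letter first is an assumption made only to get a deterministic result.

```python
# Sketch only: Python stand-ins for the two orderings, not MySQL's own code.
letters = ["a", "B", "A", "b"]

# "Binary" order: compare the encoded byte values of each string.
binary_order = sorted(letters, key=lambda s: s.encode("utf-8"))

# "Natural" order (approximation): compare case-folded text first, then
# fall back to the original string so that ties are broken deterministically.
natural_order = sorted(letters, key=lambda s: (s.casefold(), s))

print(binary_order)   # ['A', 'B', 'a', 'b']
print(natural_order)  # ['A', 'a', 'B', 'b']
```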

The practical advantage of a binary collation is speed, because the string comparison is very simple. In the general case, an index on a binary-collated column may not return rows in the order you expect when sorting, but it can be useful for exact matches.

Solution 2

utf8_bin: Compares strings by the binary value of each character in the string.

utf8_general_ci: Compares strings using general language rules and case-insensitive comparisons.

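utf8_general_cs: Compares strings using general language rules and case-sensitive comparisons. (Stock MySQL builds do not appear to ship a collation with this name, so check the output of SHOW COLLATION before relying on it.)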

For example, the following comparisons evaluate to true with either of the utf8_general collations, but not with the utf8_bin collation:

Ä = A
Ö = O
Ü = U

With the utf8_general_ci collation, they also return true when the case differs.

Source: http://www.phpbuilder.com/board/showpost.php?s=2e642ac7dc5fceca2dbca1e2b9c424fd&p=10820221&postcount=2
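To make the Ä = A behaviour concrete, here is a rough Python sketch. It only approximates an accent- and case-insensitive collation by stripping combining marks and case-folding; it is not MySQL's algorithm, and the helper names fold, equal_general_ci and equal_bin are hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Hypothetical helper: drop accents and case, roughly like a _ci collation."""
    # NFD splits "Ä" into "A" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Discard the combining marks, keeping only the base characters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def equal_general_ci(a: str, b: str) -> bool:
    """Approximation of an accent- and case-insensitive comparison."""
    return fold(a) == fold(b)


def equal_bin(a: str, b: str) -> bool:
    """Binary comparison: the encoded bytes must be identical."""
    return a.encode("utf-8") == b.encode("utf-8")


for left, right in [("Ä", "A"), ("Ö", "O"), ("Ü", "U"), ("ä", "A")]:
    print(left, right, equal_general_ci(left, right), equal_bin(left, right))
```

Under these assumptions every pair compares equal with the folded comparison and unequal with the byte comparison.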

Solution 3

The other answers explain the differences well.

A binary collation can be useful in some cases:

  • the column contains hexadecimal data such as password hashes
  • you are only interested in exact matches, not sorting
  • the column holds identifiers made up only of [a-z0-9_] characters (here you can even use it for sorting)
  • for some reason you store numbers in CHAR() or VARCHAR columns (such as telephone numbers)
  • ZIP codes
  • UUIDs
  • etc.

In all those cases you can save a few CPU cycles with a binary collation.

Solution 4

With utf8_general_ci, matches occur without taking case or accents into account. This can be a good thing when you need to run queries on words.

With utf8_bin, a match occurs only when the strings are exactly the same. Queries are faster this way.

Author by

Pekka


Updated on October 17, 2020

Comments

  • Pekka
    Pekka over 3 years

    While answering this question, I became uncertain about something that I didn't manage to find a sufficient answer to.

    What are the practical differences between using the binary utf8_bin and the case-insensitive utf8_general_ci collations?

    I can see three:

    1. The two have different sort orders; _bin's sort order is likely to put any umlauts at the end of the alphabet, because byte values are compared (right?)

    2. Searches in _bin are always case-sensitive

    3. No A = Ä equality in _bin

    Are there any other differences or side-effects to be aware of?
