Understand Unicode and code pages in SQL Server collations

13,170

CHAR vs. NCHAR (ie. Non-Unicode vs. Unicode) defines the character storage encoding. Collations define... collation (ie. sort order and comparison rules). They are different concepts, although often confused.

The confusion stems from the fact that the client tools use the collation of non-Unicode data as hint to choose the code page of the data. See Code Page Architecture. This means that a client like ADO.Net SqlClient can properly encode the single-byte CHAR data received from the server as a multi-byte string .Net object. The column metadata will contain the collation used and so the client will know how to interpret the single-byte data according to a specific code page.

For Unicode (NCHAR) columns the client does not need to interpret the data according to a code page, the data itself is already multi-byte and the client will interpret it according to the UCS-2 encoding (the actual flavor of Unicode used by SQL Server).

However do not confuse this with what collations actually are: rules for comparing characters. As described in Working with Collations:

an English speaker would expect the character string 'Chiapas' to come before 'Colima' in ascending order. However, a Spanish speaker in Mexico might expect words beginning with 'Ch' to appear at the end of a list of words starting with 'C'. Collations dictate these kinds of sorting and comparison rules. The Latin_1 General collation will sort 'Chiapas' before 'Colima' in an ORDER BY ASC clause, whereas the Traditional_Spanish collation will sort 'Chiapas' after 'Colima'.

This sorting rule applies to any data type (CHAR non-Unicode or NCHAR Unicode).

Share:
13,170
Phil
Author by

Phil

Updated on June 13, 2022

Comments

  • Phil
    Phil almost 2 years

    Why all SQL Server 2008 R2 collations are associated to a code page. Are all collations unicode ?

    How to choose a collation when our database is used by several languages using differents code pages ?

    Thank you.