Storing UTF-16/Unicode data in SQL Server


Solution 1

SQL Server 2012 supports UTF-16, including surrogate pairs, when a supplementary-character (_SC) collation is used. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".

So one fix for the original problem is to adopt SQL Server 2012.

Solution 2

The string functions work fine with Unicode character strings: the ones that care about the number of characters treat a two-byte character as a single character, not two. The only ones to watch for are len() and datalength(), which return different values for Unicode data. Both values are correct: len() returns the length in characters, and datalength() returns the length in bytes. They differ because each character occupies two bytes.

So, as long as you use the proper functions in your code, everything should work transparently.
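To make the len() versus datalength() distinction concrete outside the database, here is a minimal Python sketch. It assumes the text is encoded as UTF-16 little-endian without a byte order mark, and the helper name `utf16_lengths` is hypothetical. It mirrors the idea only and does not reproduce SQL Server's exact behavior (for instance, len() in SQL Server also ignores trailing spaces).

```python
from typing import Tuple


def utf16_lengths(text: str) -> Tuple[int, int]:
    """Return (16-bit code units, bytes) for text encoded as UTF-16LE."""
    encoded = text.encode("utf-16-le")  # little-endian, no BOM
    return len(encoded) // 2, len(encoded)


# Every character here is in the Basic Multilingual Plane,
# so each one takes exactly one 16-bit code unit (two bytes).
units, size_in_bytes = utf16_lengths("h\u00e9llo")
print(units, size_in_bytes)  # 5 10
```

For text limited to the Basic Multilingual Plane, the character count is always half the byte count, which is why the two functions disagree by a factor of two.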

EDIT: Just double-checked Books Online: Unicode data has worked seamlessly with string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set, because they do not parse surrogate pairs, which encode characters outside plane 0 (in other words, SQL Server's string functions only recognize up to 2 bytes per character). SQL Server will store and return the data correctly; however, any string function that relies on character counts will not return the expected values. The most common ways to work around this seem to be either processing the string outside SQL Server, or using CLR integration to add Unicode-aware string processing functions.
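The mismatch described in EDIT 2 can be sketched in Python, which counts code points rather than 16-bit code units. This is an illustration under stated assumptions, not SQL Server's implementation: the helper names `utf16_code_units` and `split_surrogates` are hypothetical, and U+1F680 is used only as a sample supplementary character.

```python
from typing import Tuple


def utf16_code_units(text: str) -> int:
    """Count 16-bit code units, which is what a UCS-2-style length reports."""
    return len(text.encode("utf-16-le")) // 2


def split_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


sample = "\U0001F680"  # one character outside plane 0

print(len(sample))               # code points: 1
print(utf16_code_units(sample))  # 16-bit code units: 2
print([hex(u) for u in split_surrogates(0x1F680)])  # ['0xd83d', '0xde80']
```

A function that counts code units reports 2 for this one character, which is the discrepancy the comments describe for non-_SC collations.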

Author by

Admin

Updated on July 18, 2022

Comments

  • Admin
    Admin almost 2 years

    According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with appropriate data types, nchar etc), however if there is a supplementary character this is stored as 2 UCS-2 characters.

    This brings the obvious issues with the string functions, namely that what is one character is treated as 2 by SQL Server.

    I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

    Aside from the functions suggested in the article, are there any suggestions on the best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5?

  • Admin
    Admin almost 15 years
    You have misunderstood the question. UTF-16 allows for supplementary characters. This works by storing a single character (from the user's perspective) in 2 code units, i.e. 4 bytes. UCS-2 does not handle supplementary characters. Hence the 4 bytes are treated as two characters by SQL Server when in fact they are one character.
  • Emmanuel Tabard
    Emmanuel Tabard almost 15 years
    That's only for characters outside the standard defined languages. The whitepaper states this is primarily for historical languages.
  • Admin
    Admin almost 15 years
    Comment on the edit: SQL Server works fine on UCS-2 Unicode data. UCS-2 is a deprecated standard; Windows has used UTF-16 internally since Win2K.
  • Admin
    Admin almost 15 years
    Sure. But to offer Unicode 3.1 support, the full character set should be supported.
  • Admin
    Admin almost 15 years
    Yes, but it does not support the full unicode character set.
  • Triynko
    Triynko about 14 years
    I suspect the reason it sticks with UCS-2 rather than UTF-16 is that UCS-2 limits itself to two bytes in length, but is otherwise identical to UTF-16. This gives UCS-2 a high degree of compatibility with UTF-16, while also offering size consistency that makes the maximum sizes of char(8000 bytes) and nchar(4000 bytes) easier to enforce. Despite any justifications for sticking with UCS-2 over UTF-16, it indeed does NOT support surrogate pairs and therefore does not support the full Unicode character set, and that really really sucks.
  • Triynko
    Triynko about 14 years
    CLR integration will not fix this. If, in fact, a .NET string stores UTF-16 data and SQL server stores UCS-2 data, then the types are ultimately not fully compatible. In other words, if a UTF-16 string (with 4-byte characters) goes into SQL Server and comes back out unscathed, then the decoding process must be incorrect or overly complex and inconsistent. The only legitimate work-arounds are (1) stripping UTF-16 strings of incompatible characters or (2) reading and writing the string's original bytes and processing it as a string only outside of SQL Server.
  • Concrete Gannet
    Concrete Gannet over 11 years
    I want to add my voice to the comments: this answer is wrong and misleading. SQL Server only supports two-byte characters. UTF-16 has some four byte characters.
  • Concrete Gannet
    Concrete Gannet over 11 years
    Hi boomhauer, the question was about Microsoft SQL Server. Your answer may be useful somewhere else.
  • Brady Moritz
    Brady Moritz over 11 years
    wow... something happened here. did i post to the wrong question? I almost wonder if SO screwed this up, since it's been around since feb 2010...
  • Brady Moritz
    Brady Moritz over 11 years
    in fact, i KNOW this answer used to be on another question!
  • Solomon Rutzky
    Solomon Rutzky over 8 years
    While true that SQL Server 2012 introduced the _SC collations which have proper handling of Supplementary Characters, the Question is very specific about pertaining to SQL Server 2005. Also, it is not "UTF-16 + surrogate pairs" since UTF-16 = "UCS-2 + surrogate pairs".
  • Solomon Rutzky
    Solomon Rutzky over 8 years
    @Triynko Do not confuse "storage" with "interpretation". The storage of UCS-2 and UTF-16 are identical since everything is in 2-byte blocks, and Supplementary Characters just happen to be two of those 2-byte blocks in a specific combination. Hence, SQL Server and .NET both store UTF-16 code points. So there is nothing wrong with the decoding process and it is not overly complex or inconsistent. Stripping out surrogate pairs would be needless data loss. And regarding proper handling of built-in string functions, Combining Characters also have "issues" ;-).
  • Solomon Rutzky
    Solomon Rutzky over 8 years
    @ConcreteGannet With respect to only EDIT 2 (which at this point should be the only text of the answer, or at least everything prior should be wrapped in a <del> tag), this answer is not incorrect or misleading. In fact, it is your comment that is incorrect. SQL Server does support the UTF-16 encoding, it is just limited. And UTF-16 contains about as many 4-byte characters as 2-byte characters. Over time that will change as there are less than 63,000 addressable two-byte characters and over 1 million addressable four-byte characters (approx 60k are currently mapped).
  • Concrete Gannet
    Concrete Gannet over 8 years
    @srutzky, yes, you could store and retrieve UTF-16 characters, but to me "support" should mean the character string functions in SQL Server interpret all of UTF-16 correctly too. That was improved in SQL Server 2012.
  • Solomon Rutzky
    Solomon Rutzky over 8 years
    @ConcreteGannet "Support" is a spectrum. There is support for UTF-16 in non-_SC collations, it's just very limited. BUT, none of the collations "properly" handle combining characters that are valid UCS-2 / BMP Code Points. For example:

        DECLARE @Test NVARCHAR(10);
        SET @Test = N'te' + NCHAR(0x0301) + N'st';
        SELECT NCHAR(55357) + NCHAR(56960) AS [WorksInAnyCollation],
               NCHAR(128640) AS [OnlyWorksIn_SC_Collations],
               @Test AS [TestValue],
               LEN(@Test) AS [Length],
               RIGHT(@Test, 3) AS [Oops];

    The two NCHARs get you a proper Supplementary Character. And the question isn't asking about ideal support.
  • Concrete Gannet
    Concrete Gannet almost 6 years
    @SolomonRutzky, yes, that's why I said "including"
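
The numbers in Solomon Rutzky's T-SQL example can be checked with a short Python sketch. It is a rough illustration only: the helper name `combine_surrogates` is hypothetical, and Python slicing by code point stands in for RIGHT(), which is an assumption about equivalent behavior rather than a claim about SQL Server internals.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# NCHAR(55357) + NCHAR(56960) is the same character as NCHAR(128640).
print(combine_surrogates(55357, 56960))  # 128640

# 'te' + combining acute accent (U+0301) + 'st' displays as four
# characters but holds five code points, all inside the BMP.
test_value = "te" + "\u0301" + "st"
print(len(test_value))  # 5

# Taking the last three code points separates the accent from its 'e'.
print(test_value[-3:] == "\u0301st")  # True
```

The last check shows the "Oops" case: the combining accent is cut away from its base letter even though no supplementary characters are involved.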