UTF-8 Support, SQL Server 2012 and the UTF8String UDT

Creating a custom User-Defined Type via SQLCLR is not, in any way, going to get you a replacement for any native type. It is very handy for creating something to handle specialized data. But strings, even in a different encoding, are far from specialized. Going this route for your string data would destroy the usability of your system, not to mention its performance, as you wouldn't be able to use any built-in string functions.

If you were able to save anything on disk space, those gains would be erased by what you would lose in overall performance. Storing a UDT is done by serializing it to a VARBINARY. So in order to do any string comparison or sorting, outside of a "binary" / "ordinal" comparison, you would have to convert all of the other values, one by one, back to UTF-8 to then do the string compare that can account for linguistic differences. And that conversion would need to be done within the UDT. This means that, like the XML datatype, you would create the UDT to hold a particular value and then expose a method of that UDT that accepts a string parameter to do the comparison (e.g. Utf8String.Compare(alias.field1) or, if defining an operator for the type, Utf8String1 = Utf8String2, and have the = operator get the string in the UTF-8 encoding and then do the CompareInfo.Compare()).
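To make that concrete, here is a hypothetical sketch of what even a simple query would turn into (the [Utf8String] type comes from the sample code discussed below, but the exact column, table, and method names / signatures here are assumptions):

-- No native =, LIKE, SUBSTRING(), etc. for the UDT column, so every
-- comparison and sort goes through a method call (and a per-row conversion):
SELECT tbl.[SomeID]
FROM   dbo.[TableName] tbl
WHERE  tbl.[Utf8Field].Compare(N'résumé') = 0
ORDER  BY tbl.[Utf8Field].ToString();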

In addition to the above considerations, you also need to consider that passing values back and forth through the SQLCLR API has a cost, especially when using either NVARCHAR(MAX) or VARBINARY(MAX) as opposed to NVARCHAR(1 - 4000) and VARBINARY(1 - 4000) respectively (please do not confuse this distinction as implying anything about using SqlChars / SqlBytes vs SqlString / SqlBinary).
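For example, here is a minimal sketch of the T-SQL wrapper side of a SQLCLR function (all object names below are made up): it is the declared parameter and return types, not the .NET code, that determine whether the cheaper non-MAX path is used.

CREATE FUNCTION dbo.DoSomething (@Input NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME [MyAssembly].[MyNamespace.MyClass].[DoSomething];
-- Declaring @Input as NVARCHAR(MAX) here instead would incur the more
-- expensive MAX handling even for short values.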

Finally (at least in terms of using a UDT), please do not look past the fact that the UDT being inquired about is sample code. The only testing noted is purely functional; nothing around scalability or "lessons learned after working with this for a year". The functional test code is shown at the following CodePlex page and should be looked at before proceeding with this decision, as it gives a sense of how you would need to write your queries in order to interact with it (which is fine for a field or two, but not for most / all string fields):

http://msftengprodsamples.codeplex.com/SourceControl/latest#Kilimanjaro_Trunk/Programmability/CLR/UTF8String/Scripts/Test.sql

Given the number of persisted computed columns and indexes added, was any space really saved? ;-)


Where space (disk, memory, etc) is the concern, you have three options:

  1. If you are using SQL Server 2008 or newer, and are on Enterprise Edition, then you can enable Data Compression (see the sketch after this list). Data Compression can (but won't "always") compress Unicode data in NCHAR and NVARCHAR fields. The determining factors are:

    1. NCHAR(1 - 4000) and NVARCHAR(1 - 4000) use the Standard Compression Scheme for Unicode, but only starting in SQL Server 2008 R2, AND only for IN ROW data, not OVERFLOW! This appears to be better than the regular ROW / PAGE compression algorithm.
    2. NVARCHAR(MAX) and XML (and I guess also VARBINARY(MAX), TEXT, and NTEXT) data that is IN ROW (not off row in LOB or OVERFLOW pages) can be at least PAGE compressed, and maybe also ROW compressed (not sure about this last one).
    3. Any OFF ROW data, LOB or OVERFLOW = No Compression For You!
  2. If using a version older than 2008, or not on Enterprise Edition, you can have two fields: one VARCHAR and one NVARCHAR. For example, let's say you are storing URLs which are mostly base ASCII characters (values 0 - 127) and hence fit into VARCHAR, but sometimes contain Unicode characters. Your schema can include the following 3 fields:

      ...
      URLa VARCHAR(2048) NULL,
      URLu NVARCHAR(2048) NULL,
      URL AS (ISNULL(CONVERT(NVARCHAR(2048), [URLa]), [URLu])),
      CONSTRAINT [CK_TableName_OneUrlMax] CHECK (
                        ([URLa] IS NOT NULL OR [URLu] IS NOT NULL)
                    AND ([URLa] IS NULL OR [URLu] IS NULL))
    );
    

    In this model you only SELECT from the [URL] computed column. For inserting and updating, you determine which field to use by checking whether the conversion alters the incoming value, which has to be of the NVARCHAR type:

    INSERT INTO TableName (..., URLa, URLu)
    VALUES (...,
            -- CASE rather than IIF so this also works prior to SQL Server 2012:
            CASE WHEN CONVERT(VARCHAR(2048), @URL) = @URL THEN @URL ELSE NULL END,
            CASE WHEN CONVERT(VARCHAR(2048), @URL) = @URL THEN NULL ELSE @URL END
           );
    
  3. If you have fields that should only ever have characters that fit into a particular Code Page of an Extended ASCII character set, then just use VARCHAR.
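
Regarding option #1 above, here is a rough sketch of what enabling compression looks like (dbo.TableName is a placeholder; run the estimate against your own data first, since Unicode compression only kicks in for the eligible IN ROW values noted above):

-- Estimate first (Unicode compression is applied automatically as part
-- of ROW / PAGE compression for eligible IN ROW NVARCHAR values):
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'TableName',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'ROW';

ALTER TABLE dbo.TableName REBUILD WITH (DATA_COMPRESSION = ROW);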


P.S. Just to have this stated for clarity: the new _SC Collations that were introduced in SQL Server 2012 simply allow for:

  • the built-in functions to properly handle the Supplementary Characters / Surrogate Pairs, and
  • linguistic rules for Supplementary Characters that are used for ordering and comparisons

But, even without the new _SC Collations, you can still store any Unicode character into an XML or N-prefixed type, and retrieve it with no data loss. However, when using the older Collations (i.e. those with no version number in the name), all Supplementary Characters equate to each other. You need to use the _90 and _100 Collations, which at least get you binary / code point comparisons and sorting; they can't take linguistic rules into account since they have no particular mappings for the Supplementary Characters (and hence have no weights or normalization rules).

Try the following:

IF (N'𤪆' = N'𤪆') SELECT N'𤪆' AS [TheLiteral], NCHAR(150150) AS [Generated];
IF (N'𤪆' = N'𤪇') SELECT N'𤪇' AS [TheLiteral], NCHAR(150151) AS [Generated];
IF (N'𤪆' COLLATE Tatar_90_CI_AI = N'𤪇' COLLATE Tatar_90_CI_AI)
       SELECT N'𤪇 COLLATE Tatar_90_CI_AI' AS [TheLiteral], NCHAR(150151) AS [Generated];
IF (N'𤪆' = N'?') SELECT N'?';

In a DB having a default collation ending in _SC, only the first IF statement will return a result set, and the "Generated" field will show the characters correctly.

But, if the DB does not have a default collation ending in _SC, and the collation is not a _90 or _100 series collation, then the first two IF statements return result sets wherein the "Generated" field will return NULL, and the "TheLiteral" field shows up correctly.

For Unicode data, the Collation has no bearing on physical storage.
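
You can verify that last point with DATALENGTH (the specific collations below are just examples; any pair of old and _SC collations will do):

SELECT DATALENGTH(N'𤪆' COLLATE SQL_Latin1_General_CP1_CI_AS) AS [OldCollation],
       DATALENGTH(N'𤪆' COLLATE Latin1_General_100_CI_AS_SC)  AS [SCCollation];
-- Both return 4: one surrogate pair = two UTF-16 code units, stored
-- identically regardless of collation.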


UPDATE 2018-10-02

While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.
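
For reference, that support comes via new collations ending in _UTF8 rather than a new datatype; a minimal sketch, with made-up table and column names:

CREATE TABLE dbo.Utf8Test
(
    [Url] VARCHAR(2048) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

-- ASCII characters are stored as 1 byte each; the supplementary character
-- at the end requires 4 bytes in UTF-8:
INSERT INTO dbo.Utf8Test ([Url]) VALUES (N'https://example.com/𤪆');
SELECT [Url], DATALENGTH([Url]) AS [Bytes] FROM dbo.Utf8Test;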
