Unicode to Non-Unicode conversion

16,018

Solution 1

There are a few things to note here:

  1. If you want to see exactly which character is there, you can convert the value to VARBINARY which will give you the hex / binary value of all characters in the string and there is no concept of "hidden" characters in hex:

    DECLARE @PostalCode NVARCHAR(20);
    SET @PostalCode = N'053000'+ NCHAR(0x2008); -- 0x2008 = "Punctuation Space"
    SELECT @PostalCode AS [NVarCharValue],
           CONVERT(VARCHAR(20), @PostalCode) AS [VarCharValue],
           CONVERT(VARCHAR(20), RTRIM(@PostalCode)) AS [RTrimmedVarCharValue],
           CONVERT(VARBINARY(20), @PostalCode) AS [VarBinaryValue];
    

    Returns:

    NVarCharValue   VarCharValue   RTrimmedVarCharValue   VarBinaryValue
    053000          053000?        053000?                0x3000350033003000300030000820
    

    NVARCHAR data is stored as UTF-16 which works in 2-byte sets. Looking at the last 4 hex digits to see what the hidden 2-byte set is, we see "0820". Since Windows and SQL Server are UTF-16 Little Endian (i.e. UTF-16LE), the bytes are in reverse order. Flipping the final 2 bytes -- 08 and 20 -- we get "2008", which is the "Punctuation Space" that we added via NCHAR(0x2008).

    Also, please note that RTRIM did not help at all here.

  2. Simplistically, you can just replace the question marks with nothing:

    SELECT REPLACE(CONVERT(VARCHAR(20), [PostalCode]), '?', '');
    
  3. More importantly, you should convert the [PostalCode] field to VARCHAR so that it doesn't store these characters. No country uses letters that are not represented in the ASCII character set and that are not valid for the VARCHAR datatype, at least as far as I have ever read about (see bottom section for references). In fact, what is allowed is a rather small subset of ASCII, which means you can easily filter on the way in (or just do the same REPLACE as shown above when inserting or updating):

    ALTER TABLE [table] ALTER COLUMN [PostalCode] VARCHAR(20) [NOT]? NULL;
    

    Be sure to check the current NULL / NOT NULL setting for the column and make it the same in the ALTER statement above, else it could be changed as the default is NULL if not specified.

  4. If you cannot change the schema of the table and need to do a periodic "cleansing" of the bad data, you can run the following:

    ;WITH cte AS
    (
       SELECT *
       FROM   TableName
       WHERE  [PostalCode] <>
                      CONVERT(NVARCHAR(50), CONVERT(VARCHAR(50), [PostalCode]))
    )
    UPDATE cte
    SET    cte.[PostalCode] = REPLACE(CONVERT(VARCHAR(50), [PostalCode]), '?', '');
    

    Please keep in mind that the above query is not meant to work efficiently if the table has millions of rows. At that point it would need to be handled in smaller sets via a loop.


For reference, here is the wikipedia article for Postal code, which currently states that the only characters ever used are:

  • The arabic numerals "0" to "9"
  • Letters of the ISO basic Latin alphabet
  • Spaces, hyphens

And regarding the max size of the field, here is the Wikipedia List of postal codes

Solution 2

As long as "?" is not allowed in a real PostalCode value, you can cast first, and then remove these characters using REPLACE to replace them with the empty string:

replace(cast((PostalCode) as varchar))), '?', '')

Warning

These '?' characters represent Unicode characters in the original nvarchar value that were not converted to an equivalent ASCII varchar character. This means that this method will silently remove any such characters. You said you wanted to simply remove these characters, but you might want to rethink that.

To give an example, if the postal code can contain letters, and someone accidentally typed a letter with an accent:

1234-ÁBCD

The end result would be:

1234-BCD

Share:
16,018

Related videos on Youtube

Rameshwar Pawale
Author by

Rameshwar Pawale

Updated on June 04, 2022

Comments

  • Rameshwar Pawale
    Rameshwar Pawale almost 2 years

    I have some Unicode characters in an NVarchar field named "PostalCode". When I convert them to Varchar, there is a ? in the result.

    My code is:

    select PostalCode, cast((PostalCode) as varchar)) as val from  table
    

    and the result is:

    PostalCode       |   val
    053000           | 053000?
    

    Here I am getting a ? in the result. Is there any way to remove such special characters?

    • Panagiotis Kanavos
      Panagiotis Kanavos over 9 years
      Why are you trying to convert Unicode? It's better to use Unicode throughout an application. Anyway ? means there are characters in the input string that can't be represented in the target codepage. Does your input contain invisible characters after the last number? I suspect LEN(PostalCode) will return 7 or higher.
    • Dan Getz
      Dan Getz over 9 years
      You can probably expect all postal codes to be expressible in [0-9][A-Z]. The one thing you might worry about is if the postal code can contain letters, and someone accidentally typed a letter with an accent. For example, if you process these by dropping non-ASCII characters, "1234-ÁBCD" would not indicate an error, and simply become "1234-BCD" instead of "1234-ABCD".
  • user
    user over 9 years
    Except the original value doesn't appear to show anything that the conversion output doesn't include. It looks like the conversion is just adding a question mark (whatever that question mark might represent).
  • Panagiotis Kanavos
    Panagiotis Kanavos over 9 years
    The original value probably contains garbage non-visible characters after the last 0, otherwise it could be converted to any codepage. Numbers and the space character are available in all codepages
  • rory.ap
    rory.ap over 9 years
    @MichaelKjörling Maybe there's a unicode character in the table that wasn't carried over in the copy/paste to this post.
  • Panagiotis Kanavos
    Panagiotis Kanavos over 9 years
    @MichaelKjörling Unicode contain single characters like ... or , (yes, with the space). There are also at least 16 different spaces, joiners etc (U+2000 to U+200F)
  • Rameshwar Pawale
    Rameshwar Pawale over 9 years
    @roryap - there is a space after 0 in original unicode value.
  • rory.ap
    rory.ap over 9 years
    @RameshwarPawale -- Are you sure it was U+0020 (regular space) and not U+00A0 (non-breaking space) or U+2002 ("en-space") or U+2003 ("em-space") or one of the others? That's what I'm getting at. BTW, it seems like it's not possible to copy and paste those actual spaces into SO -- they get converted into U+0020, which is why in your post it looks like a normal space.
  • rory.ap
    rory.ap over 9 years
    This assumes that the OP has control over the DB schema.
  • Solomon Rutzky
    Solomon Rutzky over 9 years
    @roryap . Understood. However, it is just a recommendation. I did provide a way to accomplish the question, though I can make an update to account for having no control over the schema.
  • Jonathan Leffler
    Jonathan Leffler over 9 years
    When you say "No country allows for Unicode characters in their postal codes", you are making a sweeping statement that isn't 100% accurate, since the Latin alphabet and Arabic digits are perfectly valid Unicode characters. You mean something more like "No country allows for Unicode characters outside the range U+0020 .. U+007E (equivalent to ASCII or the printing characters in the lower half of ISO 8859-1) in their postal codes". I'm not sure how to express that compactly; maybe "No country allows for Unicode characters that are not also ASCII printing characters in the postal codes".
  • Solomon Rutzky
    Solomon Rutzky over 9 years
    @JonathanLeffler Thanks for commenting. That is why I copied the text of the Wikipedia article stating "ISO basic Latin alphabet". So maybe "no country uses letters that are not represented in the ASCII character set and that are not valid for the VARCHAR datatype"? Shouldn't need to mention printability as the only other characters are 0 - 9, space, and hyphen.
  • Solomon Rutzky
    Solomon Rutzky over 9 years
    What if it is multiple characters? And/or what if they are different Unicode characters?
  • Dan Getz
    Dan Getz over 9 years
    Then my second method is not the right thing to do. (But it might still be worth trying RTRIM to see if it works.) This is an answer to one interpretation of the original question, "how do I get rid of only this particular character on the end".
  • Jonathan Leffler
    Jonathan Leffler over 9 years
    Yes, I know it's at the bottom, but maybe a '(see below)' or something similar. As it stands, your statement in item 2 is a contradiction; the postal code characters are Unicode characters; they are just particularly common/simple ones.
  • Solomon Rutzky
    Solomon Rutzky over 9 years
    @JonathanLeffler Understood. And I agree that statements should be as clear as possible, so I have updated #2 to do both things: a) updated the text, and b) referenced the bottom section. Thanks again.