How can I check that the trademark(™) character is set correctly in my Oracle database?

21,800

Solution 1

The character literal ™ you posted is not U+0099 (a control character), but U+2122 (TRADE MARK SIGN).

The Unicode spec defines U+0099 as follows:

0099;<control>;Cc;0;BN;;;;;N;;;;;

So, it doesn't even have a name and I haven't gone digging round the spec to find out what this character is for.

Decoding U+0099 in Windows does result in a trademark grapheme. I guess this is a bug.

The correct byte sequence for the TRADE MARK SIGN (U+2122) in UTF-8 is E2 84 A2.

Solution 2

Just for any future reference, because the author did not bother posting a fix. It's indeed an Informatica problem, what is needed:

  1. Change connection properties in odbc.ini on Informatica box, add "IANAAppCodePage=106" to the connections that need UTF8.
  2. Change connection properties in Informatica itself for the connection and add "Codepage=Utf-8" in "Connection Manager -> Connections -> Relational -> -> Edit"
Share:
21,800
Mike M. Lin
Author by

Mike M. Lin

In 2015, I was a product manager at KISSmetrics. Now I develop products to sell on Amazon. Before all that, I had a career in software since 2001 – mostly in enterprise business apps.

Updated on August 28, 2020

Comments

  • Mike M. Lin
    Mike M. Lin over 3 years

    How can I check that the trademark(™) character is set correctly in my Oracle database?

    I expect it to be stored using UTF-8 encoding.

    I have a value stored in a Salesforce.com field that looks like this from the GUI (notice the trademark character):

    Chuck Norris's Roundhouse Kick™
    

    I'm using Informatica to replicate it to an Oracle database. My database is set to use the AL32UTF8 encoding.

    How it shows up in SQL Developer

    When I query my table using SQL Developer, the trademark symbol shows up as a rectangle (black border, white fill).

    How it shows up in HTML

    When I export it from SQL Developer using the UTF-8 encoding into an HTML document, and open it in Chrome, the trademark symbol does not appear at all. When I open it in IE, the it appears as a rectangle again. In Firefox, it's a rectangle with 00 in the top half and 99 in the bottom half. All three browsers interpret the HTML doc using UTF-8.

    How it shows up in text editors

    Opening the same HTML doc in Notepad and Notepad++, the trademark symbol shows up as a rectangle. If I use the Hex Viewer plugin for Notepad++ I see the byte encoding is C2 99. That seems to be the correct encoding for the trademark symbol in UTF-8.

    When I open document in MS Write, the trademark character looks like this: ™.

    When I get the value programmatically

    Using Python, when I get the value from the database, the trademark character is replaced with '\xbf' -- the inverted question mark, but that character is not even properly encoded as far as I can tell because it's missing at least one leading byte (depending on the specific encoding)

    >>> import cx_Oracle
    >>> con = cx_Oracle.connect('username', 'password', 'db')
    >>> cur = con.cursor()
    >>> cur.execute('select * from trademark')
    <__builtin__.OracleCursor on <cx_Oracle.Connection to username@db>>
    >>> records = cur.fetchall()
    >>> records[0][0]
    "Chuck Norris's Roundhouse Kick\xbf"
    

    Ideally, I'd like to be able to validate the data stored in my Oracle database using all of the above methods. I'd settle for someone just validating that what I saw in the Hex Viewer was enough of a "test" ;)

  • Mike M. Lin
    Mike M. Lin over 12 years
    You're right. The value is not being encoded correctly when written to the database by Informatica. Writing the value from my own test program, I was able to retrieve the trademark symbol. We're looking into what went wrong in the write, or the handling of the value before the write.
  • McDowell
    McDowell over 12 years
    Since U+2122 encodes as 0x99 in windows-1252, I would guess something is using a default "ANSI" encoding during some transcoding operation.