How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?


Solution 1

Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:

CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
    encoded  VARCHAR2(32767);
    TYPE charset_t IS VARRAY(3) OF VARCHAR2(15);
    charsets charset_t := charset_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
    FOR i IN 1..charsets.count LOOP
        -- Assume the value is already UTF-8; otherwise convert it to UTF-8
        -- from the candidate charset.
        encoded := CASE charsets(i)
            WHEN 'AL32UTF8' THEN string
            ELSE CONVERT(string, 'AL32UTF8', charsets(i))
        END;
        -- raw_to_char() replaces invalid byte sequences with U+FFFD, which
        -- is EF BF BD in UTF-8; if that sequence is absent, the result is
        -- valid UTF-8 and we are done.
        IF instr(
            rawtohex(
                utl_raw.cast_to_raw(
                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
                )
            ),
            'EFBFBD'
        ) = 0 THEN
            RETURN encoded;
        END IF;
    END LOOP;
    -- No candidate charset produced valid UTF-8.
    RAISE VALUE_ERROR;
END;

Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.
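With the function in place, the cleanup itself can be a single UPDATE. A sketch, with hypothetical table and column names (`my_table`, `my_column`); the WHERE clause reuses the same replacement-character test so that only broken rows are touched:

```sql
-- Hypothetical cleanup pass; my_table and my_column are placeholders.
-- Rows that no candidate charset can fix raise VALUE_ERROR, so consider
-- running a SELECT with the same predicate first to eyeball the damage.
UPDATE my_table
   SET my_column = reencode(my_column)
 WHERE INSTR(
           RAWTOHEX(
               UTL_RAW.CAST_TO_RAW(
                   UTL_I18N.RAW_TO_CHAR(UTL_RAW.CAST_TO_RAW(my_column), 'utf8')
               )
           ),
           'EFBFBD'
       ) > 0;
```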

Solution 2

to check whether your database column contains invalid utf-8 use the following query:

 select CASE
            INSTR (
                  RAWTOHEX (
                      utl_raw.cast_to_raw (
                          utl_i18n.raw_to_char (
                                utl_raw.cast_to_raw ( <your_column> )
                              , 'utf8'
                          )
                      )
                  )
                , 'EFBFBD'
            )
        WHEN 0 THEN 'OK'
        ELSE 'FAIL' 
        END
   from <your_table>
      ;

given that your db charset is al32utf8.

note that EF BF BD is the utf-8 encoding of U+FFFD, the replacement character that raw_to_char substitutes for every illegal byte sequence it encounters.

as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.
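as an illustration of that ambiguity, the single byte A4 converts successfully from either of two byte-oriented charsets, but to different code points (a sketch; WE8ISO8859P15 is oracle's name for latin-15, the variant i used in the A4 example in the comments):

```sql
-- the byte A4 decodes cleanly under both charsets, but to different
-- characters, so a successful conversion proves nothing about the source.
SELECT UTL_I18N.RAW_TO_CHAR(HEXTORAW('A4'), 'WE8MSWIN1252')  AS from_cp1252,   -- CURRENCY SIGN, U+00A4
       UTL_I18N.RAW_TO_CHAR(HEXTORAW('A4'), 'WE8ISO8859P15') AS from_latin15   -- EURO SIGN, U+20AC
  FROM dual;
```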

best regards, carsten

ps: oracle names for these charsets: CP1252 -> WE8MSWIN1252, LATIN-1 -> WE8ISO8859P1

Author: theory
Updated on July 11, 2022

Comments

  • theory
    theory almost 2 years

For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database (where the character encoding is set to AL32UTF8) that contains some non-UTF-8 values. The values are always in one of these character sets:

    • US-ASCII
    • UTF-8
    • CP1252
    • Latin-1

    I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.

    However, all I can find for this is the CONVERT function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.

    Therefore, I have two questions:

    1. Is there some existing interface that tries to convert a string into one of a list of encodings, returning the first that succeeds?
    2. And if not, is there some other interface that indicates failure if it's not able to convert a string to an encoding? If so, then I could write the previous function.

    UPDATE:

    For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:

    CREATE OR REPLACE FUNCTION encoding_utf8(
        bytea
    ) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
    DECLARE
        encoding TEXT;
    BEGIN
        FOREACH encoding IN ARRAY ARRAY[
            'UTF8',
            'WIN1252',
            'LATIN1'
        ] LOOP
            BEGIN
                RETURN convert_from($1, encoding);
            EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
                CONTINUE;
            END;
        END LOOP;
    END;
    $$;
    

    I'd dearly love to know how to do the equivalent in Oracle.
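    The CONVERT behavior described above can be seen directly. A sketch: converting a string containing a code point that the target charset lacks yields the target charset's default replacement character instead of raising an error, so there is nothing to catch.

    ```sql
    -- U+20AC (EURO SIGN) does not exist in Latin-1; CONVERT silently
    -- substitutes the target charset's default replacement character
    -- rather than failing.
    SELECT CONVERT(UNISTR('\20AC'), 'WE8ISO8859P1', 'AL32UTF8') FROM dual;
    ```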

  • theory
    theory over 11 years
    Yeah, we don’t know what the original character set was, so I just want to get the values UTF-8 clean. Based on your suggestion, as well as an initial implementation by a co-worker, I've come up with a function I think is pretty close to what I need. I will post it in a separate answer.
  • collapsar
    collapsar over 11 years
    what happens in your code is that you first try to convert your input data to al32utf8 by calling 'convert', thereafter checking whether the operation succeeded. however, for byte-oriented character sets - which cp1252 happens to be, as each character is encoded in exactly 1 byte - conversion to unicode will never fail. thus your check will succeed and the function reencode will exit. please note that it is impossible to distinguish between byte-encoded source charsets by means of successful conversion to unicode - you need contextual information to do that. regards.
  • collapsar
    collapsar over 11 years
    (cont'd). 1.) technically, my statement only applies to (byte-encoded) charsets whose glyphs are incorporated into unicode. i am not aware of any charset that wouldn't meet this criterion (hints appreciated). 2.) to identify the source charset, you may 2a.) in the particular case of latin-1 vs cp1252 check for bytes that are not mapped to glyphs in latin-1 (0x7f-0x9f) or 2b.) in general check for sequences instead of individual chars. example: A4 -> EURO (latin-15) / CURRENCY (cp1252). the latter won't occur after numbers in ordinary texts, so <numbers> A4 would indicate latin-15.
  • theory
    theory over 11 years
    Alas, I have no contextual information, and so am just cleaning up some old stuff. A few thousand converted to CP1252 out of over a billion records is something we can live with.
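collapsar's heuristic 2a above can be sketched as a query (hypothetical table and column names). Bytes 80-9F are unassigned in Latin-1 but mapped to glyphs in CP1252, so their presence in a value already known to be invalid UTF-8 points at CP1252. Note the caveat: valid multi-byte UTF-8 also contains continuation bytes in this range, so the test is only meaningful on values that have already failed the UTF-8 check.

```sql
-- Sketch of heuristic 2a; my_table and my_column are placeholders.
-- The anchored pattern keeps the hex digits grouped into byte-aligned
-- pairs; a match on a byte in 80-9F suggests CP1252 over Latin-1.
SELECT my_column
  FROM my_table
 WHERE REGEXP_LIKE(
           RAWTOHEX(UTL_RAW.CAST_TO_RAW(my_column)),
           '^([0-9A-F]{2})*[89][0-9A-F]'
       );
```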