How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?
Solution 1
Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:
CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2
AS
    encoded VARCHAR2(32767);
    TYPE array_t IS VARRAY(3) OF VARCHAR2(15);
    array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');
BEGIN
    FOR i IN 1 .. array.count LOOP
        encoded := CASE array(i)
            WHEN 'AL32UTF8' THEN string
            ELSE CONVERT(string, 'AL32UTF8', array(i))
        END;
        IF instr(
            rawtohex(
                utl_raw.cast_to_raw(
                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')
                )
            ),
            'EFBFBD'
        ) = 0 THEN
            RETURN encoded;
        END IF;
    END LOOP;
    RAISE VALUE_ERROR;
END;
Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.
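For readers who want to experiment with the same first-clean-decode logic outside Oracle, here is an illustrative Python sketch (not part of the PL/SQL above; the function name and encoding list simply mirror it). It decodes with replacement and accepts the first candidate encoding that introduces no U+FFFD, which is exactly the check the PL/SQL performs via the EFBFBD scan:

```python
# Illustrative Python analogue of the PL/SQL reencode() above:
# decode with replacement, then accept the first candidate encoding
# whose output contains no U+FFFD (EF BF BD in UTF-8).
REPLACEMENT = "\ufffd"

def reencode(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    for enc in encodings:
        decoded = raw.decode(enc, errors="replace")
        if REPLACEMENT not in decoded:
            return decoded  # first clean conversion wins
    raise ValueError("no candidate encoding produced a clean result")
```

For example, `b"r\xe9sum\xe9"` is malformed UTF-8 (so the first pass yields replacement characters), but decodes cleanly as CP1252.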
Solution 2
To check whether your database column contains invalid UTF-8, use the following query:
select CASE
INSTR (
RAWTOHEX (
utl_raw.cast_to_raw (
utl_i18n.raw_to_char (
utl_raw.cast_to_raw ( <your_column> )
, 'utf8'
)
)
)
, 'EFBFBD'
)
WHEN 0 THEN 'OK'
ELSE 'FAIL'
END
from <your_table>
;
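As a quick illustration outside the database (Python here, not Oracle code) of why this check works: single-byte character sets assign a code point to every byte value, so decoding them can never fail, whereas strict UTF-8 decoding rejects malformed sequences, and decoding with replacement substitutes U+FFFD instead:

```python
# ISO-8859-1 maps all 256 byte values to U+0000..U+00FF, so
# "conversion" from it can never fail -- it can only be wrong.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin-1")) == 256  # never raises

# Strict UTF-8 decoding, by contrast, rejects illegal sequences:
try:
    b"\xe9abc".decode("utf-8")  # 0xE9 needs continuation bytes
    raise AssertionError("should not decode")
except UnicodeDecodeError:
    pass

# Decoding with replacement substitutes U+FFFD (EF BF BD in UTF-8),
# which is what the query above scans for.
assert "\ufffd" in b"\xe9abc".decode("utf-8", errors="replace")
```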
This assumes that your database character set is AL32UTF8.
Note that EF BF BD is the UTF-8 encoding of the Unicode replacement character (U+FFFD), which utl_i18n.raw_to_char substitutes for byte sequences that are illegal in UTF-8; finding it in the round-tripped value therefore signals invalid input.
Since all the other character sets you indicate are byte-oriented, transformation to Unicode will never fail, but it may produce different code points. Without contextual information, automated determination of the actual source character set won't be possible.
Best regards, Carsten
PS: Oracle names for these character sets:
- CP1252 -> WE8MSWIN1252
- Latin-1 -> WE8ISO8859P1
Comments
-
theory, almost 2 years ago:
For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database, where the character encoding is set to AL32UTF8, that contains some non-UTF-8 values. The values are always in one of these character sets:
- US-ASCII
- UTF-8
- CP1252
- Latin-1
I've written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now I would like to replicate this functionality inside the database so that anyone can use it.
However, all I can find for this is the CONVERT function, which never fails but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed. Therefore, I have two questions:
- Is there some existing interface that tries to convert a string into one of a list of encodings, returning the first that succeeds?
- And if not, is there some other interface that indicates failure when it's not able to convert a string to an encoding? If so, I could write the previous function myself.
UPDATE:
For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:
CREATE OR REPLACE FUNCTION encoding_utf8( bytea ) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
    encoding TEXT;
BEGIN
    FOREACH encoding IN ARRAY ARRAY[ 'UTF8', 'WIN1252', 'LATIN1' ] LOOP
        BEGIN
            RETURN convert_from($1, encoding);
        EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
            CONTINUE;
        END;
    END LOOP;
END;
$$;
I'd dearly love to know how to do the equivalent in Oracle.
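For comparison, the same strict-decode-with-fallback loop can be sketched in Python (illustrative only; Python's strict decoders raise where PostgreSQL's convert_from would error out):

```python
# Illustrative Python analogue of the PL/pgSQL encoding_utf8() above:
# strict decoding raises UnicodeDecodeError on invalid input, so the
# first encoding that decodes cleanly is returned.
def encoding_utf8(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    for enc in encodings:
        try:
            return raw.decode(enc)  # strict mode: raises on bad bytes
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("input not decodable by any candidate encoding")
```

Note the fallback order matters: `b"\x80"` fails strict UTF-8 but decodes under CP1252 as the euro sign, while `b"\x81"` (undefined in CP1252) falls through to Latin-1.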
-
theory, over 11 years ago:
Yeah, we don’t know what the original character set was, so I just want to get the values UTF-8 clean. Based on your suggestion, as well as an initial implementation by a co-worker, I've come up with a function I think is pretty close to what I need. I will post it in a separate answer.
-
collapsar, over 11 years ago:
What happens in your code is that you first try to convert your input data to AL32UTF8 by calling CONVERT, then check whether the operation succeeded. However, for byte-oriented character sets, which CP1252 happens to be (every character is encoded in exactly one byte), conversion to Unicode will never fail. Thus your check will succeed and the function reencode will exit. Please note that it is impossible to distinguish between byte-encoded source charsets by means of successful conversion to Unicode; you need contextual information to do that. Regards.
-
collapsar, over 11 years ago:
(cont'd) 1) Technically, my statement only applies to (byte-encoded) charsets whose glyphs are incorporated into Unicode; I am not aware of any charset that doesn't meet this criterion (hints appreciated). 2) To identify the source charset, you may 2a) in the particular case of Latin-1 vs. CP1252, check for bytes that are not mapped to glyphs in Latin-1 (0x7F-0x9F), or 2b) in general, check for sequences instead of individual characters. Example: A4 -> EURO SIGN (Latin-15) / CURRENCY SIGN (CP1252). The latter won't occur after numbers in ordinary text, so <numbers> A4 would indicate Latin-15.
-
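Collapsar's heuristic 2a can be sketched in Python (illustrative; the function name is made up, and this sketch checks the 0x80-0x9F range): ISO-8859-1 assigns only C1 control codes to those bytes, while CP1252 maps most of them to printable glyphs, so their presence in text data hints at a CP1252 source:

```python
# Bytes 0x80-0x9F are C1 control codes in ISO-8859-1 but mostly
# printable glyphs (curly quotes, dashes, the euro sign) in CP1252,
# so their presence in text data suggests a CP1252 source.
def looks_like_cp1252(raw: bytes) -> bool:
    return any(0x80 <= b <= 0x9F for b in raw)
```

This is only a hint, not proof: text that happens to contain no characters from that range decodes identically under both charsets.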
theory, over 11 years ago:
Alas, I have no contextual information, and so am just cleaning up some old stuff. A few thousand converted to CP1252 out of over a billion records is something we can live with.