Why can't I convert a UTF-8 file to MS-ANSI using iconv?

$ printf '\xef\xbb\xbf' | uconv -x any-name
\N{ZERO WIDTH NO-BREAK SPACE}

That's a character (U+FEFF, encoded as 3 bytes in UTF-8) which is also used as a byte-order mark (BOM). In any case, that character does not exist in MS-ANSI (an improper name sometimes given to windows-1252, a superset of iso8859-1), so it cannot be converted to that charset.

BOMs are used (at the beginning of some text) to differentiate UTF-16LE from UTF-16BE (or other encodings whose code units are wider than a byte and are therefore affected by CPU endianness). A BOM makes no sense in UTF-8, where there is no byte-order ambiguity, and it would make even less sense in windows-1252, which is a single-byte charset. As a "zero width no-break space", it is also invisible and, unlike the "zero width space" character, has no word-separation property, so it's probably safe to remove it altogether.
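Before removing anything, it can help to check where the U+FEFF characters actually sit in the file. A minimal sketch, assuming GNU grep (the -o, -b and -a options are not required by POSIX); the function name bom_offsets is made up for illustration:

```shell
# bom_offsets FILE: print the byte offset of every UTF-8-encoded U+FEFF
# (bytes EF BB BF) found in FILE, one offset per line.
# Assumption: GNU grep; -o, -b and -a are extensions to POSIX grep.
bom_offsets() {
  bom=$(printf '\357\273\277')   # EF BB BF in octal, which POSIX printf accepts
  # C locale so the pattern is matched as raw bytes; grep prints "offset:match"
  LC_ALL=C grep -oba -- "$bom" "$1" | cut -d: -f1
}
```

On the file from the question, bom_offsets data.txt should list 171359 among the offsets, since that is where iconv stopped.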

With zsh, bash or ksh93:

sed $'s/\ufeff//g' < input | iconv -t windows-1252
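The $'...' quoting with \u is not available in every shell (plain POSIX sh, for instance). A sketch of the same filter without it, spelling out the three bytes in octal; strip_feff is a made-up name, and running sed under LC_ALL=C is an assumption made here so that the bytes are matched literally whatever the locale:

```shell
# strip_feff: copy stdin to stdout, deleting every UTF-8-encoded U+FEFF
# (bytes EF BB BF), wherever it appears on a line.
strip_feff() {
  bom=$(printf '\357\273\277')   # octal escapes are understood by any POSIX printf
  LC_ALL=C sed "s/$bom//g"       # C locale: treat the pattern as three raw bytes
}
```

It would then be used as: strip_feff < input | iconv -f utf-8 -t windows-1252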

With some iconv implementations, you can also use:

iconv -t windows-1252//translit < input

//translit resorts to approximations when the text cannot be faithfully translated. In this case, it simply drops the U+FEFF character.

$ printf '\xef\xbb\xbf\x38\x3a\x6e\x61\x09\x38\x3a' |
    iconv -t windows-1252//translit | hd
00000000  38 3a 6e 61 09 38 3a                              |8:na.8:|
00000007

Another option could be to use:

iconv -t utf-16le < input | iconv -f utf-16 -t windows-1252

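The first iconv converts to little-endian UTF-16 without adding a BOM, but the U+FEFF already at the start of the text gets encoded as the bytes FF FE, which is exactly a UTF-16 BOM. The second iconv, decoding as utf-16, uses it to determine the byte order and strips it. Note that this only takes care of a U+FEFF at the very start of the input.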
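To see why this works, it can help to look at the intermediate bytes. A small sketch; the function names are made up, and -f utf-8 is added as an assumption so that the result does not depend on the locale:

```shell
# utf16le_hex: show, in hex, what the first iconv of the pipeline produces.
utf16le_hex() {
  iconv -f utf-8 -t utf-16le | od -An -tx1 | tr -d ' \n'
}

# via_utf16: the full pipeline; the second iconv reads the leading FF FE
# as a BOM and does not pass it on.
via_utf16() {
  iconv -f utf-8 -t utf-16le | iconv -f utf-16 -t windows-1252
}

# U+FEFF followed by "A": the U+FEFF comes out as ff fe, i.e. a UTF-16 BOM.
printf '\357\273\277A' | utf16le_hex; echo
```

With glibc's iconv, the last command prints fffe4100: the BOM bytes followed by "A" as a 16-bit little-endian unit.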

Author: iconvertor

Updated on September 18, 2022

Comments

  • iconvertor

    I am trying to convert a file from utf-8 to ms-ansi.

    I use

      iconv -f UTF8 -t MS-ANSI// < data.txt
    

    but get

      iconv: illegal input sequence at position 171359
    

    When I look at that position with

     dd if=data.txt of=error.txt bs=1 count=10 skip=171359
    

    I get this:

     hexdump -C error.txt
     00000000  ef bb bf 38 3a 6e 61 09  38 3a                    |...8:na.8:|
     0000000a
    

    Is the file not UTF-8, and if not, what should I use instead with iconv?