Why can't I convert a UTF-8 file to MS-ANSI using iconv?
$ printf '\xef\xbb\xbf' | uconv -x any-name
\N{ZERO WIDTH NO-BREAK SPACE}
That's a character (U+FEFF, encoded as 3 bytes in UTF-8) which is also used as a byte-order mark (BOM). In any case, that character does not exist in MS-ANSI (an improper name sometimes given to windows-1252, a superset of ISO 8859-1), so it cannot be converted to that charset.
A BOM is used (at the beginning of some text) to differentiate UTF-16LE from UTF-16BE (or other encodings whose code units span several bytes and are therefore affected by endianness). It makes no sense in UTF-8, where there is no byte-order ambiguity, and it would make even less sense in windows-1252, which is a single-byte charset. As a "zero width no-break space", it is also invisible and has no word-separation property like the "zero width space" character would have, so it is probably safe to remove it altogether.
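To make the byte-order point concrete, here is a small sketch that dumps the bytes of U+FEFF in different encodings. `hex_bytes` is a hypothetical helper, not a standard tool, and the sketch assumes an `iconv` that passes a leading U+FEFF in UTF-8 input through unchanged (as the examples in this answer suggest); other implementations may behave differently.

```shell
# hex_bytes: hypothetical helper that prints stdin as space-separated hex bytes.
# The command substitution is left unquoted on purpose, so that the shell
# collapses od's padding and newlines into single spaces.
hex_bytes() {
  echo $(od -An -tx1)
}

# U+FEFF as UTF-8: three bytes, the same on every machine.
printf '\357\273\277' | hex_bytes

# U+FEFF as UTF-16: the two bytes swap depending on the byte order,
# which is what lets a reader tell little-endian from big-endian.
printf '\357\273\277' | iconv -f UTF-8 -t UTF-16LE | hex_bytes
printf '\357\273\277' | iconv -f UTF-8 -t UTF-16BE | hex_bytes
```

U+FEFF serialises as FF FE in UTF-16LE and as FE FF in UTF-16BE, whereas its UTF-8 form (EF BB BF) carries no byte-order information.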
With zsh, bash or ksh93:
sed $'s/\ufeff//g' < input | iconv -t windows-1252
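Where the `$'\ufeff'` quoting is not available (plain POSIX sh, or an older bash), the same removal can be sketched by spelling out the three UTF-8 bytes of U+FEFF with `printf`. The function name `strip_feff` is hypothetical, and running `sed` under `LC_ALL=C` to get byte-wise matching is an assumption that should hold for common implementations such as GNU sed.

```shell
# strip_feff: hypothetical helper that deletes every U+FEFF
# (UTF-8 bytes EF BB BF, octal 357 273 277) from standard input.
strip_feff() {
  # Build the 3-byte sequence with printf so no $'...' quoting is needed.
  feff=$(printf '\357\273\277')
  # LC_ALL=C asks sed to treat the input as bytes rather than characters.
  LC_ALL=C sed "s/$feff//g"
}

# Example: the bytes from the question, with the U+FEFF removed.
printf '\357\273\2778:na\t8:\n' | strip_feff
```

It would then be used in the same way as the one-liner above, for instance `strip_feff < input | iconv -f UTF-8 -t windows-1252`.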
With some iconv implementations, you can also use:
iconv -t windows-1252//translit < input
//translit resorts to approximations when the text cannot be faithfully converted. In this case, it just removes the U+FEFF character.
$ printf '\xef\xbb\xbf\x38\x3a\x6e\x61\x09\x38\x3a' |
iconv -t windows-1252//translit | hd
00000000 38 3a 6e 61 09 38 3a |8:na.8:|
00000007
Another option could be to use:
iconv -t utf-16le | iconv -f utf-16 -t windows-1252
The first iconv converts to UTF-16 little-endian without a BOM, but the initial U+FEFF turns the result into actual UTF-16 with a BOM, so the second iconv strips that BOM, as it uses it to determine the byte order of the utf-16 encoding. Note that only a U+FEFF at the very start of the input is treated as a BOM this way.
iconvertor
Updated on September 18, 2022
Comments
- iconvertor over 1 year
I am trying to convert a file from utf-8 to ms-ansi.
I use
iconv -f UTF8 -t MS-ANSI// < data.txt
but get
iconv: illegal input sequence at position 171359
When I look at the data at that position with
dd if=data.txt of=error.txt bs=1 count=10 skip=171359
I get this:
hexdump -C error.txt
00000000 ef bb bf 38 3a 6e 61 09 38 3a |...8:na.8:|
0000000a
Is the file not UTF-8? And if not, what should I use instead with iconv?