Why can't I convert a UTF-8 file to MS-ANSI using iconv?


$ printf '\xef\xbb\xbf' | uconv -x any-name
\N{ZERO WIDTH NO-BREAK SPACE}

That's a character (U+FEFF, encoded as 3 bytes in UTF-8) which is also used as a byte-order mark. In any case, that character does not exist in MS-ANSI (an improper name sometimes given to windows-1252, a superset of iso8859-1 as far as printable characters go), so it cannot be converted to that charset.

BOMs are used (at the beginning of some text) to differentiate UTF-16LE from UTF-16BE (or other encodings whose code units are larger than a byte and so are affected by CPU endianness). A BOM makes no sense in UTF-8, where there is no byte-order ambiguity, and it would make even less sense in windows-1252, which is a single-byte charset. As a "zero width no-break space", the character is also invisible and, unlike the "zero width space" character, has no word-separation property, so it's probably safe to remove it altogether.
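
Before removing the character, it can help to see where it actually occurs, since in the question it turns up well inside the file rather than at the start. A minimal sketch, assuming a grep that supports the -o, -b and -a extensions (GNU grep does); find_feff is a made-up helper name:

```shell
# find_feff FILE: print the byte offset of every UTF-8 encoded U+FEFF
# (bytes EF BB BF) in FILE. find_feff is a hypothetical helper name.
# Assumes grep supports -o, -b and -a (GNU grep does; POSIX does not require them).
find_feff() {
    # Build the 3-byte sequence with octal escapes, which POSIX printf supports.
    feff=$(printf '\357\273\277')
    # LC_ALL=C: match raw bytes; -a: treat the file as text even if it looks binary;
    # -o -b: print "offset:match" for each match; cut keeps only the offset.
    LC_ALL=C grep -oba -e "$feff" -- "$1" | LC_ALL=C cut -d: -f1
}
```

For the data.txt from the question, `find_feff data.txt` should list 171359 among its offsets, since that is where iconv stopped.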

With zsh, bash or ksh93:

sed $'s/\ufeff//g' < input | iconv -t windows-1252
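
The $'\ufeff' quoting above is a shell extension. Where it isn't available, a minimal sketch along the same lines builds the three bytes with printf and lets sed work on raw bytes (strip_feff is a made-up helper name, not a standard tool):

```shell
# strip_feff: copy stdin to stdout with every UTF-8 encoded U+FEFF
# (bytes EF BB BF) removed. strip_feff is a hypothetical helper name.
strip_feff() {
    # Build the 3-byte sequence with octal escapes, which POSIX printf supports.
    feff=$(printf '\357\273\277')
    # LC_ALL=C makes sed treat the input as plain bytes rather than characters.
    LC_ALL=C sed "s/$feff//g"
}
```

It would be used the same way as the sed command above: `strip_feff < input | iconv -t windows-1252`.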

With some iconv implementations, you can also use:

iconv -t windows-1252//translit < input

//translit resorts to approximations when a character cannot be converted faithfully. In this case, it simply removes the U+FEFF character.

$ printf '\xef\xbb\xbf\x38\x3a\x6e\x61\x09\x38\x3a' |
    iconv -t windows-1252//translit | hd
00000000  38 3a 6e 61 09 38 3a                              |8:na.8:|
00000007

Another option could be to use:

iconv -t utf-16le < input | iconv -f utf-16 -t windows-1252

The first iconv converts to little-endian UTF-16 without adding a BOM, but the initial U+FEFF turns its output into genuine UTF-16 with a BOM. The second iconv then uses that BOM to determine the byte order of the UTF-16 input and strips it.
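
To check that on the sample bytes from the question, here is a small sketch. via_utf16 is a made-up helper name, and it assumes an iconv that knows the utf-16le, utf-16 and windows-1252 names (glibc's does). As far as I can tell, this only takes care of a U+FEFF at the very start of the input, since that is the only place where it is treated as a BOM.

```shell
# via_utf16: convert UTF-8 on stdin to windows-1252 on stdout,
# dropping a leading U+FEFF on the way. via_utf16 is a hypothetical helper name.
via_utf16() {
    # Step 1: UTF-8 -> UTF-16LE. iconv adds no BOM here, but a leading U+FEFF
    #         in the input comes out as the bytes FF FE, which is a valid BOM.
    # Step 2: "utf-16" reads that BOM to pick the byte order and drops it.
    iconv -f utf-8 -t utf-16le | iconv -f utf-16 -t windows-1252
}

# Sample bytes from the question, written with octal escapes for portability.
printf '\357\273\2778:na\t8:' | via_utf16 | od -An -c
```

The od output should show only the seven bytes of "8:na", a tab and "8:", with nothing in front of them.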


iconvertor

Updated on September 18, 2022

Comments

iconvertor over 1 year

I am trying to convert a file from utf-8 to ms-ansi.

I use

  iconv -f UTF8 -t MS-ANSI// < data.txt

but get

  iconv: illegal input sequence at position 171359

When I look into this with

 dd if=data.txt of=error.txt bs=1 count=10 skip=171359

I get this:

 hexdump -C error.txt
 00000000  ef bb bf 38 3a 6e 61 09  38 3a                    |...8:na.8:|
 0000000a

Is the file not UTF-8? And if not, what should I use instead with iconv?
