Process a file that starts with a BOM (FF FE)

text-processing character-encoding unicode

13,744

Solution 1

According to this Wikipedia article, FF FE is the UTF-16LE byte order mark, so tell iconv to convert from UTF-16LE to UTF-8:

iconv -f UTF-16LE -t UTF-8 dotan.csv > fixed.txt
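
With an explicit UTF-16LE source, some iconv builds may carry the BOM through and write it out as a UTF-8 BOM (EF BB BF), which is what one of the comments below reports. A minimal sketch of the alternative, assuming a GNU iconv (glibc or libiconv): name the source as plain UTF-16 so that iconv reads the BOM to pick the byte order and then drops it. The file names sample.csv and fixed.txt are only placeholders for illustration.

```shell
# Build a tiny sample: the FF FE BOM followed by "Ad" encoded as UTF-16LE.
printf '\377\376A\000d\000' > sample.csv

# With -f UTF-16 (no LE/BE suffix), iconv uses the BOM to choose the
# byte order and does not copy it into the output.
iconv -f UTF-16 -t UTF-8 sample.csv > fixed.txt

# Dump the result as hex bytes to confirm that it starts with the text
# itself and not with a UTF-8 BOM (ef bb bf).
od -An -tx1 fixed.txt
```

If the hex dump still begins with ef bb bf, the BOM survived the conversion and has to be stripped separately.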

Solution 2

dos2unix also removes BOMs and converts UTF-16 to UTF-8:

$ printf %s あ|recode ..utf16 >a;xxd -p a;dos2unix a;xxd -p a
feff3042
dos2unix: converting file a to Unix format...
e38182

dos2unix also removes UTF-8 BOMs:

$ printf %b '\xef\xbb\xbfa'>a;dos2unix a;xxd -p a
dos2unix: converting file a to Unix format...
61

Solution 3

Also answered on Stack Overflow: How can I remove the BOM from a UTF-8 file? @rici has a good answer there.

Short answer:

sed -i $'1s/^\uFEFF//' file.txt, but not on BSD or OS X.
Another answer: vi file.txt, :set nobomb, :w (simple but manual).
Install dos2unix, then run dos2unix -r file.txt.
These marks have several possible meanings, including just that the file is UTF-8; see the Wikipedia article.
Windows programs love to add these marks. Most editors will not remove these marks.
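
Since the sed one-liner depends on GNU sed and a shell that understands $'...', here is a rough sketch of a more portable route that uses only head, od, and tail. It assumes the file is UTF-8 and that head -c is available (it is on GNU, BSD, and macOS, though not strictly POSIX). The names bom.txt and nobom.txt are made up for the example.

```shell
# Create a sample file that starts with the UTF-8 BOM (EF BB BF).
printf '\357\273\277hello\n' > bom.txt

# Read the first three bytes as hex and compare them with the UTF-8 BOM.
first3=$(head -c 3 bom.txt | od -An -tx1 | tr -d ' \n')

if [ "$first3" = "efbbbf" ]; then
    # tail -c +4 starts output at the fourth byte, skipping the BOM.
    tail -c +4 bom.txt > nobom.txt
else
    # No BOM found: keep the content unchanged.
    cp bom.txt nobom.txt
fi

# Show the cleaned file character by character.
od -An -c nobom.txt
```

Checking the first three bytes before cutting matters: running tail -c +4 on a file without a BOM would silently eat three real characters.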

dotancohen

Updated on September 18, 2022

Comments

dotancohen over 1 year
I received a .csv file with the FF FE BOM:
```
$ head -n1 dotan.csv | hd
00000000  ff fe 41 00 64 00 20 00  67 00 72 00 6f 00 75 00  |..A.d. .g.r.o.u.|
```
When using awk to parse it I'm getting a bunch of null bytes, which I suspect is due to the byte order. How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?

Note that I think that this file is only ASCII characters (except for the BOM), but I cannot confirm that as grep thinks that it is a binary file:
```
$ grep -P '^[\x00-\x7f]' dotan.csv 
Binary file dotan.csv matches
```
Searching for the same string in VIM shows every character matching!

Using iconv to convert to ASCII does not get rid of the \x00 values. It actually makes the problem worse, as now they look like null bytes instead of UTF-8!
```
$ iconv -f UTF-8 -t ASCII dotan.csv > fixed.txt 
iconv: illegal input sequence at position 0

$ iconv -f UTF-8 -t ASCII//IGNORE dotan.csv > fixed.txt

$ head -n1 fixed.txt | hd
00000000  41 00 64 00 20 00 67 00  72 00 6f 00 75 00 70 00  |A.d. .g.r.o.u.p.|
```
How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?
- cuonglm almost 10 years
  
  Was the CSV file created on Windows or Mac?
- cuonglm almost 10 years
  
  Can you share a portion of the file?
- dotancohen almost 10 years
  
  Here is a link to an anonymized portion of the file which preserves the unique problems with it. Thank you!
dotancohen almost 10 years

Perfect, thank you! I had the UTF-8 and UTF-16 BOMs mixed up: I thought that FFFE and FEFF were UTF-8 BOMs. Actually, those are the UTF-16 BOMs, and I never knew about the (useless) UTF-8 BOM!
cuonglm almost 10 years

@dotancohen: I tested on my Fedora and the tail solution works fine. What OS do you use?
Kenny Evitt about 7 years

This doesn't work (i.e. it doesn't remove the BOM) with version "iconv (GNU libiconv 1.14)" in Git Bash on Windows. But (for whatever reason) using just UTF-16 instead of one of the byte-order versions works.