Process a file that starts with a BOM (FF FE)

Solution 1

According to this Wikipedia article, FF FE is the UTF-16LE BOM, so you should tell iconv to convert from UTF-16LE to UTF-8:

iconv -f UTF-16LE -t UTF-8 dotan.csv > fixed.txt
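Depending on the iconv implementation, naming the byte order explicitly (UTF-16LE) may carry the BOM through into the output as a UTF-8 BOM (EF BB BF); a comment below reports that plain UTF-16 avoided this with GNU libiconv 1.14. The following is a minimal sketch of that variant, assuming an iconv that reads the BOM when given plain UTF-16. The sample file and the tmpdir variable are made up for illustration and are not the original dotan.csv.

```shell
# Hypothetical two-character sample: FF FE BOM followed by "Ad\n" in UTF-16LE.
tmpdir=$(mktemp -d)
printf '\377\376A\000d\000\n\000' > "$tmpdir/sample.csv"

# With plain UTF-16, iconv is expected to use the BOM to pick the byte
# order and not copy it into the UTF-8 output (assumption, see above).
iconv -f UTF-16 -t UTF-8 "$tmpdir/sample.csv" > "$tmpdir/fixed.txt"

# Dump the result as hex bytes to check for leftover 00 bytes or a BOM.
od -An -tx1 "$tmpdir/fixed.txt"
```

If the hex dump still starts with ef bb bf, the BOM survived the conversion and needs to be removed in a separate step.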

Solution 2

dos2unix also removes BOMs and converts UTF-16 to UTF-8:

$ printf %s あ|recode ..utf16 >a;xxd -p a;dos2unix a;xxd -p a
feff3042
dos2unix: converting file a to Unix format...
e38182

dos2unix also removes UTF-8 BOMs:

$ printf %b '\xef\xbb\xbfa'>a;dos2unix a;xxd -p a
dos2unix: converting file a to Unix format...
61
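Before choosing a tool, it can help to check which BOM a file actually starts with. The following is a rough sketch using head, od, and tr; the function name bom_type and the sample files are hypothetical, and UTF-32 BOMs are not handled.

```shell
# Hypothetical helper: report which BOM, if any, starts the given file.
# UTF-32 BOMs are deliberately not handled in this sketch.
bom_type() {
  # Read the first three bytes as a lowercase hex string, e.g. "efbbbf".
  first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
  case "$first" in
    efbbbf) echo "UTF-8" ;;
    fffe*)  echo "UTF-16LE" ;;
    feff*)  echo "UTF-16BE" ;;
    *)      echo "none" ;;
  esac
}

# Made-up sample files for illustration.
tmpdir=$(mktemp -d)
printf '\377\376A\000' > "$tmpdir/le.txt"
printf '\357\273\277a' > "$tmpdir/u8.txt"
printf 'plain' > "$tmpdir/plain.txt"

bom_type "$tmpdir/le.txt"
bom_type "$tmpdir/u8.txt"
bom_type "$tmpdir/plain.txt"
```

The UTF-8 pattern is tested first because it needs all three bytes, while the UTF-16 patterns only look at the first two.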

Solution 3

This is also answered on Stack Overflow in "How can I remove the BOM from a UTF-8 file?", where @rici has a good answer.

Short answer:

  • sed -i $'1s/^\uFEFF//' file.txt, but not on BSD or OS X.
  • Another answer: vi file.txt, :set nobomb, :w. Simple but manual.
  • Install dos2unix; dos2unix -r file.txt
  • These marks have several possible meanings, including just that the file is UTF-8; see the Wikipedia article.
  • Windows programs love to add these marks. Most editors will not remove these marks.
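
Where the sed one-liner is not available, a UTF-8 BOM can also be dropped by skipping the first three bytes. This is only a sketch under that assumption: the function name strip_utf8_bom and the sample file are hypothetical, and it handles the UTF-8 BOM only, not UTF-16.

```shell
# Hypothetical helper: write the file to stdout without a leading UTF-8 BOM.
strip_utf8_bom() {
  # Compare the first three bytes against EF BB BF.
  if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
    # tail -c +4 starts output at the fourth byte, skipping the BOM.
    tail -c +4 "$1"
  else
    # No BOM found: pass the file through unchanged.
    cat "$1"
  fi
}

# Made-up sample file: UTF-8 BOM followed by "a\n".
tmpdir=$(mktemp -d)
printf '\357\273\277a\n' > "$tmpdir/bom.txt"
strip_utf8_bom "$tmpdir/bom.txt" > "$tmpdir/clean.txt"

# Dump the result as hex bytes to confirm the BOM is gone.
od -An -tx1 "$tmpdir/clean.txt"
```

Writing to stdout rather than editing in place avoids depending on sed -i, at the cost of needing a temporary file to replace the original.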
Author: dotancohen

Updated on September 18, 2022

Comments

  • dotancohen
    dotancohen over 1 year

    I received a .csv file with the FF FE BOM:

    $ head -n1 dotan.csv | hd
    00000000  ff fe 41 00 64 00 20 00  67 00 72 00 6f 00 75 00  |..A.d. .g.r.o.u.|
    

    When using awk to parse it I'm getting a bunch of null bytes, which I suspect is due to the byte order. How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?

    Note that I think that this file is only ASCII characters (except for the BOM), but I cannot confirm that as grep thinks that it is a binary file:

    $ grep -P '^[\x00-\x7f]' dotan.csv 
    Binary file dotan.csv matches
    

    Searching for the same string in VIM shows every character matching!

    Using iconv to convert to ASCII does not get rid of the \x00 values. It actually makes the problem worse, as now they look like null bytes instead of UTF-8!

    $ iconv -f UTF-8 -t ASCII dotan.csv > fixed.txt 
    iconv: illegal input sequence at position 0
    
    $ iconv -f UTF-8 -t ASCII//IGNORE dotan.csv > fixed.txt
    
    $ head -n1 fixed.txt | hd
    00000000  41 00 64 00 20 00 67 00  72 00 6f 00 75 00 70 00  |A.d. .g.r.o.u.p.|
    

    How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?

    • cuonglm
      cuonglm almost 10 years
      Was the CSV file created on Windows or on a Mac?
    • cuonglm
      cuonglm almost 10 years
      Can you share a portion of the file?
    • dotancohen
      dotancohen almost 10 years
      Here is a link to an anonymized portion of the file which preserves the unique problems with it. Thank you!
  • dotancohen
    dotancohen almost 10 years
    Perfect, thank you! I had the UTF-8 and UTF-16 BOMs mixed up: I thought that FFFE and FEFF were UTF-8, and I never knew the UTF-16 BOM(s). Actually, those are UTF-16 BOMs, and I never knew the (useless) UTF-8 BOM!
  • cuonglm
    cuonglm almost 10 years
    @dotancohen: I tested on my Fedora machine and the tail solution works fine. What OS do you use?
  • Kenny Evitt
    Kenny Evitt about 7 years
    This doesn't work (i.e. it doesn't remove the BOM) for version "iconv (GNU libiconv 1.14)" in Git Bash on Windows. But (for whatever reason) using just UTF-16 instead of one of the byte-order versions works.