Process a file that starts with a BOM (FF FE)
Solution 1
According to this Wikipedia article, FF FE means UTF-16LE. So you should tell iconv to convert from UTF-16LE to UTF-8:
iconv -f UTF-16LE -t UTF-8 dotan.csv > fixed.txt
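The mapping from BOM bytes to encoding can be checked on the command line before converting. The following is a minimal sketch, not part of the original answer: `bom_encoding` and `utf16_to_utf8` are hypothetical helper names, the sample file is generated on the spot, and UTF-32 BOMs are deliberately not handled. It also assumes an iconv that accepts the generic `UTF-16` name and uses the BOM to choose the byte order. Depending on the iconv implementation, naming an explicit byte order such as `UTF-16LE` may carry the BOM through into the output as a UTF-8 BOM, so it is worth inspecting the first bytes of the result.

```shell
# Hypothetical helper: name the encoding implied by a file's leading BOM.
# Only UTF-8 and UTF-16 BOMs are recognized; UTF-32 is not handled here.
bom_encoding() {
    # First three bytes as lowercase hex without separators, e.g. "fffe41".
    head3=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$head3" in
        efbbbf) echo UTF-8 ;;
        fffe*)  echo UTF-16LE ;;
        feff*)  echo UTF-16BE ;;
        *)      echo unknown ;;
    esac
}

# Hypothetical helper: convert a BOM-prefixed UTF-16 file to UTF-8 on stdout.
# Assumption: this iconv reads the BOM itself when given plain "UTF-16".
utf16_to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1"
}

# Build a tiny sample: BOM FF FE followed by "Ad" and a newline in UTF-16LE.
sample=$(mktemp)
printf '\377\376A\000d\000\n\000' > "$sample"

bom_encoding "$sample"

# Dump the converted bytes; a leftover UTF-8 BOM would show up as "ef bb bf".
utf16_to_utf8 "$sample" | od -An -tx1
```

If the dump of the converted output starts with `ef bb bf`, the BOM survived the conversion and still needs to be removed.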
Solution 2
dos2unix also removes BOMs and converts UTF-16 to UTF-8:
$ printf %s あ|recode ..utf16 >a;xxd -p a;dos2unix a;xxd -p a
feff3042
dos2unix: converting file a to Unix format...
e38182
dos2unix also removes UTF-8 BOMs:
$ printf %b '\xef\xbb\xbfa'>a;dos2unix a;xxd -p a
dos2unix: converting file a to Unix format...
61
Solution 3
Also answered on StackOverflow: How can I remove the BOM from a UTF-8 file? @rici has a good answer.
- Short answer: sed -i $'1s/^\uFEFF//' file.txt, but not on BSD or OS X.
- Another answer: vi file.txt, :set nobomb, :w; simple but manual.
- Install dos2unix; dos2unix -r file.txt
- These marks have several possible meanings, including just that the file is UTF-8; see the Wikipedia article.
- Windows programs love to add these marks. Most editors will not remove these marks.
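Where the sed one-liner is not usable (it is noted above as not working on BSD or OS X) and dos2unix is not installed, the UTF-8 BOM can also be dropped by skipping the first three bytes. The following is a sketch under stated assumptions rather than one of the answers above: `strip_utf8_bom` is a hypothetical helper name, it writes to stdout instead of editing in place, and it relies on `head -c`, which is widely available but not guaranteed everywhere.

```shell
# Hypothetical helper: print a file to stdout without a leading UTF-8 BOM.
strip_utf8_bom() {
    # First three bytes as hex, "efbbbf" when a UTF-8 BOM is present.
    first3=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        tail -c +4 "$1"   # start output at byte 4, skipping the 3-byte BOM
    else
        cat "$1"          # no BOM: pass the file through unchanged
    fi
}

# Build a sample file: UTF-8 BOM followed by "a" and a newline.
with_bom=$(mktemp)
printf '\357\273\277a\n' > "$with_bom"

# Dump the stripped bytes to confirm the BOM is gone.
strip_utf8_bom "$with_bom" | od -An -tx1
```

Checking for the BOM first keeps the helper safe to run on files that never had one, since blindly skipping three bytes would eat real content.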
dotancohen
Updated on September 18, 2022
Comments
-
dotancohen over 1 year
I received a .csv file with the FF FE BOM:
$ head -n1 dotan.csv | hd
00000000 ff fe 41 00 64 00 20 00 67 00 72 00 6f 00 75 00 |..A.d. .g.r.o.u.|
When using awk to parse it I'm getting a bunch of null bytes, which I suspect is due to the byte order. How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?
Note that I think that this file is only ASCII characters (except for the BOM), but I cannot confirm that as grep thinks that it is a binary file:
$ grep -P '^[\x00-\x7f]' dotan.csv
Binary file dotan.csv matches
Searching for the same string in VIM shows every character matching!
Using iconv to convert to ASCII does not get rid of \x00 values, actually it makes the problem worse as now they look like null bytes instead of UTF-8!
$ iconv -f UTF-8 -t ASCII dotan.csv > fixed.txt
iconv: illegal input sequence at position 0
$ iconv -f UTF-8 -t ASCII//IGNORE dotan.csv > fixed.txt
$ head -n1 fixed.txt | hd
00000000 41 00 64 00 20 00 67 00 72 00 6f 00 75 00 70 00 |A.d. .g.r.o.u.p.|
-
cuonglm almost 10 years
Was the CSV file created on Windows or Mac?
-
cuonglm almost 10 years
Can you provide a portion of the file?
-
dotancohen almost 10 years
Here is a link to an anonymized portion of the file which preserves the unique problems with it. Thank you!
-
dotancohen almost 10 years
Perfect, thank you! I had the UTF-8 and UTF-16 BOM mixed up: I thought that FFFE and FEFF were UTF-8 and I never knew the UTF-16 BOM(s). Actually, those are UTF-16 BOMs, and I never knew the (useless) UTF-8 BOM!
-
cuonglm almost 10 years
@dotancohen: I tested on my Fedora and the tail solution works fine. What OS do you use?
-
Kenny Evitt about 7 years
This doesn't work (i.e. remove the BOM) for version "iconv (GNU libiconv 1.14)" in Git Bash on Windows. But (for whatever reason) using just UTF-16 instead of one of the byte-order versions works.