Using awk to remove the Byte-order mark

83,519

Solution 1

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, using the knowledge that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.

Enjoy!

-- ADDENDUM --

Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF   |  UTF-32, big-endian
FF FE 00 00   |  UTF-32, little-endian
FE FF         |  UTF-16, big-endian
FF FE         |  UTF-16, little-endian
EF BB BF      |  UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to EF BB BF UTF-8 BOM bytes from the above table.

Solution 2

Using GNU sed (on Linux or Cygwin):

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

On FreeBSD:

sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.

On Mac:

This awk solution in another answer works, but the sed command above does not work. At least on Mac (Sierra) sed documentation does not mention supporting hexadecimal escaping ala \xef.

A similar trick can be achieved with any program by piping to the sponge tool from moreutils:

awk '…' INFILE | sponge INFILE

Solution 3

Not awk, but simpler:

tail -c +4 UTF8 > UTF8.nobom

To check for BOM:

hd -n 3 UTF8

If BOM is present you'll see: 00000000 ef bb bf ...

Solution 4

In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:

dos2unix *.txt

dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:

$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
   bom-utf8 efbbbfc3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
   bom-utf8 c3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a

Solution 5

I know the question was directed at unix/linux, thought it would be worth to mention a good option for the unix-challenged (on windows, with a UI).
I ran into the same issue on a WordPress project (BOM was causing problems with rss feed and page validation) and I had to look into all the files in a quite big directory tree to find the one that was with BOM. Found an application called Replace Pioneer and in it:

Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready made search and replace template for this).

It was not the most elegant solution and it did require installing a program, which is a downside. But once I found out what was going around me, it worked like a charm (and found 3 files out of about 2300 that were with BOM).

Share:
83,519

Related videos on Youtube

Boldewyn
Author by

Boldewyn

Boldewyn is the name of the donkey in German fables (at least, the ones from Goethe). Despite its bad name, a donkey is an animal with its own head and quite a portion of wit. As someone pointed out once, if you say to a horse to jump down that abyss, it would happily do so, whereas the donkey would give you a kick where you deserve it to. And someone else pointed out, that laziness is a core requirement for a good developer...

Updated on July 08, 2022

Comments

  • Boldewyn
    Boldewyn almost 2 years

    How would an awk script (presumably a one-liner) for removing a BOM look like?

    Specification:

    • print every line after the first (NR > 1)
    • for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest
  • Boldewyn
    Boldewyn almost 15 years
    It seems that the dot in the middle of the sub statement is too much (at least, my awk complains about it). Beside this it's exactly what I searched, thanks!
  • Boldewyn
    Boldewyn almost 15 years
    This solution, however, works only for UTF-8 encoded files. For others, like UTF-16, see Wikipedia for the corresponding BOM representation: en.wikipedia.org/wiki/Byte_order_mark
  • Brandon Rhodes
    Brandon Rhodes over 14 years
    I agree with the earlier comment; the dot does not belong in the middle of this statement and makes this otherwise great little snippet an example of an awk syntax error.
  • Axel M. Garcia
    Axel M. Garcia about 14 years
    So: awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' INFILE > OUTFILE and make sure INFILE and OUTFILE are different!
  • tchrist
    tchrist over 12 years
    BOMs are 2 bytes for UTF-16 and 4 bytes for UTF-32, and of course have no business being in UTF-8 in the first place.
  • tchrist
    tchrist over 12 years
    If you used perl -i.orig -pe 's/^\x{FFFE}//' badfile you could rely on your PERL_UNICODE and/or PERLIO envariables for the encoding. PERL_UNICODE=SD would work for UTF-8; for the others, you’d need PERLIO.
  • Karoly Horvath
    Karoly Horvath about 12 years
    @tchrist: from wikipedia: "The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8 so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8."
  • tchrist
    tchrist about 12 years
    @KarolyHorvath Yes, precisely. Its use is not recommended. It breaks stuff. The encoding should be specified by a higher-level protocol.
  • Karoly Horvath
    Karoly Horvath about 12 years
    @tchrist: you mean it breaks broken stuff? :) proper apps should be able to handle that BOM.
  • tchrist
    tchrist about 12 years
    @KarolyHorvath I mean it breaks lots of programs. Isn’t that what I said? When you open a stream in the UTF-16 or UTF-32 encodings, the decoder knows not to count the BOM. When you use UTF-8, decoders present the BOM as data. This is a syntax error in innumerable programs. Even Java’s decoder behaves this way, BY DESIGN! BOMs on UTF-8 files are misplaced and a pain in the butt: they are an error! They break many things. Even just cat file1.utf8 file2.utf8 file3.utf3 > allfiles.utf8 will be broken. Never use a BOM on UTF-8. Period.
  • Karoly Horvath
    Karoly Horvath about 12 years
    @@tchrist:: that's what you said... and I said something else. BTW Java's decoder isn't broken by design. it's a bug and they kept it for backward compatibility.
  • mklement0
    mklement0 over 11 years
    hd is not available on OS X (as of 10.8.2), so to check for an UTF-8 BOM there you can use the following: head -c 3 file | od -t x1.
  • Hakanai
    Hakanai over 11 years
    I tried the second command precisely on Mac OS X and the result was "success", but the substitution didn't actually occur.
  • Denilson Sá Maia
    Denilson Sá Maia over 11 years
    It is worth noting these commands replace one specific byte sequence, which is one of the possible byte-order-marks. Maybe your file had a different BOM sequence. (I can't help other than that, as I don't have a Mac)
  • TrueY
    TrueY almost 11 years
    Maybe a little bit shorter version: awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1'
  • Benoit Duffez
    Benoit Duffez over 9 years
    if [[ "file a.txt | grep -o 'with BOM'" == "BOM" ]]; can also be used
  • Hoàng Long
    Hoàng Long almost 9 years
    I am so happy when I found your solution, however I don't have the privilege to install software on company computer. Took lots of time today, until I figure out the alternative: Using Notepad++ with PythonScript plugin . superuser.com/questions/418515/… Thanks anyway!
  • John Wiseman
    John Wiseman over 8 years
    When I tried the second command on OS X on a file that used 0xef 0xbb 0xbf as the BOM, it did not actually do the substitution.
  • Seldom 'Where's Monica' Needy
    Seldom 'Where's Monica' Needy almost 8 years
    hexdump and xdd should work in place of hd, if that's not available on your system.
  • Ian
    Ian over 7 years
    In OSX, I could only get this to work via perl, as shown here: stackoverflow.com/a/9101056/2063546
  • user963601
    user963601 over 7 years
    Works great on OS X El Capitan 10.11.6.
  • user963601
    user963601 over 7 years
    On OS X El Capitan 10.11.6, this doesn't work, but the official answer stackoverflow.com/a/1068700/9636 works fine.
  • OroshiX
    OroshiX almost 6 years
    /!\ Both commands erased my file, as a "side-effect" of changing the encoding... Quite fortunate to have had them backed up first.
  • Blayzeing
    Blayzeing over 3 years
    If you're trying to just change the file (not create a new one) and for some reason can't use sed (as per the answer below), make sure to use -i inplace and not put the input file as the output file, which will erase the file!