How to detect if a file has a UTF-8 BOM in Bash?


Solution 1

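First, let's demonstrate that head is actually working correctly. The BOM has no visible glyph, so the terminal shows nothing for the first command, but hexdump confirms that all three bytes are there: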

$ printf '\xef\xbb\xbf' >file
$ head -c 3 file 
$ head -c 3 file | hexdump -C
00000000  ef bb bf                                          |...|
00000003

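Now, let's create a working function has_bom. If your grep supports -P, then one option is the following. Setting LC_ALL=C makes grep match raw bytes; in a UTF-8 locale, -P would interpret \xef as the character U+00EF rather than the byte 0xEF, so the pattern would not match the BOM bytes: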

$ has_bom() { head -c3 "$1" | LC_ALL=C grep -qP '\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes

Note that -P is a GNU grep extension; other implementations, such as the BSD grep shipped with macOS, do not support it.

Another option is to use bash's $'...':

$ has_bom() { head -c3 "$1" | grep -q $'\xef\xbb\xbf'; }
$ has_bom file && echo yes
yes

ksh and zsh also support $'...', but this construct was not part of POSIX until the 2024 edition, and dash does not support it.
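
For shells without $'...' (dash, for example), here is a minimal sketch that avoids both -P and $'...'. It is not part of the original answer: it assumes od and tr behave as POSIX describes, and it compares a hex rendering of the first three bytes instead of matching the raw bytes.

```shell
# Portable sketch: succeed if the file named by $1 starts with EF BB BF.
has_bom() {
    # od -An   : suppress the offset column
    #    -tx1  : print each byte as two hex digits
    #    -N3   : read at most three bytes
    # tr strips the whitespace od puts between bytes, leaving e.g. "efbbbf".
    [ "$(od -An -tx1 -N3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}
```

Because the comparison is done on plain ASCII hex digits, the result does not depend on the current locale.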

Notes:

  1. The use of an explicit return $? is optional. The function will, by default, return with the exit code of the last command run.

  2. I have used the POSIX form for defining functions. This is equivalent to the bash form but gives you one less problem to deal with if you ever have to run the function under another shell.

  3. bash does accept the use of the character - in a function name but this is a controversial feature. I replaced it with _ which is more widely accepted. (For more on this issue, see this answer.)

  4. The -q option to grep makes it quiet, meaning that it still sets a proper exit code but it does not send any characters to stdout.
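
To illustrate notes 1 and 4: because the function returns grep's exit status and prints nothing itself, it can be used directly as a condition. The sketch below is an illustration only; list_bom_files is a hypothetical helper name, and the BOM is written with printf octal escapes so the snippet does not rely on $'...'.

```shell
# has_bom returns the exit status of its last command (grep -q), so no
# explicit "return $?" is needed. LC_ALL=C makes grep compare raw bytes.
has_bom() { head -c 3 "$1" | LC_ALL=C grep -q "$(printf '\357\273\277')"; }

# Hypothetical helper: print the name of each argument that starts with a BOM.
list_bom_files() {
    for f in "$@"; do
        if has_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Usage would look like list_bom_files *.txt, which prints one file name per line and nothing for files without a BOM.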

Solution 2

I applied the following to the first line read:

IFS= read -r c
# 65279 is U+FEFF, the BOM code point; this check assumes a UTF-8 locale
if (( $(printf '%d' "'${c:0:1}") == 65279 )); then c="${c:1}"; fi

This simply removes the BOM from the variable.
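
If the goal is to remove the BOM from the file itself, as in the original question, one possible sketch follows. It is an assumption on my part rather than something from either solution: strip_bom is a hypothetical name, the detection uses od instead of grep, and mktemp is assumed to be available.

```shell
# Succeed if the file named by $1 starts with the bytes EF BB BF.
has_bom() {
    [ "$(od -An -tx1 -N3 < "$1" | tr -d '[:space:]')" = "efbbbf" ]
}

# Hypothetical helper: remove a leading UTF-8 BOM from the file named by $1.
strip_bom() {
    has_bom "$1" || return 0      # no BOM, nothing to do
    tmp=$(mktemp) || return 1     # note: tmp is a global variable in POSIX sh
    # tail -c +4 starts output at byte 4, skipping the three BOM bytes.
    # Writing back with cat keeps the original file's inode and permissions.
    if tail -c +4 "$1" > "$tmp" && cat "$tmp" > "$1"; then
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Calling strip_bom on a file without a BOM leaves it untouched, so it is safe to run repeatedly.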

Author: James Ko

Updated on June 14, 2022

Comments

  • James Ko
    James Ko almost 2 years

    I'm trying to write a script that will automatically remove UTF-8 BOMs from a file. I'm having trouble detecting whether the file has one in the first place or not. Here is my code:

    function has-bom {
        # Test if the file starts with 0xEF, 0xBB, and 0xBF
        head -c 3 "$1" | grep -P '\xef\xbb\xbf'
        return $?
    }
    

    For some reason, head seems to be ignoring the BOM in front of the file. As an example, running this

    printf '\xef\xbb\xbf' > file
    head -c 3 file
    

    won't print anything.

    I tried looking for an option in head --help that would let me work around this, but no luck. Is there anything I can do to make this work?

  • James Ko
    James Ko over 8 years
    Huh, never knew Bash supported hex string literals. Anyways, thanks for the great answer!
  • CrazyFrog
    CrazyFrog about 6 years
    Hi, may I ask: in the line head -c 3 file | hexdump -c, what does the -c do? The previous one seems to 1) limit the number of characters output and 2) restrict the line numbers (maybe) to 0000000 and 0000003; but the latter turns the output, which is supposed to be "be bf" etc., into replacement markers. I am using bash and testing on a text file generated under Windows, original encoding GB18030. Thanks.
  • John1024
    John1024 about 6 years
    @CrazyFrog head -c 3 file writes the first three bytes of file to standard out. hexdump -C formats those bytes in a human-friendly way as hexadecimal.
  • CrazyFrog
    CrazyFrog about 6 years
    @John1024 Thank you, I found the manual! It's weird, though: I generate the BOM specifically with code at the beginning of my text file, but this command does not see it.
  • John1024
    John1024 about 6 years
    @CrazyFrog Without seeing the code, I can't tell why it is not generating a BOM for you. You might want to open a question showing the code in detail and documenting the seemingly missing BOM issue.
  • CrazyFrog
    CrazyFrog about 6 years
    @John1024 That is probably what I should do. I will review the code myself again. There has to be a mistake on my part. Thank you for your help!