How to remove multiple UTF-8 BOM sequences

85,529

Solution 1

Use the following function to remove a UTF-8 BOM from the start of a string:

//Remove UTF8 Bom

function remove_utf8_bom($text)
{
    $bom = pack('H*','EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}

Solution 2

Try reading the file, removing every BOM sequence, and then decoding the JSON:

// -------- read the file-content ----
$str = file_get_contents($source_file); 

// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str); 

// -------- get the Object from JSON ---- 
$obj = json_decode($str); 

:)
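
Note that str_replace() removes the three-byte sequence wherever it occurs, so it also copes with several BOMs in a row. The following is a minimal sketch of the same steps, assuming an inline sample string instead of a file (the JSON content is made up for illustration), with an explicit check for a failed decode:

```php
<?php
// Hypothetical input: two BOMs in front of a small JSON document.
$str = "\xEF\xBB\xBF\xEF\xBB\xBF" . '{"name":"example"}';

// str_replace() removes every occurrence of the three-byte sequence,
// wherever it appears in the string.
$clean = str_replace("\xEF\xBB\xBF", '', $str);

$obj = json_decode($clean);

// json_decode() returns null on failure, so check the error state explicitly.
if ($obj === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

echo $obj->name, "\n";
```

Because the replacement is not anchored to the start of the string, it would also remove a BOM sequence that appears in the middle of the data.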

Solution 3

Another way is to remove the BOM as a Unicode code point, U+FEFF:

$str = preg_replace('/\x{FEFF}/u', '', $file);
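
With the /u modifier the pattern and the subject are treated as UTF-8, so \x{FEFF} matches the byte sequence EF BB BF. The sketch below wraps the same call in a helper (the function name is hypothetical) and handles one edge case: with /u, preg_replace() returns null when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper: strips every U+FEFF code point from a UTF-8 string.
function strip_all_bom_codepoints($text)
{
    // The pattern is not anchored, so every occurrence is removed,
    // not only the ones at the start of the string.
    $result = preg_replace('/\x{FEFF}/u', '', $text);

    // On invalid UTF-8 input preg_replace() returns null; this sketch
    // then falls back to the unmodified input.
    return $result === null ? $text : $result;
}

$file = "\xEF\xBB\xBF\xEF\xBB\xBF<!DOCTYPE html>";
echo strip_all_bom_codepoints($file), "\n";
```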

Solution 4

In PHP, b'\xef\xbb\xbf' is the literal 12-character string \xef\xbb\xbf, backslashes included, because escape sequences are not interpreted inside single quotes. If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted as bytes:

"\xef\xbb\xbf"

Your output also seems to contain more than a single leading BOM; in the dump below the sequence EF BB BF is repeated seven times before the doctype:

$ curl http://ircb.in/jisti/ | xxd

0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef  ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068  .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561  tml>.<html>.<hea
...
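
Both points can be illustrated with a short sketch. It first shows the difference between the two quoting styles and then strips any number of leading BOMs in one pass. The helper name is hypothetical, and the sample string only imitates the dump above.

```php
<?php
// Single quotes keep the backslashes: 12 literal characters.
var_dump(strlen('\xef\xbb\xbf'));
// Double quotes interpret the escapes: the 3 BOM bytes.
var_dump(strlen("\xef\xbb\xbf"));

// Hypothetical helper: removes one or more BOMs from the start of a string.
function remove_leading_utf8_boms($text)
{
    // The pattern is single-quoted on purpose: PCRE itself interprets
    // \xEF\xBB\xBF as the three raw bytes. "+" matches repeated BOMs.
    return preg_replace('/^(?:\xEF\xBB\xBF)+/', '', $text);
}

// Sample data imitating the dump: seven BOMs before the doctype.
$t = str_repeat("\xEF\xBB\xBF", 7) . "<!DOCTYPE html>\n<html>\n";
echo remove_leading_utf8_boms($t);
```

If the BOMs come from several included or concatenated files, stripping them from the final output hides the symptom; re-saving each source file without a BOM removes the cause.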

Solution 5

If you are importing a CSV file, the following code strips the BOM from the header row:

$header = fgetcsv($handle);
foreach($header as $key=> $val) {
     $bom = pack('H*','EFBBBF');
     $val = preg_replace("/^$bom/", '', $val);
     $header[$key] = $val;
}
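
Since a BOM sits at the very start of a file, only the first header field can carry it. A self-contained sketch of that idea follows, assuming made-up CSV data held in an in-memory stream so it runs without a file on disk:

```php
<?php
// Hypothetical CSV data with a leading BOM, written to an in-memory stream.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "\xEF\xBB\xBFid,name\n1,Alice\n");
rewind($handle);

// Delimiter, enclosure and escape are passed explicitly; 0 means no length limit.
$header = fgetcsv($handle, 0, ',', '"', '\\');

// Only the first field can start with the BOM, so clean just that one.
if ($header !== false && isset($header[0])) {
    $header[0] = preg_replace('/^(?:\xEF\xBB\xBF)+/', '', $header[0]);
}
fclose($handle);

print_r($header);
```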
Author by

sheppardzw

Updated on July 05, 2022

Comments

  • sheppardzw
    sheppardzw almost 2 years

    I'm using PHP 5 (CGI) to output template files from the filesystem, and I'm having issues outputting raw HTML.

    private function fetch($name) {
        $path = $this->j->config['template_path'] . $name . '.html';
        if (!file_exists($path)) {
            dbgerror('Could not find the template "' . $name . '" in ' . $path);
        }
        $f = fopen($path, 'r');
        $t = fread($f, filesize($path));
        fclose($f);
        if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
            $t = substr($t, 3);
        }
        return $t;
    }
    

    Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)

    Any idea how to fix this? o_o

  • sheppardzw
    sheppardzw about 12 years
    If I was using N++, why would it cause this? It's saving it as unix/utf8 -bom
  • Gromski
    Gromski about 12 years
    Save it as UTF-8 NO BOM (or whatever it's called in N++).
  • sheppardzw
    sheppardzw about 12 years
    I did and I'm still getting the same result. I curl'd the direct file (curl ircb.in/jisti/home.html | xxd) and got no leading characters, but curl'ing the PHP script adds the excess data in the front and all I'm using is print to output the data.
  • Artem Russakovskii
    Artem Russakovskii about 7 years
    For some reason in the Google+ API, this BOM shows up at the end of the content variable, so I needed to tweak this to remove it from the end of the string.
  • Priyath Gregory
    Priyath Gregory over 5 years
    Can someone explain how the pack function is used here? I know it converts a string to a binary representation, but I'm struggling to understand how this helps with identifying the BOM Unicode character.
  • Trevor
    Trevor over 5 years
    This worked great for my requirement to read the CSV output from SSRS and append to a larger file.
  • Scott
    Scott almost 4 years
    If they can show up more than once, you might want to use "/^(\xEF\xBB\xBF)+/".
  • Christopher Schultz
    Christopher Schultz about 3 years
    I used this with trim to cleanse copy/pasted form data like this: $bom = pack('H*','EFBBBF'); $replacementChars = " \n\r\t\v\0" . $bom; $cleanVar = trim($dirtyVar, $replacementChars);.
  • Dan
    Dan about 3 years
    @fsociety The BOM is three bytes - 0xef 0xbb 0xbf. So pack is using a format of H*, which means interpret the whole string as hexadecimal digits, two per byte. I prefer o1max's answer (although it has a lower score) that simply uses a string with escape characters: "\xEF\xBB\xBF"
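
The points raised in the comments can be checked with a short sketch: pack('H*', ...) and the double-quoted escape string produce the same bytes, a BOM at the end of a string can be removed with an end-anchored pattern, and trim() with the BOM in its character list works byte by byte. The helper name and sample strings are hypothetical.

```php
<?php
// pack('H*', ...) turns a hex string into raw bytes, so both spellings
// yield the same three-byte string.
$bom = pack('H*', 'EFBBBF');
var_dump($bom === "\xEF\xBB\xBF");

// Hypothetical helper: strips BOM sequences from the end of a string.
function remove_trailing_utf8_boms($text)
{
    // \z anchors at the very end of the subject.
    return preg_replace('/(?:\xEF\xBB\xBF)+\z/', '', $text);
}

echo remove_trailing_utf8_boms("content\xEF\xBB\xBF"), "\n";

// Caveat: trim() treats its second argument as a set of single bytes,
// not as a sequence, so it can cut a trailing "»" (bytes C2 BB) in half.
var_dump(trim("caf\xC2\xBB", $bom) === "caf\xC2");
```

For that reason the regular-expression helpers are the safer choice when the text may legitimately begin or end with other multi-byte characters.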