How to remove %EF%BB%BF in a PHP string
Solution 1
You could use substr to only get the rest without the UTF-8 BOM:
// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);
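If you are not certain the BOM is always there, a guarded version avoids cutting off real data. This is only a minimal sketch: strip_leading_utf8_bom is a hypothetical helper name, not a PHP built-in, and it assumes the BOM can only appear at the very start of the string.

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM, in raw or
// percent-encoded form, and leaves the string untouched otherwise.
function strip_leading_utf8_bom(string $data): string
{
    // Raw bytes EF BB BF
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        return substr($data, 3);
    }
    // Percent-encoded form; hex digits may be upper or lower case
    if (strncasecmp($data, '%EF%BB%BF', 9) === 0) {
        return substr($data, 9);
    }
    return $data;
}
```

Usage would be $data = strip_leading_utf8_bom($data); a string without a BOM comes back unchanged.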
Solution 2
You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.
The reasons:
- In UTF-8, a BOM is optional, so if the service stops sending it at some future point you'll be throwing away the first three bytes of your response instead.
- The purpose of the BOM is to identify unambiguously which UTF encoding is in use (UTF-8, UTF-16 or UTF-32) and, for UTF-16 and UTF-32, to indicate the 'endianness' (byte order) of the encoded data. If you just throw it away you're assuming that you're always getting UTF-8; this may not be a very good assumption.
- Not all BOMs are three bytes long; only the UTF-8 one is. The UTF-16 BOM is two bytes and the UTF-32 BOM is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.
I think a more appropriate way to handle this would be something like:
/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
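Since the BOM itself names the encoding, another option is to match it against the known signatures rather than rely on mb_detect_encoding. The following is a rough sketch, not a definitive implementation: decode_with_bom and its $fallback parameter are hypothetical names, it converts to UTF-8 rather than ASCII (so no characters are lost), and it treats the bytes FF FE 00 00 as a UTF-32LE BOM, which is the usual convention but technically ambiguous with UTF-16LE followed by a NUL character.

```php
<?php
// Hypothetical helper: detect a BOM, drop it, and convert the rest to UTF-8.
function decode_with_bom(string $data, string $fallback = 'UTF-8'): string
{
    // Longer signatures come first because the UTF-32LE BOM
    // begins with the same two bytes as the UTF-16LE BOM.
    $signatures = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];

    foreach ($signatures as $bom => $encoding) {
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            // Strip exactly as many bytes as this BOM occupies.
            return mb_convert_encoding(substr($data, strlen($bom)), 'UTF-8', $encoding);
        }
    }

    // No BOM found: the fallback encoding is an assumption, not a detection.
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}
```

With this, a response that loses its BOM or switches to UTF-16 is still decoded instead of being silently truncated.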
Solution 3
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
if (substr($data, 0, 3) == "\xef\xbb\xbf") {
    $data = substr($data, 3);
}
Solution 4
It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.
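As a small illustration of "remove it, then treat the rest as UTF-8", something like the following could work. It is a sketch under assumptions: utf8_body is a hypothetical helper name, and throwing an exception on invalid input is just one possible way to handle it.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM, then confirm the
// remainder really is valid UTF-8 before it is used any further.
function utf8_body(string $data): string
{
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $data = substr($data, 3);
    }
    if (!mb_check_encoding($data, 'UTF-8')) {
        throw new UnexpectedValueException('Response is not valid UTF-8');
    }
    return $data;
}
```

After this, the multibyte functions should be given the encoding explicitly, for example mb_strlen($body, 'UTF-8'), so that characters rather than bytes are counted.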
bbnn
Updated on July 18, 2022
Comments
-
bbnn almost 2 years
I am trying to use the Microsoft Bing API.
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
The data returned has a ' ' character as the first character of the returned string. It is not a space, because I trimmed it before returning the data.
The ' ' character turned out to be %EF%BB%BF.
I wonder why this happened, maybe a bug from Microsoft?
How can I remove this %EF%BB%BF in PHP?
-
Lee over 13 years
Note: generally speaking, throwing away the BOM is not a good idea. The BOM is there to tell you how the rest of the string should be handled. If you just ignore it, assuming that it's a UTF-8 3-byte BOM, you're setting yourself up for some real problems if/when the encoding ever changes. ... Please have a look at my answer below for more details.
-
crdx over 11 years
To future googlers: use this solution instead. Throwing away the BOM is a bad idea.
-
mpen over 10 years
This doesn't appear to work in practice. mb_convert_encoding("\357\273\277some text", 'ASCII', mb_detect_encoding("\357\273\277some text")) yields string(10) "?some text". Notice that it left a question mark in the output.
-
Lee over 10 years
@mark Unfortunately, that does appear to be true. I had better luck using iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', "\357\273\277some text") to do the conversion. I guess mb_detect_encoding would be used to detect the initial charset, which would then be passed as the first arg to iconv. This is more of a hack than it should be.
-
naw103 over 9 years
@mark I had to add the following line to get rid of the "?": ini_set('mbstring.substitute_character', "none");
-
Gkiokan over 7 years
This solution is great! It helped me a lot when I couldn't find the answer. The best part of the logic is the condition: check whether the BOM bytes exist, and only then delete them. This code seems future-safe: even if the server stops sending the BOM, it will still work. +1