How to replace Microsoft-encoded quotes in PHP
Solution 1
Considering you only want to replace a few specific, well-identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery that regex would bring ;-)
And if you encounter other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever they are identified.
The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP
And the associated code (quoting that page):
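function convert_smart_quotes($string)
{
    // Single-byte codes that Windows-1252 uses for Word's punctuation
    $search = array(chr(145),  // left single quote
                    chr(146),  // right single quote
                    chr(147),  // left double quote
                    chr(148),  // right double quote
                    chr(151)); // em dash
    $replace = array("'",
                     "'",
                     '"',
                     '"',
                     '-');
    return str_replace($search, $replace, $string);
}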
(I don't have Microsoft Word on this computer, so I can't test by myself)
I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...
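One thing worth checking before relying on the chr() codes: they are single bytes, so they only match when the input is in a single-byte encoding such as Windows-1252. The snippet below is a small sketch of that, not part of the quoted page; the variable names are made up for the illustration. It runs the same search and replace arrays against a Windows-1252 string and against the same text stored as UTF-8.

```php
<?php
// chr(146) is the single byte 0x92: the right single quote in Windows-1252.
$cp1252Quote = chr(146);
// The same character (U+2019) stored as UTF-8 takes three bytes.
$utf8Quote = "\xE2\x80\x99";

$search  = array(chr(145), chr(146), chr(147), chr(148), chr(151));
$replace = array("'", "'", '"', '"', '-');

// Windows-1252 input: the byte is found and replaced.
$fromCp1252 = str_replace($search, $replace, "it" . $cp1252Quote . "s");
// UTF-8 input: none of the search bytes occur here, so nothing changes.
$fromUtf8 = str_replace($search, $replace, "it" . $utf8Quote . "s");

echo bin2hex($cp1252Quote), "\n"; // hex of the Windows-1252 byte
echo bin2hex($utf8Quote), "\n";   // hex of the UTF-8 byte sequence
echo $fromCp1252, "\n";
var_dump($fromUtf8 === "it" . $utf8Quote . "s");
```

If the second string comes back unchanged on your data too, the input is probably UTF-8 and the search array needs the multi-byte sequences instead.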
Solution 2
I have found an answer to this question. You need just one line of code using the iconv() function in PHP:
// replace Microsoft Word's single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
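A slightly more defensive version of that one-liner might look like the sketch below. The function name, the $from parameter and the "return the input unchanged" fallback are my own assumptions, not part of the answer. It covers two things reported in the comments: the source encoding may be CP1252 rather than UTF-8, and the iconv extension may not be installed. Keep in mind that ASCII//TRANSLIT targets plain ASCII, so accented letters are transliterated too, and the exact result depends on the iconv implementation on the machine.

```php
<?php
// Hypothetical wrapper around the one-liner; name and fallback are assumptions.
function toAsciiQuotes(string $input, string $from = 'UTF-8'): string
{
    // iconv is an extension and may be missing on some hosts.
    if (!function_exists('iconv')) {
        return $input;
    }
    // iconv() raises a notice and returns false when $input contains
    // bytes that are illegal in $from; @ keeps the notice out of the output.
    $output = @iconv($from, 'ASCII//TRANSLIT', $input);

    // Fall back to the untouched input instead of returning false.
    return $output === false ? $input : $output;
}

// Usage: pass 'CP1252' when the text really is Windows-1252.
echo toAsciiQuotes('plain text'), "\n";
echo toAsciiQuotes("\x93quoted\x94", 'CP1252'), "\n";
```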
Solution 3
Your Microsoft-encoded quotes are probably the typographic quotation marks. You can simply replace them with str_replace if you know the encoding of the string in which you want to replace them.
Here’s an example for UTF-8, using a single mapping array with strtr:
$quotes = array(
"\xC2\xAB" => '"', // « (U+00AB) in UTF-8
"\xC2\xBB" => '"', // » (U+00BB) in UTF-8
"\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
"\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
"\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
"\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
"\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
"\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
"\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
"\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
"\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
"\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);
If you need another encoding, you can use mb_convert_encoding to convert the keys.
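Converting the keys could be sketched roughly like this. The helper name convertQuoteMap is hypothetical, and the round-trip check is my own addition: some of these characters (U+201B, for example) do not exist in Windows-1252, and mbstring substitutes them, so such keys are skipped rather than turned into a bogus replacement rule.

```php
<?php
// Hypothetical helper: re-encode the UTF-8 keys of a strtr() map.
function convertQuoteMap(array $utf8Map, string $to): array
{
    $converted = array();
    foreach ($utf8Map as $utf8Key => $ascii) {
        $key = mb_convert_encoding($utf8Key, $to, 'UTF-8');
        // Skip characters the target encoding cannot represent:
        // after substitution the round trip no longer matches the original.
        if (mb_convert_encoding($key, 'UTF-8', $to) !== $utf8Key) {
            continue;
        }
        $converted[$key] = $ascii;
    }
    return $converted;
}

// A shortened map for the demonstration.
$quotes = array(
    "\xE2\x80\x9C" => '"', // U+201C, byte 0x93 in Windows-1252
    "\xE2\x80\x9D" => '"', // U+201D, byte 0x94 in Windows-1252
    "\xE2\x80\x9B" => "'", // U+201B, not available in Windows-1252
);

$cp1252Quotes = convertQuoteMap($quotes, 'Windows-1252');
echo strtr("\x93quoted\x94", $cp1252Quotes), "\n";
```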
Solution 4
If like me you arrive here with an enormous range of broken ASCII / Microsoft Word characters that are doing weird things to your CMS or RTE and iconv isn't working, then this mad function might just be for you.
Make sure your encoding is UTF-8 when you save this function to a file.
<?php
/**
* fixMSWord
*
* Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
* correctly map and will be replaced by spaces.
*
* @author Robin Cafolla
* @date 2013-03-22
*/
function fixMSWord($string) {
$map = Array(
'33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*',
'43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4',
'53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
'63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H',
'73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R',
'83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
'93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f',
'103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p',
'113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
'123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '€', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"',
'133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ',
'143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
'153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢',
'163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬',
'173'=> '', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
'183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À',
'193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê',
'203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
'213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ',
'223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è',
'233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
'243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü',
'253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ'
);
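// Build one byte => replacement table. strtr() scans the input once, so the
// multi-byte UTF-8 replacements are not rewritten by later entries, which
// str_replace() with arrays would do because it applies each pair in turn.
$table = Array();
foreach ($map as $s => $r) {
$table[chr((int)$s)] = $r;
}
return strtr($string, $table);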
}
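Because a byte map like this treats every byte on its own, it should only ever see single-byte input. One possible guard, sketched below under my own assumptions (the name fixUnlessUtf8 and the stand-in fixer are hypothetical), is to leave a string alone when it is already well-formed UTF-8 and run the byte fixer only otherwise. Note the limits: UTF-8 text that contains curly quotes passes through unchanged, and a short Windows-1252 string can occasionally happen to be valid UTF-8.

```php
<?php
// Hypothetical guard: run a single-byte fixer only on non-UTF-8 input.
function fixUnlessUtf8(string $string, callable $fixer): string
{
    // Pure ASCII and well-formed UTF-8 pass through untouched.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    return $fixer($string);
}

// Stand-in fixer for the demonstration; any byte-map function would do.
$fixer = function (string $s): string {
    return strtr($s, array(
        chr(145) => "'", chr(146) => "'",
        chr(147) => '"', chr(148) => '"',
    ));
};

// Windows-1252 input is not valid UTF-8, so it gets fixed.
echo fixUnlessUtf8("\x93legacy\x94", $fixer), "\n";
// UTF-8 input is returned byte for byte.
echo bin2hex(fixUnlessUtf8("\xE2\x80\x9Cutf8\xE2\x80\x9D", $fixer)), "\n";
```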
Solution 5
We used the following. It deals with a few more special characters.
$text = str_replace(chr(130), ',', $text); // Baseline single quote
$text = str_replace(chr(132), '"', $text); // Baseline double quote
$text = str_replace(chr(133), '...', $text); // Ellipsis
$text = str_replace(chr(145), "'", $text); // Left single quote
$text = str_replace(chr(146), "'", $text); // Right single quote
$text = str_replace(chr(147), '"', $text); // Left double quote
$text = str_replace(chr(148), '"', $text); // Right double quote
$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
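For what it's worth, here is how the same idea could look as one function. This is a sketch under assumptions, not what we ran: it assumes the input is Windows-1252 (which is what the chr() codes imply), the function name is made up, and it swaps the 'HTML-ENTITIES' conversion for mb_encode_numericentity(), since as far as I know the 'HTML-ENTITIES' pseudo-encoding is deprecated in mbstring as of PHP 8.2.

```php
<?php
// Hypothetical helper; assumes $text is Windows-1252 encoded.
function cp1252ToAsciiEntities(string $text): string
{
    // One pass over the Word punctuation bytes.
    $text = strtr($text, array(
        chr(130) => ',',   // baseline single quote
        chr(132) => '"',   // baseline double quote
        chr(133) => '...', // ellipsis
        chr(145) => "'",   // left single quote
        chr(146) => "'",   // right single quote
        chr(147) => '"',   // left double quote
        chr(148) => '"',   // right double quote
    ));

    // Re-encode whatever is left as UTF-8 ...
    $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');

    // ... and turn every non-ASCII character into a numeric entity.
    // Map format: start code, end code, offset, mask.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($text, $map, 'UTF-8');
}

echo cp1252ToAsciiEntities("\x93caf\xE9\x94"), "\n";
```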
Misha M
Updated on July 10, 2020
Comments
-
Misha M almost 4 years
I need to replace Microsoft Word's version of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ") due to an encoding issue in my application. I do not need them to be HTML entities and I cannot change my database schema.
I have two options: to use either a regular expression or an associative array.
Is there a better way to do this?
-
Misha M over 14 years
How would you specify the MS characters?
-
Misha M over 14 years
This is what I was looking for. Thanks. The search array did not work as is; I ended up using the hex version that was provided in the comments on the link you gave above.
-
dotty over 14 years
The '&' sign copied from MS Word doesn't encode properly. Is there any way we can use this snippet to encode that to '&amp;' (as well as bullets and other chars)?
-
R.. GitHub STOP HELPING ICE over 13 years
Rather than the ugly \x escapes, couldn't you simply include the literal characters in your source file?
-
Gumbo over 13 years
@R..: That’s the problem: There are many that don’t know enough about character encodings and/or what character encoding they’re using.
-
Drewid over 12 years
Worked a charm, thanks. Gotta love importing Excel spreadsheets into MySQL :S +1
-
Justin Dominic almost 12 years
if my answer helped u can u upvote my original question stackoverflow.com/questions/6597268/…
-
Blazemonger almost 12 years
For other users: You might look for chr(149) (bullet) and replace it with an asterisk as well. This page has a list of several chr() characters you might want to convert.
-
Eric Kigathi almost 12 years
Thanks, however in my case I needed to pick the right character encoding (which was CP1252 and not UTF-8):
$output = iconv('CP1252', 'ASCII//TRANSLIT', $input);
-
Justin Dominic over 11 years
@eric good to know you used your mind on it for others. thanks for sharing :)
-
PHP Connect about 11 years
Can I use this for my project, since it is under the MIT license?
-
thelastshadow about 11 years
In general the MIT licence lets you use it in whatever way you like, so long as you don't remove the licence :)
-
Ben Sinclair over 10 years
Yep this worked for me. I'd recommend this over the accepted answer :)
-
Ngoc Pham about 10 years
This works for me while the accepted answer doesn't. I would like to change this one to accepted answer.
-
JMTyler about 10 years
You decided to put a license on what essentially amounts to... an array?
-
thelastshadow about 10 years
I just copied the code out of the file it was in and pasted it here. I try to put open licences on as much of the code I write as possible, even when all it amounts to is a useful array.
-
NobleUplift about 10 years
It doesn't matter what license you place inside an answer, all user content is licensed under cc by-sa 3.0 with attribution required. You can see this in the footer. This code is no longer under the MIT license.
-
thelastshadow about 10 years
Nah, it's dual licenced. Use whichever you feel more comfortable with.
-
NobleUplift about 10 years
But are the Microsoft quotes Unicode code points or CP1252 code points? If the latter, this solution will not work. Actually, it will throw a notice: PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1.
-
NobleUplift about 10 years
Also, I should point out that this function is not "fixing ASCII". There are no ASCII characters above 127. The only thing I can see this function doing is mangling Unicode strings.
-
NobleUplift about 10 years
You don't check the encoding of the string first, so this function will mangle certain Unicode passed into it.
-
NobleUplift about 10 years
You should check the encoding of the string $text before you run replaces in it. It could already be a Unicode string and you are mangling it.
-
thelastshadow about 10 years
@NobleUplift I have renamed it to fixMSWord. I'd agree that it does mangle, but if you have the problem this function fixes it does the job, and I've yet to find another solution.
-
marcvangend over 9 years
This worked for me when the accepted answer, for some reason, did not (probably a UTF-8 thing). Thanks.
-
Gumbo over 9 years
@marcvangend The accepted answer does not expect UTF-8 but some other single-byte character encoding.
-
eljamz about 7 years
Worked flawlessly, thank you! I'm on the UTF-8 charset, files encoded and also utf8-bin in the database... thanks!
-
gorillagoat almost 6 years
After tearing my hair out trying to figure out my encoding issues, this eventually was the ticket for me. I used this (php.net/manual/en/function.chr.php) to extend your function for my own purposes - scroll halfway down to the example posted by Josh B.
-
Neek over 5 years
This seems a good (or, lazy) solution to my problem in Zen Cart: customers entering curly quotes in their names when signing up, and ZC stores first and last name in the PHP session, which then fails to decode with a "PHP Warning: session_start(): Failed to decode session object. Session has been destroyed" message. I'm going to work around it by stripping the strings with iconv before saving them to the database during account creation.
-
Neek over 5 years
WARNING: iconv is a PHP extension and may not be installed on your production environment! "Fatal error: Call to undefined function iconv()" Be sure to test your code on every platform it needs to run on.
-
Cutis over 4 years
This is a good trick, but I notice this solution removes special characters like é, è, à, â and others. Any solution to avoid the issue?
-
Professor Zoom about 2 years
January 2022: this worked for me in PHP 5.6