iconv gives "Illegal Character" with smart quotes -- how to get rid of them?


Solution 1

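glibc and GNU libiconv both support the //TRANSLIT and //IGNORE suffixes on the target charset: //TRANSLIT approximates characters that have no equivalent in the target, and //IGNORE drops them.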

Thus, on Linux, this works just fine:

$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'

I'm not sure which iconv implementation PHP uses, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
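As a rough illustration of the same idea from PHP, the sketch below passes the //TRANSLIT suffix to iconv(). The helper name utf8_to_latin1 is made up for this example, and whether the suffix is honoured depends on the iconv implementation PHP was built against (and, for some characters, on the current locale), so treat it as something to try rather than a guaranteed fix.

```php
<?php
// Sketch: convert UTF-8 to ISO-8859-1, asking iconv to approximate any
// character that has no ISO-8859-1 equivalent (such as smart quotes).
// utf8_to_latin1 is a hypothetical helper name, not part of PHP.
function utf8_to_latin1($value)
{
    // "@" hides the notice iconv() raises on failure; the return value is
    // checked instead. Some iconv implementations reject "//TRANSLIT".
    $converted = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $value);
    return $converted === false ? null : $converted;
}

$name   = "L\xE2\x80\x99\xC3\xA9t\xC3\xA9";  // "L’été" written as UTF-8 bytes
$latin1 = utf8_to_latin1($name);

// Dump the result as hex so the output does not depend on the terminal encoding.
var_dump($latin1 === null ? null : bin2hex($latin1));
```

If the helper returns null, the installed iconv most likely does not accept the suffix, and a manual replacement step is needed instead.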

Solution 2

What do you mean by "link-friendly"? The only reading that makes sense to me, since the text between <a>...</a> tags can be anything, is "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].

If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.

If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii function that I assume does something like what you need.

(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because the character has no direct representation in the destination character set. So, never mind.)
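When no extension or library is available, the URL-friendly conversion can be approximated in plain PHP. The sketch below is only an illustration: the function name url_friendly and the small replacement map are assumptions made for this example, and a real map would need an entry for every accented character in the data. It relies on strtr() with an array, which applies all replacements in a single call, instead of one str_replace() call per character.

```php
<?php
// Illustrative map only: extend it with every character the data contains.
// Keys are UTF-8 byte sequences, written as escapes so the example does not
// depend on the encoding of this source file.
$map = array(
    "\xE2\x80\x98" => "'",  // U+2018 left single quote
    "\xE2\x80\x99" => "'",  // U+2019 right single quote
    "\xE2\x80\x9C" => '"',  // U+201C left double quote
    "\xE2\x80\x9D" => '"',  // U+201D right double quote
    "\xC3\xA0" => 'a',      // à
    "\xC3\xA9" => 'e',      // é
    "\xC3\xA8" => 'e',      // è
    "\xC3\xA7" => 'c',      // ç
    "\xC3\x89" => 'E',      // É
);

// url_friendly is a hypothetical helper name used for this sketch.
function url_friendly($value, array $map)
{
    $ascii = strtr($value, $map);                        // all replacements in one call
    $ascii = strtolower($ascii);
    $slug  = preg_replace('/[^a-z0-9]+/', '-', $ascii);  // anything else becomes "-"
    return trim($slug, '-');
}

// "L’été à Paris" as UTF-8 bytes; prints l-ete-a-paris
echo url_friendly("L\xE2\x80\x99\xC3\xA9t\xC3\xA9 \xC3\xA0 Paris", $map), "\n";
```

Any character missing from the map ends up as a hyphen, so gaps in the map show up in the output rather than triggering a conversion error.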

Author: Andrew Swift


Updated on June 18, 2022

Comments

  • Andrew Swift
    Andrew Swift almost 2 years

    I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).

    Since PHP doesn't directly handle UTF-8, I'm using:

    $value = iconv ('UTF-8', 'ISO-8859-1', $value);
    

    to convert the name to ISO-8859-1, followed by a massive strtr statement to replace each accented character with its unaccented equivalent (à becomes a, for example).

    However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:

    Unknown error type: [8]
    
    iconv() [function.iconv]: Detected an illegal character in input string
    

    To get rid of the smart quotes before using iconv, I have tried using three statements like:

    $value = str_replace('’', "'", $value);
    

    (’ is the raw value of a UTF-8 smart single quote)

    Because the text file is so long, these str_replace calls cause the script to time out every single time.

    1. What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?

    2. Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?

  • Andrew Swift
    Andrew Swift almost 15 years
    I started out using str_replace to replace the offending strings, but it slowed the script down too much ($value = str_replace('’', "'", $value); where ’ is the ASCII representation of the offending smart single quote). Can you clarify what you mean by CONCAT on CHAR calls?
  • Andrew Swift
    Andrew Swift almost 15 years
    I changed the question to read url-friendly. I can't add extensions to PHP. I checked out the translit library you suggest, but it was about 35% slower than my original solution.
  • Alex Martelli
    Alex Martelli almost 15 years
    I suggested doing the REPLACE in SQL, and using CONCAT(CHAR(...), ...) to compose the substring you're trying to replace, byte by byte.
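To make the CONCAT(CHAR(...), ...) suggestion concrete, here is a hedged PHP sketch that only builds the SQL text and does not run it. The helper sql_bytes_expr, the table name products, and the column name product_name are all placeholders invented for this example, and the statement has not been tested against a real MySQL server, where charset and collation settings may affect how the byte string is compared.

```php
<?php
// Hypothetical helper: render a byte string as CONCAT(CHAR(b1),CHAR(b2),...)
// so the substring to replace is spelled out byte by byte in SQL.
function sql_bytes_expr($bytes)
{
    $parts = array();
    foreach (str_split($bytes) as $byte) {
        $parts[] = 'CHAR(' . ord($byte) . ')';
    }
    return 'CONCAT(' . implode(',', $parts) . ')';
}

// UTF-8 bytes of U+2019, the right single quotation mark.
$needle = sql_bytes_expr("\xE2\x80\x99");

// "products" and "product_name" are placeholder names, not from the question.
// '''' is the SQL literal for a single apostrophe.
$sql = "UPDATE products SET product_name = REPLACE(product_name, $needle, '''')";

echo $sql, "\n";
// prints: UPDATE products SET product_name = REPLACE(product_name, CONCAT(CHAR(226),CHAR(128),CHAR(153)), '''')
```

Doing the replacement inside the database avoids pulling all 120,000 rows through the PHP script, which is the step that was timing out.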