Bullet "•" in XML

XML by definition has no illegal characters. If a string contains a character that XML does not allow, then that string is not XML by definition.

The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name the character you are after: • is U+2022 BULLET.

You say it renders as "â€¢". Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a fairly safe assumption that you take a UTF-8 encoded string (UTF-8 is the default encoding in XML, by the way) but tell the software that renders it to treat it as some single-byte encoding, which turns the single code point into three different characters: â (0xE2), € (0x80) and ¢ (0xA2).
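A minimal sketch of that misreading, assuming PHP 7 or later with the iconv extension available (the variable names are only illustrative). It takes the three UTF-8 bytes of U+2022 and deliberately decodes them as Windows-1252:

```php
<?php
// The bullet U+2022 as UTF-8: one code point, three bytes.
$bullet = "\u{2022}";
$bytes = bin2hex($bullet); // "e280a2"

// Deliberately misread those bytes as Windows-1252 and convert the result
// to UTF-8. Each of the three bytes becomes a character of its own.
$misread = iconv('Windows-1252', 'UTF-8', $bullet);

// Reversing the conversion restores the original three bytes, which shows
// that the data itself was never damaged, only mislabeled.
$restored = iconv('UTF-8', 'Windows-1252', $misread);
```

Here `$misread` holds the three characters â, € and ¢, while `$restored` is the single bullet again.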

Instead, you need to tell the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where the wrong encoding is introduced; you will likely not need to re-encode the data, only to declare its encoding properly.
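As a rough sketch of what declaring the encoding can look like, assume the consumer is a PHP script that prints the text into an HTML page. The helper names `is_valid_utf8` and `render_utf8_page` and the sample text are hypothetical, not taken from the question. The check relies on PCRE's `u` modifier, which rejects byte sequences that are not valid UTF-8:

```php
<?php
// Hypothetical helper: true when the bytes form valid UTF-8.
// An empty pattern with the "u" modifier fails on malformed input.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: wraps text in a page that declares UTF-8.
function render_utf8_page(string $text): string
{
    // Passing 'UTF-8' explicitly keeps multi-byte characters intact.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head>"
         . "<body><p>{$safe}</p></body></html>";
}

// Sample text standing in for a value read from the XML (an assumption).
$text = "\u{2022} first point";

// If the data is already valid UTF-8, no re-encoding is needed.
$needsNoConversion = is_valid_utf8($text);

// When serving over HTTP, also declare the charset in the header,
// before any output is sent:
// header('Content-Type: text/html; charset=utf-8');
$page = render_utf8_page($text);
```

If the check fails for real data, the problem is in the data itself, not only in how it is declared, and a conversion step may be needed after all.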

If you wonder which single-byte character encodings have these three Unicode characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list. The most popular of these is Windows 1252:

  • ISO-8859-15 (Latin 9)
  • OEM 858 (Multilingual Latin I + Euro)
  • Windows 1252 (Latin I)
  • Windows 1254 (Turkish)
  • Windows 1256 (Arabic)
  • Windows 1258 (Vietnam)
Author by

TecBrat

Updated on April 21, 2020

Comments

  • TecBrat, about 4 years ago

    Similar to this question I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.

    The problem is that it contains a bullet. It renders as "â€¢" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So I tried the code below, and it seems that str_replace does not recognize the "â€¢" (it renders as a tall block in my text editor). You can see the commented lines where I tried a few different things.

    I tried str_replace on "â€¢" first, then tweaked things, and this is my latest attempt:

    // deal with bullets in XML.
    $bullet="•"; //this was copied and pasted from transliterated text.
    //$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
    //$data=str_replace($bullet,'•',$data); // replace the bullet char
    $data=str_replace($bullet,' - ',$data); // replace the bullet char
    //$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.
    

    Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.