PHP DOM UTF-8 problem

Solution 1

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

  • Takes the charset specified in a meta http-equiv for "content-type".
  • Otherwise assumes ISO-8859-1
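
The two rules above can be observed with a small sketch. It assumes the dom extension is loaded; the variable names are illustrative, and the bytes produced in the undeclared case depend on the libxml2 build, so no output is shown.

```php
<?php
// "á" as UTF-8 bytes, written as escapes so the script's own encoding is irrelevant.
$fragment = "<p>\xC3\xA1</p>";

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Case 1: no charset declared, so the parser falls back to its default.
$plain = new DOMDocument();
$plain->loadHTML($fragment);
$plainText = $plain->getElementsByTagName('p')->item(0)->textContent;

// Case 2: a meta http-equiv declares UTF-8 ahead of the fragment.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';
$declared = new DOMDocument();
$declared->loadHTML($meta . $fragment);
$declaredText = $declared->getElementsByTagName('p')->item(0)->textContent;

// Hex dumps make any difference visible regardless of terminal encoding.
echo "undeclared: ", bin2hex($plainText), "\n";
echo "declared:   ", bin2hex($declaredText), "\n";
```

If the declared case prints the same two bytes that went in (c3 a1) while the undeclared case prints something longer, the parser has re-encoded each input byte as a separate character.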

I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.

EDIT: The suggestion above is not very good, because Windows-1250 has characters that do not exist in ISO-8859-1. Since you're dealing with fragments that have no meta element for content-type, you can add your own to force interpretation as UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Convert to UTF-8 and prepend a meta header to force UTF-8 interpretation */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new DOMDocument();
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
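
The same idea can be wrapped in a reusable function. This is only a sketch: `loadWin1250Fragment` is a hypothetical name, and it assumes the stored fragment really is Windows-1250 and that the iconv and dom extensions are available.

```php
<?php
/**
 * Hypothetical helper: load a Windows-1250 HTML fragment into a DOMDocument
 * so that its text nodes come out as UTF-8.
 */
function loadWin1250Fragment($fragment)
{
    // Convert the stored bytes to UTF-8 first.
    $utf8 = iconv('Windows-1250', 'UTF-8', $fragment);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid Windows-1250');
    }

    // Prepend the charset declaration so the HTML parser reads the bytes as UTF-8.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

    // Silence libxml warnings for the load only, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($meta . $utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// "<p>Ľ á</p>" in Windows-1250: Ľ is byte 0xBC and á is byte 0xE1.
$doc = loadWin1250Fragment("<p>\xBC \xE1</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($text), "\n";
```

Restoring the previous libxml error setting keeps the helper from changing error handling for the rest of the application.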

Solution 2

Two solutions.

You can either set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EDIT: in the event that both of these are set correctly, do the following:

  • Create a small page that has a UTF-8 character in it.
  • Output the page using the same method you already use.
  • Use Fiddler or Wireshark to examine the raw bytes being transferred in your DEV and PROD environments. You can also double check the headers using Fiddler/Wireshark.

If you are confident that the correct header is being sent, then your best chance of finding the error is to start looking at raw bytes. Identical bytes sent to an identical browser will yield the same result, so you need to start looking for why they are not identical. Fiddler/Wireshark will help with that.
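
Before reaching for a network capture, the bytes can also be inspected from PHP itself. The following is a minimal sketch; `describeBytes` is a hypothetical helper, and it assumes the mbstring extension is installed for `mb_check_encoding`.

```php
<?php
// Hypothetical helper: describe a string by its raw bytes.
function describeBytes($label, $value)
{
    return sprintf(
        "%s: %d bytes, hex=%s, valid UTF-8=%s",
        $label,
        strlen($value),
        bin2hex($value),
        mb_check_encoding($value, 'UTF-8') ? 'yes' : 'no'
    );
}

$utf8 = "\xC3\xA1";   // "á" encoded as UTF-8
$win1250 = "\xE1";    // "á" encoded as Windows-1250

echo describeBytes('utf-8', $utf8), "\n";
echo describeBytes('windows-1250', $win1250), "\n";
```

Running the same dump on the string right before it is echoed in DEV and in PROD shows whether the two environments already differ before anything reaches the browser.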

Author: Richard Knop

Updated on June 04, 2022

Comments

  • Richard Knop almost 2 years

    First of all, my database uses Windows-1250 as its native charset. I am outputting the data as UTF-8. I'm using the iconv() function all over my website to convert Windows-1250 strings to UTF-8 strings, and it works perfectly.

    The problem appears when I use PHP DOM to parse HTML stored in the database (the HTML is output from a WYSIWYG editor and is not a complete document; it has no html, head, or body tags).

    The HTML could look something like this, for example:

    <p>Hello</p>
    

    Here is the method I use to parse the HTML from the database:

     private function ParseSlideContent($slideContent)
     {
      var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML correctly, with all special characters
    
      $doc = new DOMDocument('1.0', 'UTF-8');
    
      // hack to preserve UTF-8 characters
      $html = iconv('Windows-1250', 'UTF-8', $slideContent);
      $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
      $doc->preserveWhiteSpace = false;
    
      foreach($doc->getElementsByTagName('img') as $t) {
       $path = trim($t->getAttribute('src'));
       $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
      }
      foreach ($doc->getElementsByTagName('object') as $o) {
       foreach ($o->getElementsByTagName('param') as $p) {
        $path = trim($p->getAttribute('value'));
        $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
       }
      }
      foreach ($doc->getElementsByTagName('embed') as $e) {
       if (true === $e->hasAttribute('pluginspage')) {
        $path = trim($e->getAttribute('src'));
        $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
       } else {
        // end() takes its argument by reference, so store the array in a variable first
        $parts = explode('data/media/video/', trim($e->getAttribute('src')));
        $path = end($parts);
        $path = 'data/media/video/' . $path;
        $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
        $width = $e->getAttribute('width') . 'px';
        $height = $e->getAttribute('height') . 'px';
        $a = $doc->createElement('a', '');
        $a->setAttribute('href', $path);
        $a->setAttribute('style', "display:block;width:$width;height:$height;");
        $a->setAttribute('class', 'player');
        $e->parentNode->replaceChild($a, $e);
        $this->slideContainsVideo = true;
       }
      }
    
      $html = trim($doc->saveHTML());
    
      $html = explode('<body>', $html);
      $html = explode('</body>', $html[1]);
      return $html[0];
     }
    

    The output of the method above is garbage: all special characters are replaced with sequences like Ú�.

    One more thing. It does work on my development server.

    It does not work on the production server though.

    Any suggestions?

    PHP version of the production server: PHP Version 5.2.0RC4-dev

    PHP version of the development server: PHP Version 5.2.13


    UPDATE:

    I'm working on a solution myself. I took inspiration from this PHP bug report (not really a bug, though): http://bugs.php.net/bug.php?id=32547

    This is my proposed solution. I will try it tomorrow and let you know if it works:

     private function ParseSlideContent($slideContent)
     {
      var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML correctly, with all special characters
    
      $doc = new DOMDocument('1.0', 'UTF-8');
    
      // hack to preserve UTF-8 characters
      $html = iconv('Windows-1250', 'UTF-8', $slideContent);
      $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
      $doc->preserveWhiteSpace = false;
    
      // this might work
      // it basically just adds head and meta tags to the document
      $html = $doc->getElementsByTagName('html')->item(0);
      $head = $doc->createElement('head', '');
      $meta = $doc->createElement('meta', '');
      $meta->setAttribute('http-equiv', 'Content-Type');
      $meta->setAttribute('content', 'text/html; charset=utf-8');
      $head->appendChild($meta);
      $body = $doc->getElementsByTagName('body')->item(0);
      $html->removeChild($body);
      $html->appendChild($head);
      $html->appendChild($body);
    
      foreach($doc->getElementsByTagName('img') as $t) {
       $path = trim($t->getAttribute('src'));
       $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
      }
      foreach ($doc->getElementsByTagName('object') as $o) {
       foreach ($o->getElementsByTagName('param') as $p) {
        $path = trim($p->getAttribute('value'));
        $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
       }
      }
      foreach ($doc->getElementsByTagName('embed') as $e) {
       if (true === $e->hasAttribute('pluginspage')) {
        $path = trim($e->getAttribute('src'));
        $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
       } else {
        // end() takes its argument by reference, so store the array in a variable first
        $parts = explode('data/media/video/', trim($e->getAttribute('src')));
        $path = end($parts);
        $path = 'data/media/video/' . $path;
        $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
        $width = $e->getAttribute('width') . 'px';
        $height = $e->getAttribute('height') . 'px';
        $a = $doc->createElement('a', '');
        $a->setAttribute('href', $path);
        $a->setAttribute('style', "display:block;width:$width;height:$height;");
        $a->setAttribute('class', 'player');
        $e->parentNode->replaceChild($a, $e);
        $this->slideContainsVideo = true;
       }
      }
    
      $html = trim($doc->saveHTML());
    
      $html = explode('<body>', $html);
      $html = explode('</body>', $html[1]);
      return $html[0];
     }