Can Goutte/Guzzle be forced into UTF-8 mode?

Solution 1

The issue is actually with symfony/browser-kit and symfony/dom-crawler. BrowserKit's Client does not examine the HTML meta tags to determine the charset; it looks only at the Content-Type header. When the response body is handed over to the DomCrawler, it is treated as the default charset, ISO-8859-1. Once the meta tags have been examined, that decision should be reverted and the DomDocument rebuilt, but that never happens.

The easy workaround is to wrap $crawler->text() with utf8_decode():

$text = utf8_decode($crawler->text());

This works if the input is UTF-8. For other encodings, something similar can probably be achieved with iconv(). However, you have to remember to do this every time you call text().
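As a rough illustration of the workaround above, here is a minimal sketch that undoes the mis-decoding with mb_convert_encoding() instead of utf8_decode() (which is deprecated as of PHP 8.2). The function name repairMisdecodedText() and its $actualCharset parameter are hypothetical, introduced only for this example. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8.

```php
<?php

/**
 * Undo the "bytes were read as ISO-8859-1" mistake.
 * Hypothetical helper, not part of Goutte or Symfony.
 */
function repairMisdecodedText(string $text, string $actualCharset = 'UTF-8'): string
{
    // Step 1: map each character back to the single byte it was read from.
    $rawBytes = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // Step 2: reinterpret those bytes in the page's real charset.
    return mb_convert_encoding($rawBytes, 'UTF-8', $actualCharset);
}

// Simulate the bug: UTF-8 bytes decoded as if they were ISO-8859-1.
$mangled = mb_convert_encoding('£15,216', 'UTF-8', 'ISO-8859-1');

echo $mangled, "\n";                        // Â£15,216
echo repairMisdecodedText($mangled), "\n";  // £15,216
```

For a page that is really in another single-byte charset, the second argument would name that charset, assuming mbstring supports it.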

A more generic approach is to make the DomCrawler believe that it is dealing with UTF-8. To that end I've come up with a Guzzle plugin that overwrites (or adds) the charset in the Content-Type response header. You can find it at https://gist.github.com/pschultz/6554265. Usage is like this:

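<?php

use Goutte\Client;

// ForceCharsetPlugin is the class from the gist linked above; it must be
// included or registered with your autoloader before this code runs.
$plugin = new ForceCharsetPlugin();
$plugin->setForcedCharset('utf-8');

$client = new Client();
$client->getClient()->addSubscriber($plugin);
$crawler = $client->request('GET', $url); // $url is the page to scrape

echo $crawler->text();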
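The header rewrite that the plugin is described as performing can be sketched in plain PHP. This is not the code from the gist; forceCharset() is a hypothetical helper that only shows what "overwrite or add the charset" means for a Content-Type value.

```php
<?php

/**
 * Return $contentType with its charset parameter replaced by (or
 * extended with) $charset. Hypothetical helper for illustration only.
 */
function forceCharset(string $contentType, string $charset): string
{
    $kept = [];
    foreach (explode(';', $contentType) as $part) {
        $part = trim($part);
        // Skip empty segments and any charset parameter already present.
        if ($part === '' || preg_match('/^charset\s*=/i', $part)) {
            continue;
        }
        $kept[] = $part;
    }
    $kept[] = 'charset=' . $charset;

    return implode('; ', $kept);
}

echo forceCharset('text/html', 'utf-8'), "\n";                      // text/html; charset=utf-8
echo forceCharset('text/html; charset=ISO-8859-1', 'utf-8'), "\n";  // text/html; charset=utf-8
```

Applying a rewrite like this before the body reaches the crawler means every later call to text() sees correctly decoded content, with no per-call decoding.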

Solution 2

I seem to have been hitting two bugs here, one of which was identified by Peter's answer. The other was the way in which I am separately using the Symfony Crawler class to explore HTML snippets.

I was doing this (to parse the HTML for a table row):

$subCrawler = new Crawler($rowHtml);

Adding HTML via the constructor, however, does not appear to give a way in which the character set can be specified, and I assume ISO-8859-1 is again the default.

Simply using addHtmlContent gets it right; the second parameter specifies the character set, and it defaults to UTF-8 if it is not specified.

$subCrawler = new Crawler();
$subCrawler->addHtmlContent($rowHtml);

Solution 3

Crawler tries to detect the charset from the <meta charset> tag, but that tag is frequently missing, in which case Crawler falls back to its default charset (ISO-8859-1). That is the source of the problem described in this thread.

When we pass content to Crawler through its constructor, we lose the Content-Type header, which usually contains the charset.

Here's how we can handle it:

$crawler = new Crawler();
$crawler->addContent(
    $response->getBody()->getContents(), 
    $response->getHeaderLine('Content-Type')
);

With this solution we use the correct charset from the server response and do not tie ourselves to any single charset. We also no longer need to decode every string returned by Crawler (with utf8_decode() or otherwise).
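To make the lookup order concrete, here is a simplified sketch of resolving a charset from the Content-Type header first, then from a meta tag, then from a fallback. It is not Symfony's implementation; the function names are hypothetical, and the regular expressions are deliberately loose rather than a full HTML parse.

```php
<?php

// Hypothetical helpers for illustration; not part of Symfony or Guzzle.

function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

function charsetFromMeta(string $html): ?string
{
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s"\'\/>;]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

function resolveCharset(string $contentType, string $html, string $fallback = 'iso-8859-1'): string
{
    // Header wins, then the meta tag, then the fallback.
    return charsetFromContentType($contentType)
        ?? charsetFromMeta($html)
        ?? $fallback;
}

echo resolveCharset('text/html', '<meta charset="utf-8" />'), "\n"; // utf-8
echo resolveCharset('text/html', '<p>no hint</p>'), "\n";           // iso-8859-1
```

The resolved value could then be passed as the second argument to addHtmlContent(), which takes the charset explicitly.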

Author by

halfer

Updated on June 09, 2022

Comments

  • halfer almost 2 years

    I'm scraping from a UTF-8 site, using Goutte, which internally uses Guzzle. The site declares a meta tag of UTF-8, thus:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    

    However, the content type header is thus:

    Content-Type: text/html
    

    and not:

    Content-Type: text/html; charset=utf-8
    

    Thus, when I scrape, Goutte does not spot that it is UTF-8, and grabs data incorrectly. The remote site is not under my control, so I can't fix the problem there! Here's a set of scripts to replicate the problem. First, the scraper:

    <?php
    
    require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';
    
    $url = 'http://crawler-tests.local/utf-8.php';
    use Goutte\Client;
    
    $client = new Client();
    $crawler = $client->request('get', $url);
    $text = $crawler->text();
    echo 'Whole page: ' . $text . "\n";
    

    Now a test page to be placed on a web server:

    <?php
    // Correct
    #header('Content-Type: text/html; charset=utf-8');
    
    // Incorrect
    header('Content-Type: text/html');
    ?>  
    <!DOCTYPE html>
    <html>
        <head>
            <title>UTF-8 test</title>
            <meta charset="utf-8" />
        </head>
        <body>
            <p>When the Content-Header header is incomplete, the pound sign breaks:
    
            £15,216</p>
        </body>
    </html>
    

    Here's the output of the Goutte test:

    Whole page: UTF-8 test When the Content-Header header is incomplete, the pound sign breaks: Â£15,216

    As you can see from the comments in the last script, properly declaring the character set in the header fixes things. I've hunted around in Goutte to see if there is anything that looks like it would force the character set, but to no avail. Any ideas?

  • halfer over 10 years
    This looks great, Peter - many thanks. I'll give it a whirl tomorrow and will let you know how I get on! I've updated the question with the faulty output of the test I wrote - the £ sign is corrupted.
  • halfer over 10 years
    Just tried this in standalone mode, works fine. Again, thanks! I was thinking of using the workaround you suggested, but it seemed somewhat inelegant - your plugin is much nicer.
  • Peter over 10 years
    Symfony 2.3.5 contains a commit, "Crawler guess charset from html". That seems to tackle the issue.
  • halfer over 10 years
    Great, appreciated. I think it'll take a little longer to reach Goutte, but I'll keep an eye on it.
  • mithataydogmus over 10 years
    You saved me from a lot of tests. Thanks.
  • menjaraz about 9 years
    Do subsequent releases of symfony/browser-kit and symfony/domcrawler address the issue?
  • Peter about 9 years
    As mentioned above, this is fixed starting with v2.3.5 and has also been backported to v2.2.7+.
  • Mohammed Abrar Ahmed almost 8 years
    When I run this, I get an error saying PHP Fatal error: Class App\Console\Commands\ForceCharsetPlugin not found and [Symfony\Component\Debug\Exception\FatalErrorException] Class App\Console\Commands\ForceCharsetPlugin not found
  • Umair Ayub over 7 years
    holy crap ... you saved me from killing my self ... I was figuring this out for whole day
  • halfer over 7 years
    No worries @Umair, pleased it was of assistance!
  • halfer over 7 years
    @MohdAbrarAhmed: you'd need to include that class or add it to your autoloading system.