Perl: utf8::decode vs. Encode::decode


Solution 1

You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.
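As a minimal sketch of the error checking that Encode offers, the snippet below feeds deliberately malformed octets to decode with the Encode::FB_CROAK check flag and traps the resulting exception. The sample input and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the octet 0xFF can never occur in well-formed UTF-8.
my $broken = "abc\xFFdef";

# FB_CROAK makes decode die on malformed input;
# LEAVE_SRC asks Encode not to modify $broken in place.
my $decoded = eval {
    decode('UTF-8', $broken, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $error = $@;

if (defined $decoded) {
    print "decoded fine\n";
}
else {
    print "decode failed: $error";
}
```

Wrapping the call in eval turns a malformed input into something the program can handle instead of silently carrying bad data forward.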

You go wrong in assuming that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the character and the newline. They are the single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.

length is inappropriately overloaded: at certain times it returns the length in characters, at others the length in octets. Use better tools such as Devel::Peek.

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets from a file without implicit decoding; it makes no difference

Dump $test;
#  FLAGS = (PADMY,POK,pPOK)
#  PV = 0x8d8520 "\350\253\206\n"\0

$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
#  FLAGS = (PADMY,POK,pPOK,UTF8)
#  PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
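
The same approach can be applied to the seven octets shown in the question's hex dump. The following is only a sketch: it decodes twice with Encode, as the question does, and reports the character count after each pass. The variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The seven octets from the question's hex dump: c3 a8 c2 ab c2 8b 0a
my $octets = "\xC3\xA8\xC2\xAB\xC2\x8B\x0A";

# First pass: 7 octets become 4 characters (U+00E8 U+00AB U+008B U+000A).
my $once = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Second pass: every character fits in one octet, so Encode can treat
# them as octets again and decode them into 2 characters.
my $twice = decode('UTF-8', $once, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "octets on disk:     %d\n", length $octets;
printf "after one decode:   %d characters\n", length $once;
printf "after two decodes:  %d characters\n", length $twice;
printf "first code point:   U+%04X\n", ord $twice;

# Counting octets explicitly avoids relying on what length means here.
printf "re-encoded length:  %d octets\n", length encode('UTF-8', $twice);
```

Encoding explicitly before counting octets keeps the character count and the octet count as two separate, unambiguous measurements.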

Solution 2

Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.

Author: Matt

Updated on June 14, 2022

Comments

  • Matt
    Matt almost 2 years

    I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.

    What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF8. I then run the following perl script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    require "Encode.pm";
    require "utf8.pm";
    
    open FILE, "test.txt" or die $!;
    my @lines = <FILE>;
    my $test =  $lines[0];
    
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    my @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    my @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";
    
    print "==============\n";
    
    $test = Encode::decode("utf8", $test);
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";
    
    print "==============\n";
    
    $test = Encode::decode("utf8", $test);
    print "Length: " . (length $test) . "\n";
    print "utf8 flag: " . utf8::is_utf8($test) . "\n";
    @unicode = (unpack('U*', $test));
    print "Unicode:\n@unicode\n";
    @hex = (unpack('H*', $test));
    print "Hex:\n@hex\n";
    

    This gives the following output:

    Length: 7
    utf8 flag: 
    Unicode:
    195 168 194 171 194 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 4
    utf8 flag: 1
    Unicode:
    232 171 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 2
    utf8 flag: 1
    Unicode:
    35531 10
    Hex:
    e8ab8b0a
    

    This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters.

    The strange thing is what happens when I modify the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test); with utf8::decode($test);).

    This gives almost identical output; only the result of length differs:

    Length: 7
    utf8 flag: 
    Unicode:
    195 168 194 171 194 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 4
    utf8 flag: 1
    Unicode:
    232 171 139 10
    Hex:
    c3a8c2abc28b0a
    ==============
    Length: 4
    utf8 flag: 1
    Unicode:
    35531 10
    Hex:
    e8ab8b0a
    

    It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?

    Thanks,
    Matt

  • Matt
    Matt over 13 years
    Thanks for the response. The Perl documentation does say it's okay to use the functions in the utf8 module. The sentence after your quote is "The utility functions described below are directly usable without use utf8;", i.e. one should not "use" (the Perl keyword use) the utf8 pragma if one does not need to, but one can use (the English word "use") its functions. Also, I realize that "e8ab860a" is the single encoding. My file contains the octets "c3a8c2abc28b0a", which are the double encoding. It turns out that my confusion stems from a bug in the "length" function. See perlmonks.org/?node_id=874996
  • Admin
    Admin over 13 years
    It actually says "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without use utf8;.", which clearly does not mean "you are not supposed to use the functions from the utf8 pragma module". It means you don't need to use the pragma to import the functions.
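
To illustrate the point made in these comments, here is a small sketch that calls utf8::decode without loading the pragma and inspects its return value on repeated calls. The input is the octet sequence from the question's hex dump; the variable names are arbitrary.

```perl
use strict;
use warnings;
# Deliberately no "use utf8;": the utf8:: utility functions are built in.

my $string = "\xC3\xA8\xC2\xAB\xC2\x8B\x0A";

# utf8::decode works in place and returns true on success, false otherwise.
my @results;
for my $pass (1 .. 3) {
    my $ok = utf8::decode($string) ? 1 : 0;
    push @results, $ok;
    printf "pass %d: %s\n", $pass, $ok ? 'decoded' : 'left unchanged';
}

printf "first code point: U+%04X\n", ord $string;
```

The first two passes succeed because the string still consists of octet-sized values; the third returns false because the string now holds a character above 0xFF, so it is left unchanged rather than raising an error.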