perl: Uncaught exception: malformed UTF-8 character in JSON string

json perl unicode utf-8

18,590

Solution 1

I expand on my answer in Know the difference between character strings and UTF-8 strings.

From reading the JSON docs, I think those functions don't want a character string, but that's what you're trying to give it. Instead, they want a "UTF-8 binary string". That seems odd to me, but I'm guessing that it's mostly to take input directly from an HTTP message instead of something that you type directly in your program. This works because I make a byte string that's the UTF-8 encoded version of your string:

use v5.14;

use utf8;                                                 
use warnings;                                             
use feature     qw< unicode_strings >;

use Data::Dumper;
use Devel::Peek;
use JSON;

my $filename = 'hei.txt';
my $char_string = qq( { "my_test" : "hei på deg" } );
open my $fh, '>:encoding(UTF-8)', $filename;
print $fh $char_string;
close $fh;


{
say '=' x 70;
my $byte_string = qq( { "my_test" : "hei p\303\245 deg" } );
print "Byte string peek:------\n"; Dump( $byte_string );
decode( $byte_string );
}


{
say '=' x 70;
my $raw_string = do { 
    open my $fh, '<:raw', $filename;
    local $/; <$fh>;
    };
print "raw string peek:------\n"; Dump( $raw_string );

decode( $raw_string );
}

{
say '=' x 70;
my $char_string = do { 
    open my $fh, '<:encoding(UTF-8)', $filename;
    local $/; <$fh>;
    };
print "char string peek:------\n"; Dump( $char_string );

decode( $char_string );
}

sub decode {
    my $string = shift;

    my $hash_ref2 = eval { decode_json( $string ) };
    say "Error in sub form: $@" if $@;
    print Dumper( $hash_ref2 );

    my $hash_ref1 = eval { JSON->new->utf8->decode( $string ) };
    say "Error in method form: $@" if $@;
    print Dumper( $hash_ref1 );
    }

The output shows that the character string doesn't work, but the byte string versions do:

======================================================================
Byte string peek:------
SV = PV(0x100801190) at 0x10089d690
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100209890 " { \"my_test\" : \"hei p\303\245 deg\" } "\0
  CUR = 31
  LEN = 32
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
======================================================================
raw string peek:------
SV = PV(0x100839240) at 0x10089d780
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100212260 " { \"my_test\" : \"hei p\303\245 deg\" } "\0
  CUR = 31
  LEN = 32
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
======================================================================
char string peek:------
SV = PV(0x10088f3b0) at 0x10089d840
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002017b0 " { \"my_test\" : \"hei p\303\245 deg\" } "\0 [UTF8 " { "my_test" : "hei p\x{e5} deg" } "]
  CUR = 31
  LEN = 32
Error in sub form: malformed UTF-8 character in JSON string, at character offset 21 (before "\x{5824}eg" } ") at utf-8.pl line 51.

$VAR1 = undef;
Error in method form: malformed UTF-8 character in JSON string, at character offset 21 (before "\x{5824}eg" } ") at utf-8.pl line 55.

$VAR1 = undef;

So, if you take your character string, which you typed directly into your program, and convert it to a UTF-8 encoded byte string, it works:

use v5.14;

use utf8;                                                 
use warnings;                                             
use feature     qw< unicode_strings >;

use Data::Dumper;
use Encode qw(encode_utf8);
use JSON;

my $char_string = qq( { "my_test" : "hei på deg" } );

my $string = encode_utf8( $char_string );

decode( $string );

sub decode {
    my $string = shift;

    my $hash_ref2 = eval { decode_json( $string ) };
    say "Error in sub form: $@" if $@;
    print Dumper( $hash_ref2 );

    my $hash_ref1 = eval { JSON->new->utf8->decode( $string ) };
    say "Error in method form: $@" if $@;
    print Dumper( $hash_ref1 );
    }

I think JSON should be smart enough to deal with this so you don't have to think at this level, but that's the way it is (so far).

Solution 2

The docs say

$perl_hash_or_arrayref  = decode_json $utf8_encoded_json_text;

yet you do everything in your power to decode the input before passing it to decode_json.

use strict;
use warnings;
use utf8;

use Data::Dumper qw( Dumper );
use Encode       qw( encode );
use JSON         qw( );

for my $json_text (
   qq{{ "my_test" : "hei på deg" }\n},
   qq{{ "water" : "水" }\n},
) {
   my $json_utf8 = encode('UTF-8', $json_text);  # Counteract "use utf8;"
   my $data = JSON->new->utf8->decode($json_utf8);

   local $Data::Dumper::Useqq  = 1;
   local $Data::Dumper::Terse  = 1;
   local $Data::Dumper::Indent = 0;
   print(Dumper($data), "\n");
}

Output:

{"my_test" => "hei p\x{e5} deg"}
{"water" => "\x{6c34}"}

PS — It would be easier to help you if you didn't have two pages of code to demonstrate a simple problem.

Solution 3

If there is a malformed UTF-8 character in your data, you can remove it in the following way (imagine that data are contained in data.txt):

iconv -f utf-8 -t utf-8 -c < data.txt > clean-data.txt

The -c option of iconv will silently remove all malformed characters.

18,590

Author by

hlovdal

Linux user since 1994. Main programming language: C. #SOreadytohelp (http://stackoverflow.com/10m)

Updated on June 19, 2022

Comments

hlovdal almost 2 years

Related to this question and this answer (to another question) I am still unable to process UTF-8 with JSON.

I have tried to make sure all the required voodoo is invoked based on recommendations from the very best experts, and as far as I can see the string is as valid, marked and labelled as UTF-8 as possible. But still perl dies with either

Uncaught exception: malformed UTF-8 character in JSON string

Uncaught exception: Wide character in subroutine entry

What am I doing wrong here?

(hlovdal) localhost:/work/2011/perl_unicode>cat json_malformed_utf8.pl 
#!/usr/bin/perl -w -CSAD

### BEGIN ###
# Apparently the very best perl unicode boiler template code that exist,
# https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
# Slightly modified.

use v5.12; # minimal for unicode string feature
#use v5.14; # optimal for unicode string feature

use utf8;                                                 # Declare that this source unit is encoded as UTF‑8. Although
                                                          # once upon a time this pragma did other things, it now serves
                                                          # this one singular purpose alone and no other.
use strict;
use autodie;

use warnings;                                             # Enable warnings, since the previous declaration only enables
use warnings    qw< FATAL  utf8     >;                    # strictures and features, not warnings. I also suggest
                                                          # promoting Unicode warnings into exceptions, so use both
                                                          # these lines, not just one of them. 

use open        qw( :encoding(UTF-8) :std );              # Declare that anything that opens a filehandles within this
                                                          # lexical scope but not elsewhere is to assume that that
                                                          # stream is encoded in UTF‑8 unless you tell it otherwise.
                                                          # That way you do not affect other module’s or other program’s code.

use charnames   qw< :full >;                              # Enable named characters via \N{CHARNAME}.
use feature     qw< unicode_strings >;

use Carp                qw< carp croak confess cluck >;
use Encode              qw< encode decode >;
use Unicode::Normalize  qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) { 
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$| = 1;

binmode(DATA, ":encoding(UTF-8)");                        # If you have a DATA handle, you must explicitly set its encoding.

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stackdumped exceptions
#   *unless* we're in an try block, in which 
#   case just generate a clucking stackdump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" } 
    else     { confess "Deadly warning: @_"  }
};

### END ###


use JSON;
use Encode;

use Getopt::Long;
use Encode;

my $use_nfd = 0;
my $use_water = 0;
GetOptions("nfd" => \$use_nfd, "water" => \$use_water );

print "JSON->backend->is_pp = ", JSON->backend->is_pp, ", JSON->backend->is_xs = ", JSON->backend->is_xs, "\n";

sub check {
        my $text = shift;
        return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}

my $json_text = "{ \"my_test\" : \"hei på deg\" }\n";
if ($use_water) {
        $json_text = "{ \"water\" : \"水\" }\n";
}
if ($use_nfd) {
        $json_text = NFD($json_text);
}

print check($json_text), "\$json_text = $json_text";

# test from perluniintro(1)
if (eval { decode_utf8($json_text, Encode::FB_CROAK); 1 }) {
        print "string is valid utf8\n";
} else {
        print "string is not valid utf8\n";
}

my $hash_ref1 = JSON->new->utf8->decode($json_text);
my $hash_ref2 = decode_json( $json_text );

__END__

Running this gives

(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl 
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei på deg" }
string is valid utf8
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl | ./uniquote 
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei p\N{U+E5} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -nfd | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei pa\N{U+30A} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water 
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "水" }
string is valid utf8
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water --nfd | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>rpm -q perl perl-JSON perl-JSON-XS
perl-5.12.4-159.fc15.x86_64
perl-JSON-2.51-1.fc15.noarch
perl-JSON-XS-2.30-2.fc15.x86_64
(hlovdal) localhost:/work/2011/perl_unicode>

uniquote is from http://training.perl.com/scripts/uniquote

Update:

Thanks to brian for highlighting the solution. Updating the source to use json_text for all normal strings and json_bytes for what is going to be passed to JSON like the following now works like expected:

my $json_bytes = encode('UTF-8', $json_text);
my $hash_ref1 = JSON->new->utf8->decode($json_bytes);

I must say that I think the documentation for the JSON module is extremely unclear and partially misleading.

The phrase "text" (at least to me) implies a string of characters. So when reading $perl_scalar = decode_json $json_text I have an expectation of json_text being a UTF-8 encoded string of characters. Thoroughly re-reading the documentation, knowing what to look for, I now see it says: "decode_json ... expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text", however that still is not clear in my opinion.

From my background using a language having some additional non-ASCII characters, I remember back in the days where you had to guess the code page being used, email used to just cripple text by stripping of the 8th bit, etc. And "binary" in the context of strings meant a string containing characters outside the 7-bit ASCII domain. But what is "binary" really? Isn't all strings binary at the core level?

The documentation also says "simple and fast interfaces (expect/generate UTF-8)" and "correct unicode handling", first point under "Features", both without mentioning anywhere near that it does not want a string but instead a byte sequence. I will request the author to at least make this clearer.

brian d foy over 12 years

Tom's Unicode utilities are also available as Unicode::Tussle.

ikegami over 12 years

Re "That seems odd to me", it doesn't seem odd to me at all. You don't decode XML before passing it to an XML parser. You don't decode a Perl program before passing it to Perl. If those parsers required decoded text, the file would have to be parsed twice: once to determine the encoding and once to do the actual parsing. Quite the opposite, it seems very weird to me to decode something before passing it to a method called decode.
ikegami over 12 years

Re "I think JSON should be smart enough to deal with this", it's impossible. It has no way to know whether something has already been decoded or not.
brian d foy over 12 years

It seems odd to me because I didn't expect it. I'd like things to just work without this level of thought. At OSCON, my main gripe was that we just have a scalar and we can't tell if it's a binary string or a character string. As for being smart enough, maybe it is impossible, but I'd still like that feature. I wonder what is happening though. If Perl has it marked as a UTF-8 string, what is JSON really using? I'll have to investigate more.
brian d foy over 12 years

I don't think it's odd to decode something before passing it to something called decode(). One deals with the character encoding, and one deals with the data structure encoding. Those are different beasts.
brian d foy over 12 years

I also recently read that some browsers indeed parse twice. They go along until they hit a META header that declares an encoding, then starts over assuming that encoding.
ikegami over 12 years

HTML5 details a look ahead approach. Not much choice in the matter because the Content-Type can appear after it's needed. It's still done as part of the same parsing process, though.
SineSwiper over 9 years

RE: "You don't decode XML before passing it to an XML parser." Depends on the source. If you acquire it from a database, DBD::mysql is going to auto-decode it with the mysql_enable_utf8 switch on. If you acquire it from HTTP input, you -should- be decoding that sucker ASAP. Decode early, encode late.