How do I read UTF-8 with diamond operator (<>)?

Solution 1

Use the open pragma instead:

use strict;
use warnings;
use open qw(:std :utf8);

while(<>){
    my @chars = split //, $_;
    print "$_\n" foreach(@chars);
}

This is necessary because the <> operator is magical: it reads from STDIN or from the files named in @ARGV. Reading from STDIN causes no problem, because STDIN is already open when the script starts, so binmode works on it. The problem arises with the files in @ARGV: when your script starts and calls binmode, those files are not yet open. The call sets STDIN to UTF-8, but that handle is not used when @ARGV contains files. In that case the <> operator opens a new file handle for each file in @ARGV, and each newly opened handle starts with the default layers, so it does not have the UTF-8 attribute. With the open pragma, UTF-8 becomes the default for every file handle opened within its scope, including the ones that <> opens.

Solution 2

Your script works if you do this:

#!/usr/bin/perl -w

binmode STDOUT, ':utf8';

while(<>){
    binmode ARGV, ':utf8';

    my @chars = split //, $_;
    print "$_\n" foreach(@chars);
}

The magic filehandle that <> reads from is called *ARGV, and it is opened when you call readline.

But really, I am a fan of explicitly using Encode::decode and Encode::encode when appropriate.
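As a rough sketch of that explicit approach (not the original poster's code): the loop below leaves <> reading raw bytes, decodes each line with Encode::decode, and encodes again on output. The temporary file and the @ARGV assignment are assumptions made only so that the example runs on its own; in a real script <> would read whatever the caller supplies. It also assumes that no -C switch or PERL_UNICODE setting has already put a UTF-8 layer on the handles.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Demo input (assumption): the byte sequence from the question, a\xCA\xA7b.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;                      # write raw bytes, no encoding layer
print {$fh} "a\xCA\xA7b\n";
close $fh or die "close failed: $!";

# Stand-in for a file name passed on the command line.
@ARGV = ($path);

my @chars;
while (my $line = <>) {
    # <> hands back raw bytes; turn them into characters explicitly.
    my $text = decode('UTF-8', $line);
    chomp $text;
    push @chars, split //, $text;
}

# Encode back to UTF-8 bytes on the way out, one character per line.
print encode('UTF-8', "$_\n") for @chars;
```

Because no layers are set on any handle, the decoding step is the same whether the input arrives through STDIN or through a file named on the command line.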

Solution 3

You can switch on UTF-8 by default with the -C flag:

perl -CSD -ne 'print join("\n",split //);' utf8.txt

The switch -CSD turns on UTF-8 unconditionally; if you use simply -C, it turns on UTF-8 only if the relevant environment variables (LC_ALL, LC_CTYPE and LANG) indicate so. See perlrun for details.

This is not recommended if you don't invoke perl directly (in particular, it might not work reliably if you pass options to perl from the shebang line). See the other answers in that case.

Solution 4

If you put a call to binmode inside of the while loop, then it will switch the handle to utf8 mode AFTER the first line is read in. That is probably not what you want to do.

Something like the following might work better:

#!/usr/bin/env perl
use warnings;
binmode STDOUT, ':utf8';
eof() ? exit : binmode ARGV, ':utf8';
while( <> ) {
    my @chars = split //, $_;
    print "$_\n" foreach(@chars);
} continue {
    binmode ARGV, ':utf8' if eof && !eof();
}

The call to eof() with parens is magical, as it checks for end of file on the pseudo-filehandle used by <>. It will, if necessary, open the next handle that needs to be read, which typically has the effect of making *ARGV valid, but without reading anything out of it. This allows us to binmode the first file that's read from, before anything is read from it.

Later, eof (without parens) is used; this checks the last handle that was read from for end of file. It will be true after we process the last line of each file from the command line (or when STDIN reaches its end).

Obviously, if we've just processed the last line of one file, calling eof() (with parens) opens the next file (if there is one), makes *ARGV valid (if it can), and tests for end of file on that next file. If that next file is present, and isn't at end of file, then we can safely use binmode on ARGV.
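To see the difference between the two forms of eof in isolation, here is a small sketch. The make_file helper and the two temporary files are hypothetical, used only so that <> has two inputs to walk through; the sketch records which lines end a single file and which line ends the whole input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $content to a temp file and return its path.
sub make_file {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close failed: $!";
    return $path;
}

# Stand-in for two file names passed on the command line.
@ARGV = (make_file("1\n2\n"), make_file("3\n"));

my @events;
while (my $line = <>) {
    chomp $line;
    # eof without parens: end of the file this line was read from.
    push @events, "line $line ends a file" if eof;
    # eof() with parens: end of the whole <> stream; it may open the next file.
    push @events, "line $line ends all input" if eof();
}
print "$_\n" for @events;
```

The order of the two checks matters: eof() can advance ARGV to the next file, so the plain eof test has to run first while ARGV still refers to the file that was just read.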

Author: Frank

Updated on June 02, 2022

Comments

  • Frank, almost 2 years ago

    I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.

    So my script should be callable in these two ways, as usual, giving the same output:

    ./script.pl utf8.txt
    cat utf8.txt | ./script.pl
    

    But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:

    #!/usr/bin/perl -w
    
    binmode STDIN, ':utf8';
    binmode STDOUT, ':utf8';
    
    while(<>){
        my @chars = split //, $_;
        print "$_\n" foreach(@chars);
    }
    

    How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.

    EDIT:

    I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:

    a
    \xCA\xA7
    b
    

    But the other method gives me this:

    a
    \xC3\x8A
    \xC2\xA7
    b
    
  • Shruti Singh, about 15 years ago
    There has been an issue with the -C switch since Perl 5.10: fi.muni.cz/~kas/blog/index.cgi/computers/…
  • Shruti Singh, about 15 years ago
    Off topic: '#!/usr/bin/perl' is not a recommended shebang line; see perlrun for details. If you don't want the perlrun approach, use #!/usr/bin/env perl, which is more portable than #!/usr/bin/perl.
  • Bruno De Fraine, about 15 years ago
    Thanks, I made it clear you should only use this when you invoke perl directly.
  • brian d foy, about 15 years ago
    Do you have to have the binmode in the while because ARGV is reset for multiple files?
  • mavit, over 11 years ago
    I looked at this and thought, "That won't work! You're setting binmode after the first line has already been read from <>". However, I tried it, and it does work. Highly magical.
  • Keith Thompson, about 5 years ago
    @Hynek-Pichi-Vychodil: Greetings from ten years in the future! There are advantages and disadvantages to the #!/usr/bin/env trick. These days you can usually assume that perl is installed in /usr/bin. See my answer to this question on Unix & Linux for details.
  • Metamorphic, over 3 years ago
    @Hynek-Pichi-Vychodil: I tried putting -CS on the "#!" line (Perl version 5.32) and it seems to work again.
  • Bruno De Fraine, over 3 years ago
    @Metamorphic You are right, I tried on my system (Perl version 5.28) and it works too.