How can I convert an input file to UTF-8 encoding in Perl?


Solution 1

I think I misunderstood your question. I think what you want is to read a file in a non-UTF-8 encoding and then work with the data as UTF-8 in your program. That's much easier. Once you read the data through the right encoding layer, Perl decodes it into its internal character representation (which is UTF-8 based). So, just do what you have to do.

When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
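As a rough sketch of that read-then-write flow (the temporary file names and the sample text are assumptions made up for illustration, not part of the question), decoding on input and encoding on output could look like this:

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical file names in a temp directory, so the sketch runs anywhere.
my $dir = tempdir(CLEANUP => 1);
my ($in_path, $out_path) = ("$dir/in_gb2312.txt", "$dir/out_utf8.txt");

# Build a small GB2312-encoded sample file holding two Chinese characters.
open my $raw, '>:raw', $in_path or die "Cannot write $in_path: $!";
print {$raw} encode('gb2312', "\x{4E2D}\x{6587}\n");
close $raw or die "Cannot close $in_path: $!";

# Decode while reading: the layer turns GB2312 bytes into Perl characters.
open my $in, '<:encoding(gb2312)', $in_path or die "Cannot read $in_path: $!";
# Encode while writing: characters leave the program as UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $out_path or die "Cannot write $out_path: $!";

my $chars = 0;
while (my $line = <$in>) {
    chomp $line;
    $chars += length $line;    # counts characters, not bytes
    print {$out} "$line\n";
}
close $in;
close $out or die "Cannot close $out_path: $!";
```

The output handle is only needed if the converted data has to end up in a file; the loop body can work on $line directly.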


old answer

The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.

You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
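A minimal sketch of from_to(), assuming the data is already in memory as GB2312 bytes (the sample bytes are built with encode() purely for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Sample GB2312 bytes standing in for data read from a file without any layer.
my $bytes  = encode('gb2312', "\x{4E2D}\x{6587}");
my $before = length $bytes;    # 4: each of the two characters takes two bytes

# from_to() converts the bytes in place and returns the new length in bytes
# (or undef if the conversion fails).
my $after = from_to($bytes, 'gb2312', 'UTF-8');
defined $after or die "Conversion failed";
```

After the call, $bytes holds UTF-8 encoded bytes rather than decoded characters, so length($bytes) is still a byte count.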

If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.

Solution 2

The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if it spans multiple bytes. Depending on what you are going to do next with the data, this may be adequate.

But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)

To do that, all you need is to apply an additional layer to your open:

open my $foo, "<:encoding(gb2312):bytes", ...;

Note that the output of the following will be the same:

perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'

but in the second case (the one with -CO), perl knows that the data read is utf8 (and so length($bar) will report the number of utf8 characters) and has to be explicitly told (by -CO) that STDOUT will accept utf8, and in the first case (the one with :bytes), perl makes no assumptions about the data (and so length($bar) will report the number of bytes), and just prints it out as is.
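To make the character-count versus byte-count distinction concrete, here is a small sketch. It is not the :bytes layer from above: it swaps in an explicit Encode::encode() call to get the byte view, and the temp file and sample text are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical sample file containing two Chinese characters in GB2312.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/foo";
open my $raw, '>:raw', $path or die "Cannot write $path: $!";
print {$raw} encode('gb2312', "\x{4E2D}\x{6587}");
close $raw or die "Cannot close $path: $!";

# Character view: the layer decodes, so perl sees two characters.
open my $foo, '<:encoding(gb2312)', $path or die "Cannot read $path: $!";
my $bar = <$foo>;
close $foo;
my $char_len = length $bar;        # 2 characters

# Byte view: the same text as UTF-8 encoded bytes.
my $utf8_bytes = encode('UTF-8', $bar);
my $byte_len   = length $utf8_bytes;    # 6 bytes, three per character

# Tell perl that STDOUT accepts UTF-8 before printing characters.
binmode(STDOUT, ':encoding(UTF-8)');
print "$bar\n";
```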

Author: Mike

I'm learning Perl as my first programming language. I'm enjoying it and I'm grateful to the SO members.

Updated on June 04, 2022

Comments

  • Mike
    Mike almost 2 years

    I already know how to convert the non-UTF-8 content of a file to UTF-8 line by line, using something like the following code:

    use Encode;

    # outfile.txt is encoded in GB2312
    open my $filter, '<', 'c:/outfile.txt' or die "Cannot open file: $!";

    while (<$filter>) {
        # decode each GB2312 line into Perl's internal (UTF-8) characters
        $_ = Encode::decode("gb2312", $_);
        ...
    }
    

    But I think Perl can convert the whole input file to UTF-8 directly, so I've tried something like

    #outfile.txt is in GB-2312 encode
    open my $filter,"<:utf8",'c:/outfile.txt'; 
    

    (Perl says something like "utf8 "\xD4" does not map to Unicode" )

    and

    open my $filter,"<",'c:/outfile.txt'; 
    $filter = Encode::decode("gb2312", $filter); 
    

    (Perl says "readline() on unopened filehandle!")

    They don't work. But is there some way to directly convert the input file to UTF-8?

    Update:

    Looks like things are not as simple as I thought. I can now convert the input file to UTF-8 in a roundabout way: I open the input file, encode its content as UTF-8, write it to a new file, and then read the new file for further processing. This is the code:

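    open my $filter, '<:encoding(gb2312)', 'c:/outfile.txt' or die "Cannot open input: $!";
    open my $filter_new, '+>:utf8', 'c:/outfile_new.txt' or die "Cannot open output: $!";
    print $filter_new $_ while <$filter>;
    seek $filter_new, 0, 0;    # rewind before reading back what was just written
    while (<$filter_new>){
    ...
    }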
    

    But this is too much work, and it is even more troublesome than simply encoding the content of $filter line by line.

  • Mike
    Mike over 14 years
    @brian, thanks for the guidance. I thought there should be some simple way to directly convert the input file to UTF-8 while opening it, but now it looks like things are not that simple. I'm thinking I can open the input file first, encode the content as UTF-8, write it to another file, and then open that file. The code looks like: open my $filter,'<:encoding(gb2312)','c:/outfile.txt'; open my $filter_new, '+>:utf8', 'c:/f2.txt'; print $filter_new $_ while <$filter>; while (<$filter_new>){...} But this is too much work.
  • brian d foy
    brian d foy over 14 years
    Your idea of too much work is skewed. Try doing it by hand and then come back and tell us how easy Perl makes it for you. Kids today don't know how good they have it. :)
  • ysth
    ysth over 14 years
    Mike's instincts are correct; you can stack layers to directly do the conversion he wants :)
  • brian d foy
    brian d foy over 14 years
    You can't stack layers, really. You still have to read it, and you still have to write it, if you want the file to end up in a different encoding.
  • ysth
    ysth over 14 years
    I'm pretty sure (it's a little clearer in the original part of the question, I think) that all he wants is to convert the data from the file, not the file itself. But yes, to do the latter, just reading isn't sufficient.
  • Mike
    Mike over 14 years
    @ysth, I guess I must have phrased my question wrong. Actually what I wanted was to convert the input file to UTF-8 and then do a readline operation. I already knew how to convert the data of the input file while doing a readline operation using the while loop. But thanks.
  • Mike
    Mike over 14 years
    @brian, well, yes, one way of looking at my question is: "are there some better ways to read a file in a non-UTF-8 encoding and then play with the data as UTF-8?" By "better ways", I mean not the line-by-line conversion method which I already learnt.