Perl - read file with encoding method?

Solution 1

As noted in the comment on your question, I'm unsure what exactly you're asking.

So I'm assuming you're trying to convert Unicode characters into HTML entities. In that case, one of the ready-made modules is a better choice than a hand-written routine. If that is not working because of encoding problems (which are quite tricky in Perl), then the answer to your question:

Is there not an encoding option like

open FILE, "<", $file or die "Cannot open:$!\n", "UTF-8";

... will probably solve it, and it would probably make your own attempt work as well, but a ready-made module is still the better choice ;-) (By the way, as you wrote it, "UTF-8" ends up as an extra argument to die rather than an option to open, which made it a little hard to understand what you were asking.)

Yes, there is a UTF-8 option, assuming you have a reasonably recent perl (>= v5.8):

open(my $fh,'<:encoding(UTF-8)', $file) or die "Error opening $file: $!";

(example adapted from perluniintro)
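
To see what the layer changes, here is a minimal sketch using only core modules. It writes the UTF-8 bytes of '»' to a temporary file (a stand-in for your real file) and reads it back twice, once raw and once through :encoding(UTF-8). The helper name read_file and the temporary file are assumptions made for illustration, not part of your script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a whole file through the given PerlIO layer.
sub read_file {
    my ($path, $layer) = @_;
    open(my $fh, "<$layer", $path) or die "Error opening $path: $!";
    local $/;                      # slurp mode: read everything at once
    my $contents = <$fh>;
    close($fh);
    return $contents;
}

# Stand-in for the real file: the two UTF-8 bytes of U+00BB plus a newline.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);                  # write raw bytes, no translation
print {$tmp_fh} "\xC2\xBB\n";
close($tmp_fh) or die "Error closing $tmp_path: $!";

my $raw     = read_file($tmp_path, ':raw');              # undecoded bytes
my $decoded = read_file($tmp_path, ':encoding(UTF-8)');  # decoded characters

printf "raw length: %d, decoded length: %d\n", length($raw), length($decoded);
```

Read raw, the guillemet is two bytes, so the string has length 3 with the newline. Read through the layer, it is a single character and the length is 2. A regex or hash lookup keyed on '»' can only match the decoded form.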

You can also use binmode to change the encoding of an already open filehandle (e.g. STDIN or STDOUT).

binmode(STDOUT, ":encoding(UTF-8)");
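
As a rough illustration of what binmode does to output, the sketch below uses an in-memory filehandle in place of STDOUT so that the bytes actually written can be inspected. The variable names are hypothetical.

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for STDOUT so the bytes can be inspected.
my $buffer = '';
open(my $out, '>', \$buffer) or die "Error opening in-memory handle: $!";

# Switch the already-open handle to UTF-8 output.
binmode($out, ':encoding(UTF-8)') or die "binmode failed: $!";

print {$out} "\x{BB}";   # one character: U+00BB, the right-pointing guillemet
close($out) or die "Error closing in-memory handle: $!";

# Show the bytes that were written, in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```

The single character comes out as the two bytes C2 BB, which is its UTF-8 encoding.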

You can also set the default encoding for all open calls in a lexical scope with the open pragma (use open).
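
A minimal sketch of the pragma, assuming a temporary file as a stand-in for your input. Inside the block, a plain three-argument open with no layer in the mode picks up the default declared by use open.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in input file containing the UTF-8 bytes of U+00BB.
my ($sample_fh, $sample_path) = tempfile(UNLINK => 1);
binmode($sample_fh);               # write raw bytes, no translation
print {$sample_fh} "\xC2\xBB";
close($sample_fh) or die "Error closing $sample_path: $!";

my $text;
{
    # Within this block, handles opened for reading default to UTF-8 decoding.
    use open IN => ':encoding(UTF-8)';

    open(my $fh, '<', $sample_path) or die "Error opening $sample_path: $!";
    local $/;                      # slurp mode
    $text = <$fh>;
    close($fh);
}

printf "%d character(s), code point U+%04X\n", length($text), ord($text);
```

The pragma is lexically scoped, so opens outside the block are unaffected. That scoping is also why it is easy to lose track of, which is one reason to prefer an explicit layer on the open line while debugging.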

But for this I suggest trying binmode or changing your open line to see if that solves it.

If your perl is older than v5.8, things are trickier, but the problem may still be resolvable if you tell us the version.

A couple of other things I noticed by the way:

  • Not essential, but it's considered better to use a lexically scoped filehandle (my $fh instead of FILE).
  • When you put a newline on the die string, it suppresses the line number information that is normally added to help you find the problem.
  • If you put the name of the file that couldn't be opened (or the SQL that failed, or whatever) in the die message it will be easier to debug.
  • Don't use sub prototypes in Perl 5, as in sub unicodeConvert($): leave the $/@/% etc. out of the declaration. A prototype doesn't just check the arguments; it can change how the call is parsed, in confusing ways. Prototypes are only needed to create new "built-in style" operators.
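
Two of the points above are easy to check for yourself. The sketch below (all names and the missing path are hypothetical) first captures a die message with and without a trailing newline, then calls the same one-line sub with and without a ($) prototype.

```perl
use strict;
use warnings;

# Hypothetical path that is assumed not to exist, so that open fails.
my $missing = 'no-such-dir/no-such-file.txt';

# Trailing newline: the message is used verbatim, with no location appended.
eval { open(my $fh, '<', $missing) or die "Cannot open:$!\n" };
my $with_newline = $@;

# No trailing newline, file name included: perl appends the script and line.
eval { open(my $fh, '<', $missing) or die "Error opening $missing: $!" };
my $without_newline = $@;

print "with newline:    $with_newline";
print "without newline: $without_newline";

# The ($) prototype forces scalar context on the argument.
sub with_proto($) { return $_[0] }
sub without_proto { return $_[0] }

my @items = (10, 20, 30);
my $from_proto = with_proto(@items);     # @items in scalar context: its size
my $from_plain = without_proto(@items);  # @items flattened: its first element

print "with prototype: $from_proto, without: $from_plain\n";
```

With the prototype the sub receives 3 (the number of elements); without it, it receives 10 (the first element). That silent change of meaning is what the last bullet warns about.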

Solution 2

I suspect there is a mismatch between the charset of your terminal (which may be UTF-8) and that of your Perl script's source code (which you may be editing in a charset-aware editor as ISO 8859-1). If you are sure that your terminal and your source code use the same charset, try adding use utf8; to the top of your script, which tells perl that the source itself is UTF-8 (see man perlunicode). If that does not help, try printing the data that is stored in your database (increase debug logging for DBI), although this may be irrelevant since you don't store the data as UTF-8. In general, try to provide:

  1. The codepage of your terminal (locale) if you execute your script from a terminal (or the system locale used by your server, if you launch it from e.g. Apache)
  2. The charset of your source code.
  3. MySQL connection codepage (do you issue SET NAMES 'utf8'?)
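
The effect of use utf8 on literals in the source can be seen with a short sketch. It assumes the script file itself is saved as UTF-8; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
my ($chars_seen, $bytes_seen);
{
    use utf8;                      # literals in this block are decoded as UTF-8
    $chars_seen = length('»');     # one character, U+00BB
}
{
    no utf8;                       # literals in this block stay as raw bytes
    $bytes_seen = length('»');     # two bytes, 0xC2 0xBB
}

print "with use utf8: $chars_seen, without: $bytes_seen\n";
```

Without use utf8 the literal is the two bytes C2 BB. Displayed as ISO 8859-1, those bytes read as 'Â' followed by '»', which is one plausible source of the stray 'Â' described in the question.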

Also, for HTML encoding you may find it easier to reuse HTML::Entities::decode() / HTML::Entities::encode() (documented as decode_entities() and encode_entities()) rather than implementing this on your own.

Author by

Phil Jackson

Updated on June 19, 2022

Comments

  • Phil Jackson
    Phil Jackson almost 2 years

    I'm not too good when it comes to encoding, and I want to figure out how to return data in the same encoding it started with...

    I have a file containing characters such as '»'. By the time I have edited it and inserted it into the database, they have turned into Â&raquo.

    decode_entities() does nothing and encode_entities() encodes the chars again. So I created my own sub to fix that, but it appears that the data isn't being read from the file in the right format.

    my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
    {
        local( $/ ); # undefine the record separator
        open FILE, "<", $file or die "Cannot open:$!\n";
        my $fileContents = unicodeConvert(<FILE>);
        ...
        .. 
    

    Is there not an encoding option, like:

    my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
    {
        local( $/ ); # undefine the record separator
        open FILE, "<", $file or die "Cannot open:$!\n", "UTF-8";
        my $fileContents = unicodeConvert(<FILE>);
        ...
        .. 
    

    And my sub is:

    sub unicodeConvert($) {
       my $str = shift;
        my %entityRef = ("&" => "&amp;", '¢' => "&cent;", '¤' => "&curren;", '¦' => "&brvbar;", '¨' => "&uml;", 'ª' => "&ordf;", '¬' => "&not;", '®' => "&reg;", '°' => "&deg;", '²' => "&sup2;", '´' => "&acute;", '¶' => "&para;", '¸' => "&cedil;", 'º' => "&ordm;", '¼' => "&frac14;", '¾' => "&frac34;", 'À' => "&Agrave;", 'Â' => "&Acirc;", 'Ä' => "&Auml;", 'Æ' => "&AElig;", 'È' => "&Egrave;", 'Ê' => "&Ecirc;", 'Ì' => "&Igrave;", 'Î' => "&Icirc;", 'Ð' => "&ETH;", 'Ò' => "&Ograve;", 'Ô' => "&Ocirc;", 'Ö' => "&Ouml;", 'Ø' => "&Oslash;", 'Ú' => "&Uacute;", 'Ü' => "&Uuml;", 'Þ' => "&THORN;", 'à' => "&agrave;", 'â' => "&acirc;", 'ä' => "&auml;", 'æ' => "&aelig;", 'è' => "&egrave;", 'ê' => "&ecirc;", 'ì' => "&igrave;", 'î' => "&icirc;", 'ð' => "&eth;", 'ò' => "&ograve;", 'ô' => "&ocirc;", 'ö' => "&ouml;", 'ø' => "&oslash;", 'ú' => "&uacute;", 'ü' => "&uuml;", 'þ' => "&thorn;", '¡' => "&iexcl;", '£' => "&pound;", '¥' => "&yen;", '§' => "&sect;", '©' => "&copy;", '«' => "&laquo;", '¯' => "&macr;", '±' => "&plusmn;", '³' => "&sup3;", 'µ' => "&micro;", '·' => "&middot;", '¹' => "&sup1;", '»' => "&raquo;", '½' => "&frac12;", '¿' => "&iquest;", 'Á' => "&Aacute;", 'Ã' => "&Atilde;", 'Å' => "&Aring;", 'Ç' => "&Ccedil;", 'É' => "&Eacute;", 'Ë' => "&Euml;", 'Í' => "&Iacute;", 'Ï' => "&Iuml;", 'Ñ' => "&Ntilde;", 'Ó' => "&Oacute;", 'Õ' => "&Otilde;", '×' => "&times;", 'Ù' => "&Ugrave;", 'Û' => "&Ucirc;", 'Ý' => "&Yacute;", 'ß' => "&szlig;", 'á' => "&aacute;", 'ã' => "&atilde;", 'å' => "&aring;", 'ç' => "&ccedil;", 'é' => "&eacute;", 'ë' => "&euml;", 'í' => "&iacute;", 'ï' => "&iuml;", 'ñ' => "&ntilde;", 'ó' => "&oacute;", 'õ' => "&otilde;", '÷' => "&divide;", 'ù' => "&ugrave;", 'û' => "&ucirc;", 'ý' => "&yacute;", 'ÿ' => "&yuml;");
        while( ( my $key, my $obj ) = each( %entityRef ) ) {
            if( $key ne '&' ) {
                    $str =~ s/$key/$obj/gis
            } else {
                    $str =~ s#&((?!(quot;)|(amp;)|(cent;)|(curren;)|(brvbar;)|(uml;)|(ordf;)|(not;)|(reg;)|(deg;)|(sup2;)|(acute;)|(para;)|(cedil;)|(ordm;)|(frac14;)|(frac34;)|(Agrave;)|(Acirc;)|(Auml;)|(AElig;)|(Egrave;)|(Ecirc;)|(Igrave;)|(Icirc;)|(ETH;)|(Ograve;)|(Ocirc;)|(Ouml;)|(Oslash;)|(Uacute;)|(Uuml;)|(THORN;)|(agrave;)|(acirc;)|(auml;)|(aelig;)|(egrave;)|(ecirc;)|(igrave;)|(icirc;)|(eth;)|(ograve;)|(ocirc;)|(ouml;)|(oslash;)|(uacute;)|(uuml;)|(thorn;)|(iexcl;)|(pound;)|(yen;)|(sect;)|(copy;)|(laquo;)|(macr;)|(plusmn;)|(sup3;)|(micro;)|(middot;)|(sup1;)|(raquo;)|(frac12;)|(iquest;)|(Aacute;)|(Atilde;)|(Aring;)|(Ccedil;)|(Eacute;)|(Euml;)|(Iacute;)|(Iuml;)|(Ntilde;)|(Oacute;)|(Otilde;)|(times;)|(Ugrave;)|(Ucirc;)|(Yacute;)|(szlig;)|(aacute;)|(atilde;)|(aring;)|(ccedil;)|(eacute;)|(euml;)|(iacute;)|(iuml;)|(ntilde;)|(oacute;)|(otilde;)|(divide;)|(ugrave;)|(ucirc;)|(yacute;)|(yuml;)|(nbsp;)))#$obj#gis;   
            }
        }
        return $str;
    }
    
  • FalseVinylShrub
    FalseVinylShrub about 14 years
    P.S. As noted by dma_k, if you continue with your own subroutine unicodeConvert, you'll need to either add use utf8 so that the Unicode characters in the source code are recognised, or convert them to numeric escapes, e.g. "\x{263a}".
  • brian d foy
    brian d foy about 14 years
    ':utf8' is a shortcut for ':encoding(UTF-8)'
  • FalseVinylShrub
    FalseVinylShrub about 14 years
    Thanks brian... I normally use ':utf8', but when checking the docs, it seemed there was a difference and ':encoding(UTF-8)' was safer for input files (found it: perluniintro, Unicode I/O, just below a big list of binmode examples)... can you clarify if this is true? Thanks.