Removing newline character from a string in Perl

Solution 1

The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is to use the \R regex metacharacter, introduced in Perl v5.10.

The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.

 use v5.10;                     # minimal Perl version for \R support
 use utf8;                      # source is in UTF-8
 use warnings qw(FATAL utf8);   # encoding errors raise exceptions
 use open qw(:utf8 :std);       # default open mode, `backticks`, and std{in,out,err} are in UTF-8

 while (<>) {
     s/\R\z//;
     ...
 }
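
As a rough illustration of the difference, the sketch below compares chomp with s/\R\z// on an in-memory string that ends in CRLF. The helper name strip_eol and the sample string are made up for this example.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: strip one trailing linebreak of any flavour.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return $line;
}

my $crlf = "foo bar\r\n";

my $chomped = $crlf;
chomp $chomped;                  # $/ is "\n" by default, so only "\n" goes

say length $chomped;             # 8: the "\r" is still there
say length strip_eol($crlf);     # 7: "\r\n" removed as one unit
```

Because \R treats "\r\n" as a single linebreak, the same substitution should work for files saved with Unix, Windows, or old Mac line endings.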

Solution 2

You are probably seeing problems caused by Windows line endings in the file. For example, a string that you expect to be "foo bar\n" would actually be "foo bar\r\n". When you use chomp on Ubuntu, you remove whatever is contained in the variable $/, which is "\n" by default. So what remains is "foo bar\r".

This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:

my $var = "foo bar\r\n";
chomp $var;
print "$var\n";  # Remove and put back newline

But when you print another string right after it on the same line, you overwrite the first string, because \r moves the cursor back to the beginning of the line. For example:

print "$var: WRONG\n";

The output is effectively "foo bar\r: WRONG\n", but the \r causes the text that follows it to be printed on top of the first part:

foo bar\r           # \r resets position
 : WRONG\n          # Second line prints and overwrites

This is more obvious when the first line is longer than the second. For example, try the following:

perl -we 'print "foo bar\rbaz\n"'

And you will get the output:

baz bar
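
One way to confirm that a stray \r is present is to print the string with its control characters made visible. The sketch below uses a hypothetical helper, visible, which is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: show \r and \n as literal backslash escapes.
sub visible {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    return $text;
}

my $var = "foo bar\r\n";
chomp $var;                    # removes only "\n" when $/ is "\n"
print visible($var), "\n";     # prints: foo bar\r
```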

The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:

$line =~ s/[\r\n]+$//;
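
A minimal sketch of that substitution inside a read loop follows. It assumes an in-memory string standing in for a file saved with Windows line endings; the helper name clean_line and the sample file names are invented for illustration.

```perl
use strict;
use warnings;

# Sample data standing in for a file saved with Windows line endings.
my $data = "report.docx\r\nnotes.docx\r\n";

# Hypothetical helper: drop any run of trailing \r and \n characters.
sub clean_line {
    my ($line) = @_;
    $line =~ s/[\r\n]+$//;
    return $line;
}

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {
    $line = clean_line($line);
    print "[$line] OK\n";      # the brackets show that no "\r" is left
}
close $fh;
```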

Also, be aware that your other code is somewhat horrific. What, for example, do you think $13 contains? That would be the string captured by the 13th capture group in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 capture groups.

You declare two sets of $id and $name: one at the top of the script, outside the loop, and one inside the loop. This is very poor practice, IMO. Only declare variables within the scope where they are needed, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.

Why use $line and $line2 when they have the same value? Just use $line.

And seriously, what is up with this:

if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {

That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?

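First off, since it is an if-else, just swap the branches and invert the match. Second, [^\W_], a double negation, is rather confusing. Why not just use [A-Za-z0-9]? (That class is ASCII-only, whereas \w in Perl also matches non-ASCII Unicode letters and digits, so the two are not strictly equivalent.) You can split this up to make it easier to parse: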

if ($line =~ /^(.+)(\.docx)\s*$/) {
    my $pre = $1;   # everything before the extension
    my $ext = $2;   # the literal ".docx"
    ...
}
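
To see how the simplified pattern behaves on both a matching and a non-matching line, here is a small self-contained sketch. The helper parse_docx and the sample file names are assumptions made for this example, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (base name, extension) for a .docx line,
# or an empty list when the line does not match.
sub parse_docx {
    my ($line) = @_;
    if ($line =~ /^(.+)(\.docx)\s*$/) {
        return ($1, $2);
    }
    return;
}

for my $line ("thesis_draft.docx\r\n", "readme.txt\n") {
    my ($pre, $ext) = parse_docx($line);
    if (defined $pre) {
        print "$pre\t$ext\n";
    }
    else {
        (my $shown = $line) =~ s/\s+$//;   # trim the line ending for display
        print "$shown WRONG FORMAT!\n";
    }
}
```

Since \s* absorbs any trailing "\r" and "\n", the pattern does not depend on which line-ending convention the input file uses.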

Solution 3

You can wipe the linebreaks with something like this:

$line =~ s/[\n\r]//g;

When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.

I also wouldn't do this type of thing:

print $line2." WRONG FORMAT!\n";

You can do

print "$line2 WRONG FORMAT!\n";

... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.
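
The three forms can be compared side by side. This sketch writes to an in-memory filehandle so the result is easy to inspect; the sample value of $line is made up, and the output is identical only as long as the special variable $, is left at its default.

```perl
use strict;
use warnings;

my $line = "summary.doc";

# Capture output in a scalar so the three forms can be compared.
my $out = '';
open my $fh, '>', \$out or die "Cannot open in-memory handle: $!";

print {$fh} $line . " WRONG FORMAT!\n";   # concatenation
print {$fh} "$line WRONG FORMAT!\n";      # interpolation
print {$fh} $line, " WRONG FORMAT!\n";    # list of arguments

close $fh;
print $out;                               # the same line three times
```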

Solution 4

You can do something like:

$line =~ tr/\n//d;

But really chomp should work:

while (<filehandle>){
   chomp;
   ...
}

Also s/\n|\r// only replaces the first occurrence of \r or \n. If you wanted to replace all occurrences you would want the global modifier at the end s/\r|\n//g.

Note: if you're including \r for Windows, its lines usually end in \r\n, so you would want to remove both together (e.g. s/(?:\r\n|\n)//). Of course, the statement above (s/\r|\n//g) with the global modifier takes care of that anyway.
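
The effect of the global modifier can be checked with a short sketch like the one below. The sample string is invented, and the tr example uses the /d modifier, which is what makes tr delete characters rather than just count them.

```perl
use strict;
use warnings;

my $text = "one\r\ntwo\r\nthree\r\n";

(my $first_only = $text) =~ s/\r|\n//;     # removes only the first "\r"
(my $all_gone   = $text) =~ s/\r|\n//g;    # removes every "\r" and "\n"
(my $via_tr     = $text) =~ tr/\r\n//d;    # same result, using tr with /d

print length($text), "\n";         # 17
print length($first_only), "\n";   # 16
print "$all_gone\n";               # onetwothree
print "$via_tr\n";                 # onetwothree
```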

Solution 5

$variable = join('', split(/\n/, $variable));
Author: erogol

Updated on July 09, 2022

Comments

  • erogol, almost 2 years ago

    I have a string that is read from a text file on Ubuntu Linux, and I am trying to delete the newline character from its end.

    I have tried every approach I could find. With s/\n|\r/-/ (which I used to check whether it finds and replaces any newline), the substitution happens, but the output still moves to the next line when I print it. Moreover, when I use chomp or chop, the string is completely deleted. I could not find any other solution. How can I fix this problem?

    use strict;
    use warnings;
    use v5.12;
    use utf8;
    use encoding "utf-8";
    
    open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
    
    my @strings;
    my @fileNames;
    my @erroredFileNames;
    
    my $delimiter;
    my $extensions;
    my $id;
    my $surname;
    my $name;
    
    while (<MYINPUTFILE>)
    {
        my ($line) = $_;
        my ($line2) = $_;
        if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
            #chop($line2);
            $line2 =~ s/^\n+//;
            print $line2 . " WRONG FORMAT!\n";
        }
        else {
            #print "INSERTED:".$13."\n";
            my($id) = $13;
            my($name) = $2;
            print $name . "\t" . $id . "\n";
            unshift(@fileNames, $line2);
            unshift(@strings, $line2 =~ /[^\W_]+/g);
        }
    }
    close(MYINPUTFILE);
    
    • tchrist, about 12 years ago
      @TLP Please don’t pretend that Perl character classes have ASCII definitions, because that’s quite wrong in Perl. You have to use the definitions from UTS#18 Annex C.
    • tchrist, about 12 years ago
      @TLP Yes, of course it isn’t. \w is equal to [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]. This is well-known. It covers 102,724 code points as of Unicode v6.0, which is three orders of magnitude more of them than the scant 63 that you mention.
  • tchrist, about 12 years ago
    @stackoverflow Provided you do $data =~ s/\R//g that could work; notice I removed the \z boundary. Not sure why you want all the newlines gone.
  • Ωmega, about 12 years ago
    How about $/=undef; $data=<MYINPUTFILE>; $data=~s/\R//g; ..?
  • tchrist, about 12 years ago
    @stackoverflow Sure, that’s fine.