Perl one-liner to extract a multi-line pattern

perl bash sed awk perl-module

12,149

Solution 1

The regex does not match even the single line. What do you think the double parentheses do?

You probably wanted

m/^\s*(\w+)\s+(\w+?)\s*\([\w0-9,*\s]+\)\s\{/gm

Update: The specification has changed. The regex has (almost) not, but you have to change the code slightly:

perl -0777 -nle 'print "$1\n" while m/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s\{)/gm'

Another update, with an explanation:

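The switches are described in perlrun: -0777 reads the whole file as a single record, -n loops over the input without printing it automatically, -l enables automatic line-ending processing, and -e supplies the code to run.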

The regex can be explained automatically by YAPE::Regex::Explain:

perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{)/)->explain'
The regular expression:

(?-imsx:^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \w+?                     word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the least amount
                             possible))
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \(                       '('
----------------------------------------------------------------------
    [\w0-9,*\s]+             any character of: word characters (a-z,
                             A-Z, 0-9, _), '0' to '9', ',', '*',
                             whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \)                       ')'
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    {                        '{'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

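The /gm switches are explained in perlre: /g finds every match rather than only the first, and /m lets ^ match at the start of each line rather than only at the start of the string.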

Solution 2

Use the Flip-Flop Operator for a One-Liner

Perl makes this really easy with the flip-flop operator, which will allow you to print out all the lines between two regular expressions. For example:

$ perl -ne 'print if /^abcd25/ ... /\bhj \) \{/' /tmp/foo
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {

However, a simple one-liner like this cannot reject a range based on what appears between the delimiting patterns: every range that starts and ends with the right lines is printed. Filtering those out calls for a more complex approach.

More Complicated Comparisons Benefit from Conditional Branching

One-liners aren't always the best choice, and regular expressions can get out of hand quickly if they become too complex. In such situations, you're better off writing an actual program that can use conditional branching rather than trying to use an over-clever regular expression match.

One way to do this is to build up your match with a simple pattern, and then reject any match that doesn't match some other simple pattern. For example:

#!/usr/bin/perl -nw

# Use the flip-flop operator to collect every line inside a range.
if (/^abcd25/ ... /\bhj \) \{/) {
    push @string, $_;
}

# At the end of a range, keep the collected lines only if they contain
# a particular expression between the delimiters. For example, a range
# containing "( fg" is kept, while one that only has "fg" at the start
# of a line is rejected.
if (/\bhj \) \{/) {
    $string = join("", @string);
    undef @string;
    push(@matches, $string) if $string =~ /\( fg/;
}

END { print @matches }

When run against the OP's updated corpus, this correctly yields:

abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
abcd25 ef_gh ( fg*_h hj_b* hj ) {


Author by

Gil

I'm someone good in a storm.

Updated on July 18, 2022

Comments

Gil almost 2 years
I have a pattern in a file, shown below, which may or may not span multiple lines:
```
 abcd25
 ef_gh
 ( fg*_h
 hj_b*
 hj ) {
```
What I have tried:

perl -nle 'print while m/^\s*(\w+)\s+(\w+?)\s*(([\w-0-9,* \s]))\s{/gm'

I don't know what the flags mean here; all I did was write a regex for the pattern and insert it in the pattern space. This matches well if the pattern is on a single line, as in:
```
abcd25 ef_gh ( fg*_h hj_b* hj ) {
```
But it fails in the multi-line case!

I started with Perl yesterday, but the syntax is way too confusing. So, as suggested by a fellow SO user, I wrote a regex and inserted it into the code he provided.

I hope a Perl monk can help me in this case. Alternative solutions are welcome.

Input file:
```
 abcd25
 ef_gh
 ( fg*_h
 hj_b*
 hj ) {

 abcd25
 ef_gh
 fg*_h
 hj_b*
 hj ) {

 jhijdsiokdù ()lmolmlxjk;
 abcd25 ef_gh ( fg*_h hj_b* hj ) {
```
Expected output:
```
 abcd25
 ef_gh
 ( fg*_h
 hj_b*
 hj ) {
 abcd25 ef_gh ( fg*_h hj_b* hj ) {
```
The input file can have multiple patterns which coincide with the start and end pattern of the required pattern. Thanks in advance for the replies.
Gil over 11 years

I am not sure what the double parentheses do :( I wrote the regex via a simulator ;)
Gil over 11 years

Now the single-line match is OK, but I'm still stuck on the multi-line case!
Gil over 11 years

Yes, but this will interfere with other patterns in the file.
Todd A. Jacobs over 11 years

@Geekasaur Sorry, but this exactly matches your corpus and your expected output, as currently defined in your question. Please update your question if you have other and/or additional requirements.
pavel over 11 years

@Geekasaur: the above pattern also works with multi line input!
Gil over 11 years

gnome: Sorry for not being specific. I will update the question to give a better idea.
Todd A. Jacobs over 11 years

@Geekasaur If you change start-of-line to start-of-word, how does perl -ne 'print if /^abcd25/ ... /\bhj \) {/' /tmp/foo not do what you want?
Gil over 11 years

Yes, it does extract the pattern, but it throws in unwanted matches too! Maybe if you add a brief description to your code, I can tweak my regex a bit, for example so that the end pattern of the match must be at the beginning of the line.
Gil over 11 years

@pavel Thanks, indeed it does :) Can you add a brief description of the flags used and what Perl does in this scenario?