How to extract capturing groups on the Linux command line in a php/preg fashion?

linux debian grep regular-expression php

Solution 1

pcregrep -io1 'something="(\w+)"' myfile.txt

(-i for case insensitive matching, -o1 to print the first capture group).

GNU grep supports -P (if built with perl compatible regex support) and -o. However, its -o can only print the whole matched portion, not a capture group. You can work around that with perl look-around operators:

grep -iPo '(?<=something=")\w+(?=")' myfile.txt

(that is, a regexp that matches a sequence of word component characters, provided it follows something=" and is followed by ").

Or with recent enough PCRE:

grep -iPo 'something="\K\w+(?=")' myfile.txt

(where \K resets the start of the matched string).
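
As a quick check of the two grep forms above, the sketch below runs both against a small sample file. The file contents and variable names are made up for illustration, and it assumes a GNU grep built with PCRE support.

```shell
# Hypothetical sample input (not from the question); the path comes from mktemp.
sample=$(mktemp)
printf '%s\n' 'foo something="nice" bar' 'SOMETHING="Other_1"' 'no match here' > "$sample"

# Look-around form: only the \w+ part belongs to the reported match.
lookaround=$(grep -iPo '(?<=something=")\w+(?=")' "$sample")

# \K form: whatever was matched before \K is dropped from the reported match.
reset=$(grep -iPo 'something="\K\w+(?=")' "$sample")

printf '%s\n' "$lookaround"
printf '%s\n' "$reset"
rm -f "$sample"
```

Both commands should print the same two values, one per line, and skip the line without a match.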

But if you're going to use perl regexps, you might as well use perl:

perl -C -lne 'print for /something="(\w+)"/ig' myfile.txt

With GNU or BSD sed, to return only the right-most match per line:

sed -nE 's/.*something="(\w+)".*/\1/pi' myfile.txt

Portably (as extended regex support and case insensitive matching are non-standard extensions not supported by all sed implementations):

sed -n 's/.*[sS][oO][mM][eE][tT][hH][iI][nN][gG]="\([[:alnum:]_]\{1,\}\)".*/\1/p' myfile.txt

That one assumes uppercase i is I. That means that in locales where uppercase i is İ for instance, the behaviour will be different from the previous solution.

A standard/portable solution that can find all the occurrences on a line:

awk '{while(match(tolower($0), /something="[[:alnum:]_]+"/)) {
    print substr($0, RSTART+11, RLENGTH-12)
    $0 = substr($0, RSTART+RLENGTH-1)}}' myfile.txt

That may not work correctly if the input contains text whose lower case version doesn't have the same length (in number of characters).
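
A short usage sketch for the awk command above, on a made-up line with two differently cased occurrences. It assumes an awk that supports POSIX character classes such as [[:alnum:]] (gawk and recent mawk do; some old mawk versions don't).

```shell
# Hypothetical input line; the values keep their original case in the output
# because substr() is applied to $0, not to the lowercased copy.
line='a Something="one" b SOMETHING="two_2" c'

awk_out=$(printf '%s\n' "$line" | awk '{while(match(tolower($0), /something="[[:alnum:]_]+"/)) {
    print substr($0, RSTART+11, RLENGTH-12)
    $0 = substr($0, RSTART+RLENGTH-1)}}')

printf '%s\n' "$awk_out"
```

The offsets 11 and 12 come from the length of something=" (11 characters) plus the closing quote.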

Gotchas:

There will be some variations between all those solutions on what \w (and [[:alnum:]_]) matches in locales other than the C/POSIX one. In any case it should at least include underscore, all the decimal Arabic digits and the letters of the Latin English alphabet (uppercase and lowercase). If you want only those, fix the locale to C.
As already mentioned, case insensitive matching is very much locale-dependent. If you only care about a-z vs A-Z English letters, you can fix the locale to C again.
The . regexp operator, with GNU implementations of sed at least, will never match sequences of bytes that are not part of a valid character. In a UTF-8 locale, for instance, that means that it won't match characters from a single-byte charset with the 8th bit set. In other words, for the sed solution to work properly, the character set used in the input file must be the same as the one in the user's locale.
perl, pcregrep and GNU utilities will generally work with lines of any length, and containing any arbitrary byte value (but note the caveat above), and will consider the extra data after the last newline character as an extra line. Other implementations of those utilities may not.
The patterns above are matched in turn against each line of the input, so they can't match across more than one line. That is not a problem for a pattern like something="\w+", which can't span more than one line, but in the general case, if you want your pattern to match text that may span several lines like something=".*?", then you'd need to either:
- change the type of record you work on. grep -z, sed -z (GNU sed only), perl -0, awk -v RS='\0' (GNU awk and recent versions of mawk only) can work on NUL-delimited records instead of lines (newline-delimited records). GNU awk can use any regexp as the record separator (with -v RS='regexp'), and perl any byte value (with -0ooo).
- use pcregrep's -M multiline mode.
- use perl's slurp mode, where the whole input is the one record (with -0777)
Then, for perl and pcre ones, beware that . will not match newline characters unless the s flag is enabled, for instance with pcregrep -Mio1 '(?s)something="(.*?)"' or perl -C -l -0777 -ne 'print for /something="(.*?)"/gis'
Beware that some versions of grep and pcregrep have had bugs with -z or -M, and regexp engines in general can have some built-in limits on the amount of effort they may put into matching a regexp.

Solution 2

On Linux there are many commands, each with different features. Your job is to find the right tool for the given job. ;)

You did not really specify a concrete problem, so I need to stay general.

Maybe the easiest way is to use perl directly:

perl -wlne '/(\w+)/i and print $1' file.txt

Also read man grep for some options of grep:

   -o, --only-matching
          Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

You can use for example:

cat file.txt | grep -o '\w*'

But what is best really depends on your problem. If you like php, you can actually even use php from command line.

Solution 3

This is another answer based on perl. It uses perl -ne, which runs the perl program once for each line of the input.

The perl program has an if statement containing your regex with the capture group; when a line matches, we print the group.

When we print the capture group we add a newline. The newline is essential to keep multiple matches on separate lines; otherwise all the results run together on one line, which is rarely what you want.

If several lines match, we are usually only interested in the first one, hence the head -1.

The following bash script illustrates how we may use this to process the input file and save the extracted result into the value variable.

cat file.txt # something="nice"
value=$(perl -ne 'if (/something="(\w+)"/) { print "$1\n" }' file.txt | head -1)
echo "$value" # nice

6,845

user3450548

Updated on September 18, 2022

Comments

user3450548 over 1 year

Given that a Linux environment has lots of packages for manipulating strings (grep, awk, sed, ...), I would like a tool that extracts a capturing group using a php/preg-like syntax.

Maybe the closest one is grep -P, but I don't get how it works.

Stuff like cat file.txt | grep -P '/something="([\w]+)"/i' seems not to give me only the content inside the capturing group.

Could someone provide me with some working examples? Several, please, with some variants and limits explained!

EDIT: I saw sed used somewhere for this purpose, but I'm still a bit confused about its syntax.
- michas about 8 years
  
  What exactly is the problem you are trying to solve?
user3450548 about 8 years

Using perl directly is a very nice solution, I like it. Both answers are very nice!