How to extract capturing groups on the Linux command line in a php/preg fashion?
Solution 1
pcregrep -io1 'something="(\w+)"' myfile.txt
(-i for case insensitive matching, -o1 to print the first capture group).
GNU grep supports -P (if built with perl compatible regex support) and -o. However, its -o is limited to printing the whole matched portion. You can, however, use perl look-around operators to work around that:
grep -iPo '(?<=something=")\w+(?=")' myfile.txt
(that is, a regexp that matches a sequence of word component characters provided it follows something=" and is followed by ").
Or with recent enough PCRE:
grep -iPo 'something="\K\w+(?=")' myfile.txt
(where \K resets the start of the matched string).
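As a minimal sketch of the difference between the three grep invocations, assuming a GNU grep built with PCRE support and a made-up two-line sample file (the file contents and variable names are only for illustration):

```shell
# Hypothetical sample input, created only for this demonstration.
tmp=$(mktemp)
printf '%s\n' 'foo something="nice" bar' 'Something="other"' > "$tmp"

# Plain -o prints the whole match, wrapper included.
whole=$(grep -io 'something="[[:alnum:]_]*"' "$tmp")

# Zero-width look-behind/look-ahead keep the wrapper out of the match.
lookaround=$(grep -iPo '(?<=something=")\w+(?=")' "$tmp")

# \K drops everything matched so far from the reported match.
reset=$(grep -iPo 'something="\K\w+(?=")' "$tmp")

printf '%s\n' "$whole" "$lookaround" "$reset"
rm -f "$tmp"
```

The first command should print the full something="..." text, while the other two should print only the words between the quotes.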
But if you're going to use perl regexps, you might as well use perl:
perl -C -lne 'print for /something="(\w+)"/ig' myfile.txt
With GNU or BSD sed, to return only the right-most match per line:
sed -nE 's/.*something="(\w+)".*/\1/pi' myfile.txt
Portably (as extended regex support and case insensitive matching are non-standard extensions not supported by all sed implementations):
sed -n 's/.*[sS][oO][mM][eE][tT][hH][iI][nN][gG]="\([[:alnum:]_]\{1,\}\)".*/\1/p' myfile.txt
That one assumes uppercase i is I. That means that in locales where uppercase i is İ for instance, the behaviour will be different from the previous solution.
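To illustrate why the sed approach returns only the right-most match, here is a minimal sketch using the portable command on a made-up line with two occurrences. The leading .* is greedy, so it swallows the first occurrence and leaves only the last one for the capture group.

```shell
# Hypothetical input line with two occurrences.
line='something="one" x something="two"'

# The greedy leading .* consumes the first occurrence; \1 holds the last one.
last=$(printf '%s\n' "$line" |
  sed -n 's/.*[sS][oO][mM][eE][tT][hH][iI][nN][gG]="\([[:alnum:]_]\{1,\}\)".*/\1/p')

printf '%s\n' "$last"
```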
A standard/portable solution that can find all the occurrences on a line:
awk '{while(match(tolower($0), /something="[[:alnum:]_]+"/)) {
print substr($0, RSTART+11, RLENGTH-12)
$0 = substr($0, RSTART+RLENGTH-1)}}' myfile.txt
That may not work correctly if the input contains text whose lower case version doesn't have the same length (in number of characters).
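A quick sketch of the awk loop on a made-up line, assuming an awk that supports POSIX character classes such as [[:alnum:]]. The offsets come from the fixed prefix: something=" is 11 characters long, so the value starts at RSTART+11 and is RLENGTH-12 characters long (prefix plus closing quote removed).

```shell
# Hypothetical input line with two occurrences, differing in case.
line='something="one" x SOMETHING="two_2"'

all=$(printf '%s\n' "$line" |
  awk '{while(match(tolower($0), /something="[[:alnum:]_]+"/)) {
    # Print the value between the quotes, taken from the original-case text.
    print substr($0, RSTART+11, RLENGTH-12)
    # Continue searching from the closing quote of this match.
    $0 = substr($0, RSTART+RLENGTH-1)}}')

printf '%s\n' "$all"
```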
Gotchas:
- There will be some variations between all those solutions on what \w (and [[:alnum:]_]) matches in locales other than the C/POSIX one. In any case it should at least include underscore, all the decimal arabic digits and the letters from the latin English alphabet (uppercase and lower case). If you want only those, fix the locale to C.
- As already mentioned, case insensitive matching is very much locale-dependent. If you only care about a-z vs A-Z English letters, you can fix the locale to C again.
- The . regexp operator, with GNU implementations of sed at least, will never match sequences of bytes that are not part of a valid character. In a UTF-8 locale, for instance, that means that it won't match characters from a single-byte charset with the 8th bit set. Or in other words, for the sed solution to work properly, the character set used in the input file must be the same as the one in the user's locale.
- perl, pcregrep and GNU utilities will generally work with lines of any length, and containing any arbitrary byte value (but note the caveat above), and will consider the extra data after the last newline character as an extra line. Other implementations of those utilities may not.
- The patterns above are matched in turn against each line in the input. That means that they can't match more than one line of input. Not a problem for a pattern like something="\w+" that can't span more than one line, but in the general case, if you want your pattern to match text that may span several lines like something=".*?", then you'd need to either:
  - change the type of record you work on. grep -z (--null-data), sed -z (GNU sed only), perl -0, awk -v RS='\0' (GNU awk and recent versions of mawk only) can work on NUL-delimited records instead of lines (newline delimited records), GNU awk can use any regexp as the record separator (with -v RS='regexp'), perl any byte value (with -0ooo, where ooo is the byte value in octal). pcregrep has a -M multiline mode for that.
  - use perl's slurp mode, where the whole input is the one record (with -0777)

  Then, for perl and pcre ones, beware that . will not match newline characters unless the s flag is enabled, for instance with pcregrep -Mio1 '(?s)something="(.*?)"' or perl -C -l -0777 -ne 'print for /something="(.*?)"/gis'
- Beware that some versions of grep and pcregrep have had bugs with -z or -M, and regexp engines in general can have some built-in limits on the amount of effort they may put into matching a regexp.
Solution 2
On Linux you have multiple commands and each one has different features. Your job is to find the right tool for the given job. ;)
You did not really specify a concrete problem, so I need to stay general.
Maybe the easiest way is to use perl directly:
cat file.txt | perl -wlne '/([\w]+)/i and print $1'
Also read man grep for some options of grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
You can use for example:
cat file.txt | grep -o '\w*'
But what is best really depends on your problem. If you like php, you can actually even use php from command line.
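As a small illustration of the -o behaviour quoted from the man page, here is a sketch on a made-up input string, assuming a grep that understands \w (a GNU extension). Each non-empty match is printed on its own line and the separators are dropped.

```shell
# Hypothetical input; -o prints every non-empty match on a separate line.
words=$(printf '%s\n' 'foo bar_1, baz' | grep -o '\w*')

printf '%s\n' "$words"
```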
Solution 3
This is another answer based on perl. This one uses perl -ne, which feeds every line of the input to the perl program.
The perl program has an if statement containing your regex with the capture group and, when we find a match, we print it.
When we print the capture group we add a newline. The newline is essential to ensure that multiple matches are separated by a newline; otherwise all your results will be mashed together on the same line and may produce an unexpected or undesirable result.
Should we get multiple lines matching the capture group, most of the time we are only interested in the first matching line, hence the head -1 usage.
The following bash script illustrates how we may use this to process the input file and save the extracted result into the value variable.
cat file.txt # something="nice"
value=$(cat file.txt | perl -ne 'if (/something="([\w]+)"/) { print $1 . "\n" }' | head -1)
echo "$value" # nice
user3450548
Updated on September 18, 2022
Comments
- user3450548 over 1 year
Given that a Linux environment has lots of packages for manipulating strings (grep, awk, sed, ...), I would like a tool to extract a capturing group using a php/preg-like syntax.
Maybe the closest one is grep -P but I don't get how it works. Stuff like
cat file.txt | grep -P '/something="([\w]+)"/i'
does not seem to give me only the content inside the capturing group.
Could someone provide me with some working examples? Several, please, with some variants and limits explained!
EDIT: I saw sed used somewhere for this purpose but I'm still a bit confused about its syntax.
- michas about 8 years
  What exactly is the problem you are trying to solve?
- user3450548 about 8 years
  Using perl directly is a very nice solution, I like it. Both answers are very nice!