How to find all patterns between two characters?

command-line text-processing regex

21,848

Solution 1

First of all, your grep -Po '"\K[^"]*' file idea fails because the pattern never consumes the closing quote, so grep sees both "One" and ". the second is here" as being inside quotes. Personally, I'd probably just do

$ grep -oP '"[^"]+"' file | tr -d '"'
One
Two 
 Three 
Four

But the grep/tr pipeline above is two commands. To do it with a single command, you could use one of:

Perl
```
$ perl -lne '@F=/"\s*([^"]+)\s*"/g; print for @F' file 
One
Two 
Three 
Four
```
Here, the @F array holds all matches of the regex (a quote, optional whitespace, then as many non-" characters as possible until the next "). The print for @F just means "print each element of @F".

Perl

$ perl -F'"' -lane 'for($i=1;$i<=$#F;$i+=2){print $F[$i]}' file
One
Two 
 Three 
Four

To remove leading/trailing spaces from each match, use this:

perl -F'"' -lane 'for($i=1;$i<=$#F;$i+=2){$F[$i]=~s/^\s+|\s+$//g; print $F[$i]}' file

Here, Perl is behaving like awk. The -a switch (implied by -F on Perl 5.20 and later) causes it to automatically split input lines into fields, stored in the @F array, on the character given by -F. Since I have given it ", the fields are:

$ perl -F'"' -lane 'for($i=0;$i<=$#F;$i++){print "Field $i: $F[$i]"}' file
Field 0: first matched is 
Field 1: One
Field 2: . the second is here
Field 3: Two 
Field 0: and here are in second line
Field 1:  Three 
Field 2: 
Field 3: Four
Field 4: .

Because we are looking for text between two consecutive field separators, we know we want every second field. So, for($i=1;$i<=$#F;$i+=2){print $F[$i]} will print the ones we care about.

The same idea but in awk:

$ awk -F'"' '{for(i=2;i<=NF;i+=2){print $(i)}}' file 
One
Two 
 Three 
Four
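
The awk version can be extended to trim whitespace too. This is a minimal sketch under two assumptions: only spaces and tabs need trimming, and a trailing field with an opening quote but no closing one should be skipped (hence i < NF instead of i <= NF). The function name extract_quoted is made up for illustration:

```shell
# Hypothetical helper: print the trimmed text between each pair of quotes.
extract_quoted() {
    awk -F'"' '{
        # i < NF (rather than i <= NF) skips a final field that has an
        # opening quote but no closing one.
        for (i = 2; i < NF; i += 2) {
            s = $i
            gsub(/^[ \t]+|[ \t]+$/, "", s)   # trim both ends of the copy
            print s
        }
    }' "$@"
}

printf '%s\n' 'first matched is "One". the second is here"Two "' \
              'and here are in second line" Three ""Four".' | extract_quoted
```

With balanced quotes NF is always odd, so both loop bounds select the same fields; they differ only on lines with a stray quote.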

Solution 2

The key is to consume the quotes in your expression. That is hard to do with a single grep command. Here's a perl one-liner:

perl -0777 -nE 'say for /"(.*?)"/sg' file

That slurps the whole input and prints out the captured part of each match. It will work even if there's a newline inside the quotes, although it then becomes difficult to tell elements with and without newlines apart in the output. To help with that, use a different character as the output record separator, the null character for instance:

perl -0777 -lne 'print for /"(.*?)"/sg} BEGIN {$\="\0"' <<DATA | od -c
blah "first" blah "second
quote with newline" blah "third"
DATA

0000000   f   i   r   s   t  \0   s   e   c   o   n   d  \n   q   u   o
0000020   t   e       w   i   t   h       n   e   w   l   i   n   e  \0
0000040   t   h   i   r   d  \0
0000046

Solution 3

This is possible with the grep one-liner below, assuming that the quotation marks on each line are balanced.

grep -oP '"\s*\K[^"]+?(?=\s*"(?:[^"]*"[^"]*")*[^"]*$)' file

Example:

$ cat file
first matched is "One". the second is here"Two "
and here are in second line" Three ""Four".
$ grep -oP '"\s*\K[^"]+?(?=\s*"(?:[^"]*"[^"]*")*[^"]*$)' file
One
Two
Three
Four
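
Because the pattern relies on every line having balanced quotation marks, it may be worth checking that assumption first. One possible check (a sketch; the function name unbalanced_quotes is hypothetical) counts the double quotes on each line with awk and reports the lines where the count is odd:

```shell
# Print "line-number: text" for every line with an odd number of double quotes.
# gsub() returns the number of substitutions made; replacing each quote
# with "&" (the matched text) leaves the line unchanged.
unbalanced_quotes() {
    awk '{ n = gsub(/"/, "&") } n % 2 { print NR ": " $0 }' "$@"
}

printf '%s\n' 'first matched is "One". the second is here"Two "' \
              'broken line with "one quote' | unbalanced_quotes
```

No output means every line has an even number of quotes, which is the condition the lookahead in the grep pattern depends on.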

Another, hair-pulling solution uses the PCRE verbs (*SKIP)(*F):

$ grep -oP '[^"]+(?=(?:"[^"]*"[^"]*)*[^"]*$)(*SKIP)(*F)|\s*\K[^"]+(?=\b\s*)' file
One
Two
Three
Four


Author by

αғsнιη


Updated on September 18, 2022

Comments

αғsнιη over 1 year
I'm trying to find all patterns between a pair of double quotes. Let's say I have a file whose contents look like the following:
```
first matched is "One". the second is here"Two "
and here are in second line" Three ""Four".
```
I want the words below as output:
```
One
Two
Three
Four
```
As you can see, all strings in the output are between a pair of quotes.

What I tried is this command:
```
grep -Po ' "\K[^"]*' file
```
The above command works fine if there is a space before the opening quote of each pair. For example, it works if my input file contains the following:
```
first matched is "One". the second is here "Two "
and here are in second line " Three " "Four".
```
I know I can do this by combining multiple commands, but I'm looking for a single command, without running the same tool more than once as in the command below:
```
grep -oP '"[^"]*"' file | grep -oP '[^"]*'
```
How can I achieve/print all of my patterns just using one command?

Reply to comments: Removing the whitespace around the matched pattern inside a pair of quotes is not important to me, but it would be better if the command supported that too. My files also contain nested quotes like "foo "bar" zoo". All of the quoted words are on a single line each and do not span multiple lines.

Thanks in advance.
- Admin over 9 years
  
  Can you have nested quotes? Things like "foo "bar""? If yes, how should those be dealt with?
- Admin over 9 years
  
  @terdon I wrote I think "One". the second is here "Two " and also " Three ""Four" are nested. isn't it?
- Admin over 9 years
  
  No, nested would be where the first quote includes the second. Yours are just next to each other. Nested: "foo "bar" baz"; not nested: "foo""bar".
- Admin over 9 years
  
  Is it possible for the quoted text to contain newlines?
- Admin over 9 years
  
  @KasiyA could you post a single example which satisfies all your needs along with the expected output?
- Admin over 9 years
  
  @KasiyA added an answer, check it :-)
αғsнιη over 9 years

Is there any option to remove the last printed character, like \b in C++ for example?
terdon over 9 years

@KasiyA where? What printed char? From which of the suggestions?
αғsнιη over 9 years

Thank you Glenn. My command grep -Po ' "\K[^"]*' file works if there is a single space before the opening quote of each pair in my input file. Is there a regex I could put in place of that space ( ... -Po '[HERE]"\K ... ) so that it matches any character, such as [a-zA-Z], instead of only a space?
terdon over 9 years

@KasiyA no. The problem is that grep will match the One and print it. Then, the remaining text is ". the second is here" which also matches. I don't think that grep's PCRE engine has any way of avoiding that.
terdon over 9 years

@KasiyA to do it without scripting, just use the grep/tr suggestion. Remember that pipes are The UNIX Way®, there's no reason to avoid them. You can't do it in grep (at least I don't think so) because grep will start matching again where the last match ended, which means that after the first hit, everything will start with a ".
fiatux over 9 years

which is why I wrote that the expression must consume the trailing quote.
terdon over 9 years

@glennjackman exactly. Do you have any idea if that's possible in grep?