Regex to match all instances not inside quotes

47,306

Solution 1

Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.

The key observation here is that a word is outside quotes if an even number of quotes follows it. This can be modeled as a look-ahead assertion:

\+(?=([^"]*"[^"]*")*[^"]*$)
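As a quick check of the even-number-of-quotes idea, here is a minimal sketch that applies the look-ahead in a replace. The variable names and the sample string (no escaped quotes yet) are only illustrative, and the groups are written as non-capturing, which does not change what is matched.

```javascript
// Match "+" only when an even number of double quotes follows it.
// (?:...) groups are non-capturing; the matching logic is the same as above.
const outsideQuotes = /\+(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

const sample = '+bar+baz"not+these+"foo+bar+';
const result = sample.replace(outsideQuotes, "#");

console.log(result); // #bar#baz"not+these+"foo#bar#
```

Every + inside the quoted run is followed by exactly one quote, so the look-ahead fails there and the character is left alone.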

Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. Once you arrive at either a backslash or a quote, you skip the next character if it was a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at

\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)

I admit it is a little cryptic. =)
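To make it a little less cryptic, here is a small sketch that runs the escape-aware expression against the sample input from the question. It uses the non-capturing form (the one Azmisov posts in the comments) so that the same regex also behaves with .split(); the variable names and the second sample string are just for illustration.

```javascript
// Escape-aware look-ahead, with every group made non-capturing so that
// split() does not inject captured text into its result array.
const plusOutsideQuotes =
  /\+(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

// Sample input from the question; "\\" in the JS literal is one backslash.
const input = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
const replacedInput = input.replace(plusOutsideQuotes, "#");
console.log(replacedInput); // #bar#baz"not+or\"+or+\"this+"foo#bar#

// The same regex used as a separator.
const pieces = 'a+b"c+d"e+f'.split(plusOutsideQuotes);
console.log(pieces); // [ 'a', 'b"c+d"e', 'f' ]
```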

Solution 2

Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.

There happens to be a simple, general solution that wasn't mentioned.

Compared with alternatives, the regex for this solution is amazingly simple:

"[^"]+"|(\+)

The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:

<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
var replaced = subject.replace(regex, function(m, group1) {
    if (!group1) return m;
    else return "#";
});
document.write(replaced);
</script>

Online demo

You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
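For the split case, one possible sketch is to walk the matches and cut the string only at the matches that filled Group 1. The helper name splitOutsideQuotes is hypothetical, and String.prototype.matchAll assumes an ES2020 or later engine.

```javascript
// Hypothetical helper: split `text` on "+" characters outside double quotes,
// reusing the same "match but ignore quoted runs" regex as above.
function splitOutsideQuotes(text) {
  const regex = /"[^"]+"|(\+)/g;
  const parts = [];
  let start = 0;
  for (const match of text.matchAll(regex)) {
    // Group 1 is unset when the quoted branch matched: keep that text in place.
    if (match[1] === undefined) continue;
    parts.push(text.slice(start, match.index));
    start = match.index + match[0].length;
  }
  parts.push(text.slice(start)); // whatever follows the last separator
  return parts;
}

console.log(splitOutsideQuotes('a+b"c+d"e+f')); // [ 'a', 'b"c+d"e', 'f' ]
```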

Hope this gives you a different idea of a very general way to do this. :)

What about Empty Strings?

The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:

"[^"]*"|(\+)

See demo.
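To see why the quantifier matters, here is a small comparison on a made-up input that contains an empty string. The sample text and the helper name replaceOutside are assumptions for illustration only.

```javascript
// Hypothetical helper: replace every Group 1 match with "#", keep the rest.
function replaceOutside(regex, text) {
  return text.replace(regex, (m, group1) => (group1 === undefined ? m : "#"));
}

const withPlus = /"[^"]+"|(\+)/g; // cannot consume an empty string ""
const withStar = /"[^"]*"|(\+)/g; // consumes "" as a quoted run

// Made-up input: an empty string, then a quoted "+".
const sample = 'x""y"+"z';

// With +, the empty "" is skipped, so the regex pairs up the wrong quotes
// ("y") and then treats the quoted + as if it were outside.
console.log(replaceOutside(withPlus, sample)); // x""y"#"z
console.log(replaceOutside(withStar, sample)); // x""y"+"z
```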

What about Escaped Quotes?

Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.

Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"

The resulting expression has three branches:

  1. \\" to match and ignore
  2. "(?:\\"|[^"])*" to match and ignore
  3. (\+) to match, capture and handle

Note that in other regex flavors, we could do this job more easily with lookbehind, but JS didn't support it when this answer was written (lookbehind assertions were only added to the language in ES2018).

The full regex becomes:

\\"|"(?:\\"|[^"])*"|(\+)

See regex demo and full script.
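As a rough sketch of the three-branch regex in use, the snippet below runs it on the sample input from the question and on a second, made-up input with a straggling escaped quote. The function name hashOutsideQuotes is hypothetical.

```javascript
// Branch 1 and 2 match text to ignore; only branch 3 fills Group 1.
const ignoreQuoted = /\\"|"(?:\\"|[^"])*"|(\+)/g;

// Hypothetical helper: turn every "+" outside quotes into "#".
function hashOutsideQuotes(text) {
  return text.replace(ignoreQuoted, (match, group1) =>
    group1 === undefined ? match : "#");
}

// Sample from the question ("\\" in the JS literal is a single backslash).
console.log(hashOutsideQuotes('+bar+baz"not+or\\"+or+\\"this+"foo+bar+'));
// #bar#baz"not+or\"+or+\"this+"foo#bar#

// Made-up input with an escaped quote outside any quoted string.
console.log(hashOutsideQuotes('a\\"+b"c+d"'));
// a\"#b"c+d"
```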

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...

Solution 3

You can do it in three steps.

  1. Use a regex global replace to extract all string body contents into a side-table.
  2. Do your comma translation
  3. Use a regex global replace to swap the string bodies back

Code below

// Step 1
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
      var index = sideTable.length;
      sideTable[index] = _;
      return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
      return sideTable[index];
    });

If you run that after setting

myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';

you should get

{:a "ab,cd, efg"
 :b "ab,def, egf,"
 :c "Conjecture"}

It works, because after step 1,

myString = '{:a "0", :b "1", :c "2"}'
sideTable = ["ab,cd, efg", "ab,def, egf,", "Conjecture"];

so the only commas in myString are outside strings. Step 2 then turns commas into newlines:

myString = '{:a "0"\n :b "1"\n :c "2"}'

Finally we replace the strings that only contain numbers with their original content.
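If you need this more than once, the three steps can be wrapped in a function. This is only a sketch: the name replaceOutsideStrings and its parameters are made up here, and it assumes the pattern you pass in never touches double quotes or the digits of the placeholders.

```javascript
// Hypothetical wrapper around the three steps above.
function replaceOutsideStrings(text, pattern, replacement) {
  const sideTable = [];

  // Step 1: park each quoted string in the side table, leaving "<index>" behind.
  const masked = text.replace(/"(?:[^"\\]|\\.)*"/g, (whole) => {
    sideTable.push(whole);
    return '"' + (sideTable.length - 1) + '"';
  });

  // Step 2: apply the caller's replacement; only text outside strings remains.
  const translated = masked.replace(pattern, replacement);

  // Step 3: swap the original strings back in by index.
  return translated.replace(/"(\d+)"/g, (_, index) => sideTable[Number(index)]);
}

const source = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
console.log(replaceOutsideStrings(source, /,/g, "\n"));
```

The call at the end reproduces the comma-to-newline example from this answer.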

Solution 4

Although the answer by zx81 seems to be the cleanest and best-performing one, it needs these fixes to correctly catch the escaped quotes:

var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

and

var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;

Also needed is the already mentioned check "group1 === undefined" in place of "!group1". The second fix in particular seems important to actually take everything asked in the original question into account.

It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.
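To illustrate both the fix and that limitation, here is a small sketch using the regex above. The function name and the second input are made up for the example.

```javascript
// Regex from this answer: a quoted run may contain any escaped character.
const fixedRegex = /"(?:[^"\\]|\\.)*"|(\+)/g;

// Hypothetical helper using the explicit undefined check.
function replacePlus(text) {
  return text.replace(fixedRegex, (m, group1) =>
    group1 === undefined ? m : "#");
}

// Escaped quotes inside a quoted string are handled.
console.log(replacePlus('+bar+baz"not+or\\"+or+\\"this+"foo+bar+'));
// #bar#baz"not+or\"+or+\"this+"foo#bar#

// Limitation: an escaped quote OUTSIDE a quoted string is taken for an
// opening quote, so the wrong + ends up being replaced.
console.log(replacePlus('a\\"+b"c+d"'));
// a\"+b"c#d"
```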

Author by

Azmisov

Updated on July 05, 2022

Comments

  • Azmisov
    Azmisov almost 2 years

    From this q/a, I deduced that matching all instances of a given regex not inside quotes, is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.

    If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solutions that would work in most, if not all, cases.

    Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.

    For Example:
    An input string of:
    +bar+baz"not+or\"+or+\"this+"foo+bar+
    replacing + with #, not inside quotes, would return:
    #bar#baz"not+or\"+or+\"this+"foo#bar#

  • Azmisov
    Azmisov almost 13 years
    Thank you! Didn't think it was possible. I understand 100% of the theory, about 60% of the regex, and I'm down to 0% when it comes to writing it on my own. Oh, well, maybe one of these days.
  • Azmisov
    Azmisov almost 13 years
    +1 for an elegant non-regex solution. The regex is a bit more flexible for what I'm doing, though.
  • Azmisov
    Azmisov almost 13 years
    Hey, is there any way to make the regex work with JavaScript's .split() method? It seems to be ignoring the global flag...
  • Azmisov
    Azmisov almost 13 years
    Nevermind, just forgot to put the ?: inside all the parentheticals: \+(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)
  • Scorpion
    Scorpion over 11 years
    +1, this is excellent. What if the quotes could be either double quotes "" or single quotes ''? Example input: +bar+baz'not+or\'+or+\"this+"foo+bar+. Also, can you add some explanation of the steps of the regex?
  • anson
    anson almost 11 years
    Tried using this in a project and it failed. I found the cause was if you had a single doublequote inside two singlequotes '"'...This would cause the number of double quotes in the string to be odd
  • Jens
    Jens almost 11 years
    For this expression, single quotes have no special meaning. It fails by design in your case.
  • jcollum
    jcollum about 10 years
    On that last regex it looks like the parens aren't matching. I see 4 opens and 6 closes.
  • Gildor
    Gildor over 9 years
    This approach is actually better than the look-ahead way suggested by @Jens. It's easier to write and has much better performance. I didn't notice and used the look-ahead way until I hit a performance issue: matching a 1.5M text took about 90 seconds with the look-ahead, while this approach needed only 600ms.
  • Gildor
    Gildor over 9 years
    Everyone, please take a look at the solution suggested by @zx81 in his answer. It's easier to write and has much better performance, if it can be used.
  • Jens
    Jens over 9 years
    Yeah, this is better =)
  • shennan
    shennan almost 9 years
    I found that this only worked when changing the 5th line of your example to if (group1 === undefined ) return m;. Worth noting that I was searching for spaces; not plus signs.
  • Pomme.Verte
    Pomme.Verte about 8 years
    How would you avoid escaped quotes using this? Is it even possible with this pattern?
  • Brian Low
    Brian Low about 8 years
    This seems to fail on double-quotes with no content "" and escaped quotes \". regex101.com/r/yR7xV5/1
  • zx81
    zx81 about 8 years
    @BrianLow You're right. The answer was meant to demonstrate the technique in the simplest way possible. I've expanded it in response to your comment (see the "What about Empty Strings?" and "What about Escaped Quotes?" sections).
  • zx81
    zx81 about 8 years
    @D.Mill Sorry for the delay, please see expanded answer.
  • justFatLard
    justFatLard over 3 years
    Thank you! I referenced your method (and this post) in my more specific solution: stackoverflow.com/a/64617472/3799617
  • Akin Hwan
    Akin Hwan over 2 years
    Doesn't this match all characters inside double quotes? I thought the question was how to match outside of quotes