regex: remove all text within "double-quotes" (multiline included)

Solution 1

Try this expression:

"[^"]+"

Also make sure you replace globally (in JavaScript that takes a g flag; PHP's preg_replace already replaces every match by default, so no flag is needed there).
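A minimal PHP sketch of this expression in use. The sample string is made up for illustration; the point is only that the negated class also consumes the newline inside the quotes and that every match is replaced.

```php
<?php
// Hypothetical sample input; the first quoted span covers two lines.
$text = "before \"first\nsecond\" middle \"third\" after";

// "[^"]+" : a quote, one or more non-quote characters, a closing quote.
// preg_replace() replaces all matches unless a limit argument is passed.
$replaced = preg_replace('/"[^"]+"/', '', $text);

echo $replaced, PHP_EOL; // before  middle  after
```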

Solution 2

Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.

I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:

"[^"]*"
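To illustrate why the 0-or-more variant matters for HTML, here is a small sketch using a made-up tag with an empty attribute value. With +, the empty pair "" cannot match, so the regex starts at the second quote instead and falls out of step with the real quoting.

```php
<?php
// Hypothetical HTML fragment with an empty attribute value.
$html = '<img alt="" src="a.png">';

// With +, matching begins at the second quote of alt="" and runs up to
// the quote that opens "a.png", so the wrong span is removed.
$plus = preg_replace('/"[^"]+"/', '', $html);

// With *, the empty pair "" matches on its own and the pairs stay in sync.
$star = preg_replace('/"[^"]*"/', '', $html);

echo $plus, PHP_EOL; // <img alt="a.png">
echo $star, PHP_EOL; // <img alt= src=>
```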

EDIT:

On second thought, here's a better one:

"[\S\s]*?"

This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
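A short sketch of that pattern in PHP, on an invented sample string. It also runs the equivalent dot-based form with the /s modifier to show that both give the same result here.

```php
<?php
// Hypothetical sample input; the first quoted span contains a newline.
$text = "a \"one\ntwo\" b \"three\" c";

// [\S\s] matches any character at all, newlines included; *? keeps each
// match as short as possible, so it stops at the nearest closing quote.
$lazy = preg_replace('/"[\S\s]*?"/', '', $text);

// Equivalent form: a lazy dot with /s so that . also matches newlines.
$dotall = preg_replace('/".*?"/s', '', $text);

echo $lazy, PHP_EOL;   // a  b  c
echo $dotall, PHP_EOL; // a  b  c
```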

The one below uses capture groups where none are needed, and its wildcard does not make it obvious that . matches everything except the new-line char. So it's clearer to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.


There are many regexes that can solve your problem, but here's one:

"(.*?(\s)*?)*?"

This reads as:

Find a quote, optionally followed by (any number of characters that are not new-line characters, non-greedily, followed by any number of whitespace characters, non-greedily), repeated any number of times non-greedily, and ending with a quote.

Greedy means the quantifier first grabs as much as it can (for . without DOTALL, up to the end of the line) and then tries to match the rest of the pattern. If that fails, it gives back one character and tries again, and so on. Non-greedy means it takes as few characters as possible while still satisfying the pattern.
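The difference between greedy and non-greedy matching is easiest to see on one line with two quoted spans. A minimal PHP sketch, with a made-up input string:

```php
<?php
// Hypothetical sample input with two separate quoted spans.
$text = 'x "a" y "b" z';

// Greedy: .* runs to the end of the line, then backs off to the LAST quote,
// so everything from the first quote to the last one is removed.
$greedy = preg_replace('/".*"/', '', $text);

// Non-greedy: .*? stops at the NEAREST closing quote, so each span is
// removed separately.
$lazy = preg_replace('/".*?"/', '', $text);

echo $greedy, PHP_EOL; // x  z
echo $lazy, PHP_EOL;   // x  y  z
```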

great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/

Remember that your regex may have to change slightly based on what language you're using to search using regex.

Author: siliconpi

Updated on July 31, 2022

Comments

  • siliconpi 4 months

    I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:

    $file=file_get_contents('test.html');
    $replaced = preg_replace('/"(\n.)+?"/m','', $file);
    

    I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.

    I read that newlines can be \r\n and \n as well.

  • NikiC over 11 years
    In PHP there's no g ;) That was JS you are probably remembering :) +1
  • mjec over 11 years
    Indeed the key to making this regex work is the /s (DOTALL) modifier - similar to the g flag.
  • Alan Moore over 11 years
    The layout of this answer is very confusing. If you're saying "[\S\s]*?" is better than "(.*?(\s)*?)*?", and "[^"]*" is better still, then I agree. ;)
  • Alan Moore over 11 years
    @mjec. No, the key is using "[^"]+". With this regex you don't have to worry about match modes like DOTALL or MULTILINE, or whether your quantifiers are greedy or non-greedy.
  • mjec over 11 years
    @Alan In PHP (which OP is using), without dotall i.e. /s that regex will not match across newlines.
  • Alan Moore over 11 years
    @mjec: Are you talking about "[^"]+"? Sure it will! A negated character class can always match newlines (assuming they're not among the listed characters, of course). /s changes the behavior of the . metacharacter only.
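The point under discussion in these comments can be checked with a short PHP sketch (the sample string is invented for illustration): the negated class matches across the newline with no modifier, while the dot only does so with /s.

```php
<?php
// Hypothetical sample: one quoted span that contains a newline.
$text = "\"line one\nline two\"";

// Negated class: [^"] matches any character except a quote, newline included.
$negated = preg_match('/"[^"]+"/', $text);

// Dot without /s: . does not match the newline, so nothing matches.
$dot = preg_match('/".+?"/', $text);

// Dot with /s (DOTALL): . now matches the newline as well.
$dotall = preg_match('/".+?"/s', $text);

echo $negated, ' ', $dot, ' ', $dotall, PHP_EOL; // 1 0 1
```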