What literal characters should be escaped in a regex?

Solution 1

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped. There are some corner cases though:

  • - needs no escaping if placed at the very start or the very end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a shorthand character class ([\w-abc]). This is what you observed with [\w-.].
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, while [a^] matches either a or ^, which is equivalent to [\^a].
  • ] needs no escaping if it's the only character in the class: []] matches the char ].
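
The corner cases above can be checked with a short script. This is only a minimal sketch: it uses Python's `re` module as a stand-in, which is an assumption on my part, since the question is about PCRE and other dialects may differ on exactly these points. The `CASES` table and the `check` helper are made up for illustration.

```python
import re

# Each entry: (pattern, characters it should match, characters it should not).
# Python's re module is used only for illustration; it is not PCRE.
CASES = [
    (r"[abc-]", "abc-", "d"),   # - at the end is a literal
    (r"[-abc]", "-abc", "d"),   # - at the start is a literal
    (r"[a\-c]", "a-c", "b"),    # escaped - in the middle is a literal
    (r"[a-c]", "abc", "-"),     # unescaped - in the middle is a range
    (r"[^a]", "b^", "a"),       # ^ at the start negates the class
    (r"[a^]", "a^", "b"),       # ^ elsewhere is a literal
    (r"[\^a]", "a^", "b"),      # same class with ^ escaped
    (r"[]]", "]", "a"),         # a lone ] is a literal
]

def check(pattern, should_match, should_not_match):
    """Return True if the class accepts and rejects the expected characters."""
    regex = re.compile(pattern)
    accepted = all(regex.fullmatch(ch) for ch in should_match)
    rejected = not any(regex.fullmatch(ch) for ch in should_not_match)
    return accepted and rejected

results = {pattern: check(pattern, yes, no) for pattern, yes, no in CASES}
```

If an engine disagrees on one of these rows, escaping the character in question is the portable choice.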

Solution 2

[\w.-]
  • The . usually means any character, but between [] it has no special meaning.
  • A - between [] indicates a range unless it is escaped or is the first or last character in the class.
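
A quick way to see both points, sketched with Python's `re` module (an assumption for illustration; the question itself concerns PCRE). The variable names are hypothetical.

```python
import re

# Python's re module stands in for PCRE here (an assumption for illustration).
token = re.compile(r"[\w.-]+")

# Inside the class, . and the trailing - are plain literals,
# so the whole dotted, hyphenated name is matched as one token.
whole = token.fullmatch("file-name_v1.2")

# Outside a class, . matches any character (except a newline by default) ...
any_char = re.fullmatch(r".", "x")
# ... but inside a class it only matches a literal dot.
dot_only = re.fullmatch(r"[.]", "x")
```

Here `whole` and `any_char` are match objects, while `dot_only` is `None` because `x` is not a dot.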

Solution 3

While there are indeed some characters that should be escaped in a regex, you're asking not about regexes in general but about character classes, where the dash is a special character.

Instead of escaping it, you could put it at the end of the class: [\w.-]

Solution 4

The full stop loses its meta meaning in the character class.

The - has special meaning in the character class. If it isn't placed at the start or at the end of the square brackets, it must be escaped. Otherwise it denotes a character range (A-Z).

You triggered another special case, however. [\w-.] works because \w does not denote a single character, so PCRE cannot possibly create a character range from it. \w is a possibly non-contiguous class of symbols, so it has no single last character that could serve as the start of a range ending at the full stop. Also, the full stop . precedes every ASCII character that \w can match. No range can be constructed, which is why - worked without escaping for you.
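
Whether the unescaped form is accepted depends on the engine: preg_match took it, Reggy did not. The sketch below probes one engine, Python's `re` module, which is my assumption and not the PCRE library discussed here. The `compiles` helper is hypothetical, and the output for the ambiguous pattern is deliberately not stated, since that is the part that varies between dialects.

```python
import re

def compiles(pattern):
    """Return True if this regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The ambiguous form from the question, plus two unambiguous spellings.
ambiguous = r"[\w-.]"
portable = [r"[\w\-.]", r"[\w.-]"]

report = {p: compiles(p) for p in [ambiguous] + portable}
for pattern, ok in report.items():
    print(pattern, "accepted" if ok else "rejected")
```

The escaped and end-of-class spellings leave nothing for the engine to interpret, so they are the safer choice when a pattern has to move between tools.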


Author: Pelle

Updated on July 09, 2022

Comments

  • Pelle
    Pelle almost 2 years

    I just wrote a regex for use with the php function preg_match that contains the following part:

    [\w-.]
    

    To match any word character, as well as a minus sign and the dot. While it seems to work in preg_match, I tried to put it into a utility called Reggy and it complains about "Empty range in char class". Trial and error taught me that this issue was solved by escaping the minus sign, turning the regex into

    [\w\-.]
    

    Since the original appears to work in PHP, I am wondering why I should or should not be escaping the minus sign, and - since the dot is also a character with a meaning in PHP - why I would not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?

    • Okonomiyaki3000
      Okonomiyaki3000 over 7 years
      Is there any reason not to use preg_quote?
    • Pelle
      Pelle over 7 years
      Probably not. But that's not why I asked the question. I was trying to learn something new about regular expressions, just using preg_quote would have the exact opposite effect. :). I do realise I tagged this PHP, but I was looking for an answer that may apply to any PCRE implementation.
    • Okonomiyaki3000
      Okonomiyaki3000 over 7 years
      I see. Then, may I suggest: github.com/php/php-src/blob/…
    • Pelle
      Pelle over 7 years
      While it still doesn't tell me "directly" what and what not to escape, and why, it does hold all the answers as to how it behaves. For reference, a mirror of the official source: github.com/luvit/pcre2/tree/master/src
  • Pelle
    Pelle about 13 years
    Very comprehensive answer, thanks. One question about []]: If you have only one character in the class, why not specify it as \]? (i.e. not between brackets)
  • Your Common Sense
    Your Common Sense about 13 years
    @Pelle "why not" is another question, irrelevant one. "There is more than one way to do it" is a motto of inventor of preg ;)
  • Pelle
    Pelle about 13 years
    Does the . really mean 'any character' while in a character class? (i.e. between brackets)
  • Bart Kiers
    Bart Kiers about 13 years
    @Pelle, thanks. True, you could (or should?) simply use \] instead of a character class, but I wanted to mention that many regex implementations allow []] to match a literal ]. You don't even need to escape the ], since it is only a meta character inside a character class. Outside of it, only [ needs to be escaped from the two square brackets (but escaping ] doesn't hurt!).
  • bw_üezi
    bw_üezi about 13 years
    @Pelle that's true. I'm just editing the answer. most of the answers got that wrong ;-)
  • AFA Med
    AFA Med about 6 years
    The character used to wrap/delimit the RegExp must be escaped, typically '/'.
  • Bart Kiers
    Bart Kiers about 6 years
    @AFAMed, that is a language restriction, not specific to regex itself.
  • AFA Med
    AFA Med about 6 years
    I totally agree with you, I added this small note because the question is tagged with PHP.