Right way to escape backslash [ \ ] in PHP regex?


Solution 1

The thing is, you're using a character class, [], so it doesn't matter how many literal backslashes you put inside it: the class still matches a single backslash.

e.g. the following two regexes:

/[a]/
/[aa]/

are for all intents and purposes identical as far as the regex engine is concerned. Character classes take a list of characters and "collapse" them down to match a single character, along the lines of "for the current character being considered, is it any of the characters listed inside the []?". If you list two backslashes in the class, then it'll be "is the char a backslash or is it a backslash?".
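A minimal sketch of that "collapse" behaviour, assuming PHP's preg_match() (1 on a match, 0 otherwise). The sample subjects and variable names are made up for illustration.

```php
<?php
// Compare /[a]/ and /[aa]/ on a few hypothetical subjects.
$subjects = ['a', 'aa', 'b', ''];
$same = true;
foreach ($subjects as $s) {
    if (preg_match('/[a]/', $s) !== preg_match('/[aa]/', $s)) {
        $same = false;
    }
}
var_dump($same); // bool(true)

// The same idea with backslashes. After PHP parses the literals, PCRE receives
// the classes [\\] and [\\\\]; each one matches exactly one backslash.
$one = preg_match('/^[\\\\]$/', '\\');
$two = preg_match('/^[\\\\\\\\]$/', '\\');
var_dump($one, $two); // int(1) int(1)
```

Listing a character twice never makes the class match two characters; a quantifier such as {2} is needed for that.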

Solution 2

// PHP 5.4.1
// Either three or four \ can be used to match a '\'.
echo preg_match( '/\\\/', '\\' );        // 1
echo preg_match( '/\\\\/', '\\' );       // 1
// Match two backslashes `\\`.
echo preg_match( '/\\\\\\/', '\\\\' );   // Warning: No ending delimiter '/' found
echo preg_match( '/\\\\\\\/', '\\\\' );  // 1
echo preg_match( '/\\\\\\\\/', '\\\\' ); // 1
// Match one backslash using a character class.
echo preg_match( '/[\\]/', '\\' );       // Warning: Compilation failed: missing terminating ] (returns false)
echo preg_match( '/[\\\]/', '\\' );      // 1  
echo preg_match( '/[\\\\]/', '\\' );     // 1

When three backslashes are used to match a '\' and the next token in the pattern is \s, the pattern below ends up with four backslashes followed by 's', and it is interpreted as: match a '\' followed by an 's'.

echo preg_match( '/\\\\s/', '\\ ' );    // 0  
echo preg_match( '/\\\\s/', '\\s' );    // 1  

When four backslashes are used to match a '\' and the next token in the pattern is \s, the pattern below is interpreted as: match a '\' followed by a whitespace character.

echo preg_match( '/\\\\\s/', '\\ ' );   // 1
echo preg_match( '/\\\\\s/', '\\s' );   // 0

The same applies inside a character class.

echo preg_match( '/[\\\\s]/', ' ' );   // 0 
echo preg_match( '/[\\\\\s]/', ' ' );  // 1 

None of the above results are affected by enclosing the strings in double instead of single quotes.

Conclusions:
Whether inside or outside a bracketed character class, a literal backslash can be matched using just three backslashes '\\\' unless the next character in the pattern is also backslashed, in which case the literal backslash must be matched using four backslashes.

Recommendation:
Always use four backslashes '\\\\' in a regex pattern when seeking to match a backslash.
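One way to see why three and four backslashes behave alike in the simple cases is to print the string that PHP actually hands to PCRE. This is only a sketch; the variable names are illustrative.

```php
<?php
// Three vs. four backslashes in the source literal, outside and inside a class.
$threeOutside = '/\\\/';
$fourOutside  = '/\\\\/';
$threeInside  = '/[\\\]/';
$fourInside   = '/[\\\\]/';

// Printing the strings shows the text PCRE receives.
echo $threeOutside, "\n"; // /\\/
echo $fourOutside, "\n";  // /\\/
echo $threeInside, "\n";  // /[\\]/
echo $fourInside, "\n";   // /[\\]/

var_dump($threeOutside === $fourOutside); // bool(true)
var_dump($threeInside === $fourInside);   // bool(true)
```

In each pair PCRE sees two backslashes, i.e. one escaped literal backslash, which is why both spellings match.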


Solution 3

To avoid this kind of unclear code, you can use \x5c (the hex code for a backslash), like this :)

echo preg_replace( '/\x5c\w+\.php$/i', '<b>${0}</b>', __FILE__ );
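A small sketch comparing \x5c with the four-backslash form. The sample path is hypothetical and used only for illustration. Note that this relies on a single-quoted pattern: in double quotes PHP itself expands \x5c before PCRE sees it.

```php
<?php
// Hypothetical sample path; the string value is C:\www\index.php
$path = 'C:\\www\\index.php';

// Single quotes: \x is not a PHP escape here, so PCRE receives \x5c
// and reads it as the character with hex code 5C, a backslash.
$viaHex  = preg_match('/\x5c\w+\.php$/i', $path);
$viaFour = preg_match('/\\\\\w+\.php$/i', $path);

// Double quotes: PHP turns \x5c into a backslash first, so PCRE receives
// \\w+ (a literal backslash followed by one or more "w") and the match fails.
$viaDouble = preg_match("/\x5c\w+\.php\$/i", $path);

var_dump($viaHex, $viaFour, $viaDouble); // int(1) int(1) int(0)
```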

Solution 4

I studied this years ago. It's because the 1st backslash escapes the 2nd one, and together they form a 'true backslash' character in the pattern, and this true one escapes the 3rd one. So it magically makes 3 backslashes work.

However, the usual suggestion is to use 4 backslashes instead of the ambiguous 3.

If I'm wrong about anything, please feel free to correct me.

Solution 5

The answer https://stackoverflow.com/a/15369828/2311074 is very illustrative, but if you don't know the core problem of backslashes in PHP strings you won't understand it at all.

The core problem of backslashes in PHP strings is explained at https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.single You may want to pay attention to the last two sentences:

The simplest way to specify a string is to enclose it in single quotes (the character ').

To specify a literal single quote, escape it with a backslash (\). To specify a literal backslash, double it (\\). All other instances of backslash will be treated as a literal backslash

So in short, two backslashes in a string represent a literal backslash. A single backslash not followed by a ' also represents a literal backslash.

This is a bit odd, but it means the strings '\\xxx' and '\xxx' both represent the same string \xxx.
Note that '\\'xxx' is an invalid string, whereas '\'xxx' represents the string 'xxx.

I guess it originates from this: if you want a literal single quote, you need to escape it with a backslash, so 'hi\'' represents the string hi'. But then you may want to create the string hi\, and 'hi\' no longer works (it is an invalid string with no closing '). Therefore an extra escape was needed to remove the special meaning of \. It was decided that \ escapes \, so hi\ can be written as 'hi\\'.

And this is the reason why three backslashes followed by an ordinary character are the same as four, e.g. '\\\x' is the same as '\\\\x' (both represent \\x), and for those two strings it does not matter at all which you use.

However, this has the surprising effect that if you double the number of backslashes, the results are no longer the same. 3 backslashes enclosed in single quotes represent 2 literal backslashes, but 6 backslashes enclosed in single quotes represent only 3 literal backslashes, whereas 4 backslashes enclosed in single quotes represent 2 literal backslashes and 8 backslashes enclosed in single quotes represent 4 (see the examples from MikeM). Thus, it's recommended to always use 4 instead of 3.
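The counting above can be checked with a short sketch. Each literal ends in an "x" (an arbitrary filler character chosen for this example) so that the last backslash never sits directly before the closing quote.

```php
<?php
$three = '\\\x';       // 3 backslashes in the source -> 2 backslashes + x
$four  = '\\\\x';      // 4 backslashes in the source -> 2 backslashes + x
$six   = '\\\\\\x';    // 6 backslashes in the source -> 3 backslashes + x
$eight = '\\\\\\\\x';  // 8 backslashes in the source -> 4 backslashes + x

var_dump($three === $four); // bool(true)
var_dump(strlen($three));   // int(3)
var_dump(strlen($six));     // int(4)
var_dump(strlen($eight));   // int(5)
```

Doubling the three-backslash form gives a different string than doubling the four-backslash form, which is the asymmetry described above.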

Author by

Admin

Updated on December 06, 2021

Comments

  • Admin
    Admin 10 months

    Just out of curiosity, I'm trying to figure out which exactly is the right way to escape a backslash for use in a PHP regular expression pattern like so:

    TEST 01: (3 backslashes)

    $pattern = "/^[\\\]{1,}$/";
    $string = '\\';
    // ----- RETURNS A MATCH -----
    

    TEST 02: (4 backslashes)

    $pattern = "/^[\\\\]{1,}$/";
    $string = '\\';
    // ----- ALSO RETURNS A MATCH -----
    

    According to the articles below, 4 is supposedly the right way but what confuses me is that both tests returned a match. If both are right, then is 4 the preferred way?

    RESOURCES:

  • Admin
    Admin over 10 years
    So in both cases, the regex engine considers it a single backslash?
  • Marc B
    Marc B over 10 years
    [\] would be an escape of the closing bracket. [\\] would be a backslash in a character class. A single-char class is rather pointless; it'd be no different than just having a bare `\\`.
  • CMCDragonkai
    CMCDragonkai almost 9 years
    When I try [\], I always get Message: preg_match(): Compilation failed: missing terminating ] for character class at offset 3
  • Lightness Races in Orbit
    Lightness Races in Orbit over 8 years
    -1: "and this true one escapes the 3rd one" Nope. Only one pass is performed. The third backslash "escapes" the ] (which just results in the ] on its own).
  • Scott Chu
    Scott Chu over 8 years
    @ Lightness: Then why '/(\\\r)\1+/' will match repeated '\' and 'r' (2 true characters, I mean)? Can you explain?
  • Lightness Races in Orbit
    Lightness Races in Orbit over 8 years
    \r is an escape sequence; \] is not.
  • Alex Skrypnyk
    Alex Skrypnyk over 5 years
    I just want to say huge thank you for this. Escaping escape characters like \n is a pain already, but doing it in regex with lookbehind is a challenge.
  • Cholthi Paul Ttiopic
    Cholthi Paul Ttiopic over 4 years
    Avoiding a backslash only to replace it with another three characters and a backslash itself again. Phew!
  • Bjarke
    Bjarke about 1 year
    This unfortunately won't escape backslashes, you'll still need \\\\ to match \ in your search string.
  • Gershom Maes
    Gershom Maes 11 months
    @MarcB For me, preg_replace('/[\\]/', '/', 'a\\b\\c.txt') still results in the "compilation failed" error.
  • Alex78191
    Alex78191 10 months
    it's not true. you must escape backslashes
  • Alex78191
    Alex78191 10 months
    @Bjarke why not \\\ ?
  • Alex78191
    Alex78191 10 months
    you should write [\\\\]
  • Alex78191
    Alex78191 10 months
    @CholthiPaulTtiopic two backslashes wouldn't work, you should write 4 backslashes. onlinegdb.com/3IclPtzxW
  • Alex78191
    Alex78191 10 months
    in Python the same