Undocumented Java regex character class: \p{C}

10,648

Solution 1

Buried down in the Pattern docs under Unicode Support, we find the following:

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.

...

Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Same as scripts and blocks, categories can also be specified by using the keyword general_category (or its short form gc) as in general_category=Lu or gc=Lu.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative.
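To make the quoted syntax concrete, here is a small sketch that tries the four spellings the docs mention for one category (Lu, uppercase letter). The class and method names are made up for illustration; only `Pattern.matches` is a real API.

```java
import java.util.regex.Pattern;

class CategorySyntaxDemo {
    // Four spellings of the same general category, as listed in the Pattern docs.
    static final String[] UPPERCASE_SPELLINGS = {
        "\\p{Lu}", "\\p{IsLu}", "\\p{gc=Lu}", "\\p{general_category=Lu}"
    };

    // Returns true only if every spelling matches the whole input.
    static boolean allSpellingsMatch(String input) {
        for (String regex : UPPERCASE_SPELLINGS) {
            if (!Pattern.matches(regex, input)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allSpellingsMatch("A"));
        System.out.println(allSpellingsMatch("a"));
    }
}
```

The same prefixes should work with the single-letter category C, which is what the pattern in the question uses.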

From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.
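In Unicode terms, "Other" is the union of the general categories Cc, Cf, Cs, Co and Cn. The sketch below checks a code point against those five values using `Character.getType`, independently of the regex engine. It illustrates the Unicode definition only and makes no claim that Java's `\p{C}` covers exactly the same set. The class and method names are hypothetical.

```java
class OtherCategoryDemo {
    // True if the code point's general category is one of the five C* values.
    static boolean isOtherCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:      // Cc
            case Character.FORMAT:       // Cf
            case Character.SURROGATE:    // Cs
            case Character.PRIVATE_USE:  // Co
            case Character.UNASSIGNED:   // Cn
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isOtherCategory(0x0007));  // BEL, a control character
        System.out.println(isOtherCategory(0x200E));  // LEFT-TO-RIGHT MARK, a format character
        System.out.println(isOtherCategory(0xE000));  // first private-use code point
        System.out.println(isOtherCategory('A'));     // an ordinary letter
    }
}
```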


It probably should support \p{Other}, but apparently it doesn't.

Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:

To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.

There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
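The difference between code units and code points can be checked without any regex. This sketch (hypothetical class and method names) builds the test string from the question and inspects it with standard `String` and `Character` methods:

```java
class CodePointDemo {
    // Builds the one-character test string from the question (U+1F4A9).
    static String pileOfPoo() {
        return new String(Character.toChars(0x1F4A9));
    }

    // True if the string's first code point has General_Category=So.
    static boolean startsWithOtherSymbol(String s) {
        return Character.getType(s.codePointAt(0)) == Character.OTHER_SYMBOL;
    }

    public static void main(String[] args) {
        String poo = pileOfPoo();
        // length() counts UTF-16 code units; codePointCount() counts code points.
        System.out.println(poo.length());
        System.out.println(poo.codePointCount(0, poo.length()));
        System.out.println(startsWithOtherSymbol(poo));
    }
}
```

Read as a code point, the string holds one Other Symbol. Read as code units, it holds two values that are not characters on their own.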

Solution 2

According to https://regex101.com/, \p{C} matches

Invisible control characters and unused code points

(The backslash has to be escaped inside a Java string literal, so the string "\\p{C}" is the regex \p{C}.)
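A quick way to see the escaping at work is to print what the regex engine actually receives. A minimal sketch, with a made-up class name:

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns the pattern text as the regex engine sees it.
    static String regexSeenByEngine() {
        return Pattern.compile("\\p{C}").pattern();
    }

    public static void main(String[] args) {
        String regex = regexSeenByEngine();
        System.out.println(regex);           // the text with a single backslash
        System.out.println(regex.length());  // 5 characters: \ p { C }
    }
}
```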

I'm guessing this is a "hacked string" check, since a \p{C} character probably should never appear inside a valid, printable string. The author should have left a comment, though: what was checked and what was meant to be checked are usually two different things.

Solution 3

Anything other than a valid two-letter Unicode category code, or the single letter that begins one, is illegal, since Java supports only the one-letter and two-letter abbreviations for Unicode categories. That's why \p{Other} doesn't work here.
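One way to check which names a given JDK accepts is to try compiling them and catch `PatternSyntaxException`. The helper below is a hypothetical sketch. The result for \p{Other} reflects the behaviour reported in the question and may differ on other regex engines.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CategoryNameDemo {
    // True if the JDK's regex engine accepts the pattern text.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\p{C}"));      // one-letter abbreviation
        System.out.println(compiles("\\p{Cs}"));     // two-letter abbreviation
        System.out.println(compiles("\\p{Other}"));  // long name
    }
}
```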

\p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO.

Right. Java uses UTF-16 internally, and 💩 is encoded as two 16-bit code units (0xD83D 0xDCA9) that together form a surrogate pair: a high surrogate followed by a low surrogate. \p{C} includes the surrogate category,

\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.

so when it matches each half separately, you see two matches in the result set.
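The two halves can be inspected directly with the `Character` API. This is a small sketch with a hypothetical class name. It only shows how each code unit is classified on its own, not how the regex engine walks the string.

```java
class SurrogateDemo {
    // The UTF-16 code units for U+1F4A9.
    static char[] units() {
        return Character.toChars(0x1F4A9);
    }

    public static void main(String[] args) {
        char[] u = units();
        System.out.printf("0x%04X 0x%04X%n", (int) u[0], (int) u[1]);
        // Each half, taken alone, falls in the Cs (surrogate) category.
        System.out.println(Character.isHighSurrogate(u[0]));
        System.out.println(Character.isLowSurrogate(u[1]));
        System.out.println(Character.getType(u[0]) == Character.SURROGATE);
        // Recombined, the pair is the original code point.
        System.out.println(Character.toCodePoint(u[0], u[1]) == 0x1F4A9);
    }
}
```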

What is the likely intent of the original pattern, [\\p{C}&&\\S]?

I don't see a particularly good reason, but it seems the developer was worried about characters in the Other category (for example, to keep spammy emoji out of an email's subject) and simply tried to block them.

Solution 4

As for the bonus question: in Java, the expression [\\p{C}&&\\S] finds characters in the Other category, such as control characters, while excluding whitespace characters like tabs or line feeds. These characters have no value in regular emails, so it is a good idea to filter them out (or, as in this case, reject the email content as faulty). Note that the double backslashes (\\) are only needed to escape the expression inside a Java string literal. The regular expression itself is: [\p{C}&&\S]
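A validation method along the lines the question describes might look roughly like this. The class name, method names and exception type are assumptions for illustration. Only the pattern and the "Invalid string" message come from the question.

```java
import java.util.regex.Pattern;

class EmailStringCheck {
    // Characters in category Other that are not whitespace.
    private static final Pattern SUSPICIOUS = Pattern.compile("[\\p{C}&&\\S]");

    // True if the string contains at least one such character.
    static boolean isInvalid(String s) {
        return SUSPICIOUS.matcher(s).find();
    }

    // Hypothetical validator: rejects the string before it is sent.
    static void validate(String s) {
        if (isInvalid(s)) {
            throw new IllegalArgumentException("Invalid string");
        }
    }

    public static void main(String[] args) {
        System.out.println(isInvalid("Hello,\tworld\r\n"));  // tab, CR, LF are whitespace
        System.out.println(isInvalid("Hello\u0007"));        // BEL is a non-whitespace control character
    }
}
```

Tabs and line breaks are control characters too, but the intersection with \S removes them from the class, so ordinary multi-line text passes.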

Author by

doctaphred

Python Wrangler

Updated on July 23, 2022

Comments

  • doctaphred
    doctaphred almost 2 years

    I found an interesting regex in a Java project: "[\\p{C}&&\\S]"

    I understand that the && means "set intersection", and \S is "non-whitespace", but what is \p{C}, and is it okay to use?

    The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but they behave differently: they both match on control characters, but \p{C} matches twice on Unicode characters above U+FFFF, such as PILE OF POO:

    public class StrangePattern {
        public static void main(String[] argv) {
    
            // As far as I can tell, this is the simplest way to create a String
            // with code points above U+FFFF.
            String poo = new String(Character.toChars(0x1F4A9));
    
            System.out.println(poo);  // prints `💩`
            System.out.println(poo.replaceAll("\\p{C}", "?"));  // prints `??`
            System.out.println(poo.replaceAll("\\p{Cntrl}", "?"));  // prints `💩`
        }
    }
    

    The only mention I've found anywhere is here:

    \p{C} or \p{Other}: invisible control characters and unused code points.

    However, \p{Other} does not seem to exist in Java, and the matching code points are not unused.

    My Java version info:

    $ java -version
    java version "1.8.0_92"
    Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
    

    Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.

  • Hulk
    Hulk about 7 years
    What are the sources for the first two statements you highlighted as quotes? Would be interesting because it seems to contradict the currently top voted answer stackoverflow.com/a/44034552/2513200
  • user2357112
    user2357112 about 7 years
    @Hulk: That flag is for a different set of character classes, specifically those listed under "Predefined character classes" and "POSIX character classes (US-ASCII only)". \p{C} isn't one of those.
  • Marcono1234
    Marcono1234 over 5 years
    Related bug reports: JDK-8179668, JDK-8029966