Unclosed character class near index nnn

Solution 1

@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []] or [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ inside a character class is not a literal: it opens a nested class, so the [{[] near the start of your regex leaves its outer class unclosed.
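
As a rough illustration of that rule, the sketch below checks which bracket forms java.util.regex accepts. The class and method names are made up for this example and are not part of textile4j.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, for illustration only.
class BracketCheck {
    // Returns null if the regex compiles, otherwise the parser's error description.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Leading ], negated leading ], unescaped [ (as in the PHP source), escaped [
        String[] samples = { "[]]", "[^]]", "[{[]", "[{\\[]" };
        for (String s : samples) {
            String err = compileError(s);
            System.out.println(s + " -> " + (err == null ? "compiles" : err));
        }
    }
}
```

Of the four samples, only [{[] (the class as written in the PHP source) should be rejected; escaping the inner [ is enough to make it compile.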

Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:

"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"

Whether it's the best regex I have no idea, not knowing how it's being used.
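
For a quick sanity check, here is a minimal sketch that compiles the regex above and pulls the link text and URL out of a sample string. The wrapper class, the method name, and the sample input are assumptions for illustration; the group numbers follow the $text and $url positions in the original PHP pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex above, for illustration only.
class TextileLinkDemo {
    static final Pattern LINK = Pattern.compile(
        "(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))");

    // Returns {text, url} for the first Textile link found, or null if there is none.
    static String[] firstLink(String input) {
        Matcher m = LINK.matcher(input);
        // Group 3 is $text and group 5 is $url in the original pattern.
        return m.find() ? new String[] { m.group(3), m.group(5) } : null;
    }

    public static void main(String[] args) {
        String[] link = firstLink("See \"Example\":http://example.com for details.");
        System.out.println(link == null ? "no match" : link[0] + " -> " + link[1]);
    }
}
```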

Solution 2

I'm not sure exactly where your problem lies, but this might help:

In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped.

The revised expression should probably be similar to the following, in order to be Java-compatible:

(^|(?<=[\s>.\(])|[{\[]) # $pre
"                       # start
(' . $this->c . ')      # $atts
([^"]+?)                # $text
(?:\(([^)]+?)\)(?="))?  # $title
":
('.$this->urlch.'+?)    # $url
(\/)?                   # $slash
([^\w\/;]*?)            # $post
([\]}]|(?=\s|$|\)))
/x

Basically, any place where most regex flavors allow a character class like [a-z,;[\]+-] (which matches a letter a-z, a comma, a semicolon, an open or close square bracket, or a plus or minus sign), Java needs [a-z,;\[\]+-] instead: escape the [ with a \ character.
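
A small sketch of that difference, using the two classes from the paragraph above (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, for illustration only.
class EscapedBracketDemo {
    // Java-compatible form: letters a-z, comma, semicolon, [, ], + and -
    static final Pattern ESCAPED = Pattern.compile("[a-z,;\\[\\]+-]");

    // True if the whole string is a single character from the class.
    static boolean matchesOne(String s) {
        return ESCAPED.matcher(s).matches();
    }

    // True if java.util.regex refuses to compile the pattern.
    static boolean rejected(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] { "q", ",", ";", "[", "]", "+", "-", "A" }) {
            System.out.println(s + " -> " + matchesOne(s));
        }
        // The form with an unescaped [ is the one Java rejects.
        System.out.println("unescaped form rejected: " + rejected("[a-z,;[\\]+-]"));
    }
}
```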

This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
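
To show why [ has to be reserved, here is a short sketch of those constructs, using the example classes from the java.util.regex.Pattern documentation (the demo class itself is hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ClassOperatorsDemo {
    static boolean m(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Union: a nested class adds its members to the outer class.
        System.out.println(m("[a-d[m-p]]", "n"));      // n is in m-p
        System.out.println(m("[a-d[m-p]]", "f"));      // f is in neither range

        // Intersection: && keeps only characters present in both classes.
        System.out.println(m("[a-z&&[def]]", "e"));
        System.out.println(m("[a-z&&[def]]", "a"));

        // Subtraction: intersection with a negated nested class.
        System.out.println(m("[a-z&&[^m-p]]", "a"));
        System.out.println(m("[a-z&&[^m-p]]", "n"));
    }
}
```

Because a bare [ starts one of these nested classes, the parser cannot also treat it as a literal, which is why it has to be written as \[ inside a class.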

Author by

javafueled

Updated on November 15, 2020

Comments

  • javafueled
    javafueled over 3 years

    I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for a simple, not quite feature complete Java implementation, textile4j, that I'm porting to github and syncing to Maven central (the original code was written to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central).

    Unfortunately, the Textile regular expressions (while they work in the context of preg_replace_callback in PHP) fail in Java with the following exception:

    java.util.regex.PatternSyntaxException: Unclosed character class near index 217

    The statement is obvious, the solution is elusive.

    Here's the raw, multiline regex from the PHP implementation:

    return preg_replace_callback('/
        (^|(?<=[\s>.\(])|[{[]) # $pre
        "                      # start
        (' . $this->c . ')     # $atts
        ([^"]+?)               # $text
        (?:\(([^)]+?)\)(?="))? # $title
        ":
        ('.$this->urlch.'+?)   # $url
        (\/)?                  # $slash
        ([^\w\/;]*?)           # $post
        ([\]}]|(?=\s|$|\)))
        /x',callback,input);
    

    Cleverly, I got the textile class to "show me the code" being used in this regex with a simple echo that resulted in the following, rather long, regular expression:

    (^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))
    

    I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.

    I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.

    Any ideas?

    I'm also curious why PHP doesn't throw a similar error. For example, RegExr flagged one "passive subexpression" as poorly handled, but changing it didn't fix the Java exception and didn't alter the behavior in PHP. The change is shown below.

    On the $title line, swap the escaped paren with the opening paren of the group:

            (?:\(([^)]+?)\)(?="))? # $title
            ...^
            (?:(\([^)]+?)\)(?="))? # $title
            ....^
    

    Thanks, Tim

    edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...

    "(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"