Regular expression to detect semi-colon terminated C++ for & while loops

124,619

Solution 1

You could write a little, very simple routine that does it, without using a regular expression:

  • Set a position counter pos so that is points to just before the opening bracket after your for or while.
  • Set an open brackets counter openBr to 0.
  • Now keep incrementing pos, reading the characters at the respective positions, and increment openBr when you see an opening bracket, and decrement it when you see a closing bracket. That will increment it once at the beginning, for the first opening bracket in "for (", increment and decrement some more for some brackets in between, and set it back to 0 when your for bracket closes.
  • So, stop when openBr is 0 again.

The stopping positon is your closing bracket of for(...). Now you can check if there is a semicolon following or not.

Solution 2

This is the kind of thing you really shouldn't do with a regular expression. Just parse the string one character at a time, keeping track of opening/closing parentheses.

If this is all you're looking for, you definitely don't need a full-blown C++ grammar lexer/parser. If you want practice, you can write a little recursive-decent parser, but even that's a bit much for just matching parentheses.

Solution 3

This is a great example of using the wrong tool for the job. Regular expressions do not handle arbitrarily nested sub-matches very well. What you should do instead is use a real lexer and parser (a grammar for C++ should be easy to find) and look for unexpectedly empty loop bodies.

Solution 4

Try this regexp

^\s*(for|while)\s*
\(
(?P<balanced>
[^()]*
|
(?P=balanced)
\)
\s*;\s

I removed the wrapping \( \) around (?P=balanced) and moved the * to behind the any not paren sequence. I have had this work with boost xpressive, and rechecked that website (Xpressive) to refresh my memory.

Solution 5

A little late to the party, but I think regular expressions are not the right tool for the job.

The problem is that you'll come across edge cases which would add extranous complexity to the regular expression. @est mentioned an example line:

for (int i = 0; i < 10; doSomethingTo("("));

This string literal contains an (unbalanced!) parenthesis, which breaks the logic. Apparently, you must ignore contents of string literals. In order to do this, you must take the double quotes into account. But string literals itself can contain double quotes. For instance, try this:

for (int i = 0; i < 10; doSomethingTo("\"(\\"));

If you address this using regular expressions, it'll add even more complexity to your pattern.

I think you are better off parsing the language. You could, for instance, use a language recognition tool like ANTLR. ANTLR is a parser generator tool, which can also generate a parser in Python. You must provide a grammar defining the target language, in your case C++. There are already numerous grammars for many languages out there, so you can just grab the C++ grammar.

Then you can easily walk the parser tree, searching for empty statements as while or for loop body.

Share:
124,619
Thomi
Author by

Thomi

Impatient software developer - Programming professionally by day, and mixing up stupendous projects of awesomeness by night!

Updated on July 23, 2020

Comments

  • Thomi
    Thomi almost 4 years

    In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:

    for (int i = 0; i < 10; i++);
    

    ... but not this:

    for (int i = 0; i < 10; i++)
    

    This looks trivial at first glance, until you realise that the text between the opening and closing parenthesis may contain other parenthesis, for example:

    for (int i = funcA(); i < funcB(); i++);
    

    I'm using the python.re module. Right now my regular expression looks like this (I've left my comments in so you can understand it easier):

    # match any line that begins with a "for" or "while" statement:
    ^\s*(for|while)\s*
    \(  # match the initial opening parenthesis
        # Now make a named group 'balanced' which matches a balanced substring.
        (?P<balanced>
            # A balanced substring is either something that is not a parenthesis:
            [^()]
            | # …or a parenthesised string:
            \( # A parenthesised string begins with an opening parenthesis
                (?P=balanced)* # …followed by a sequence of balanced substrings
            \) # …and ends with a closing parenthesis
        )*  # Look for a sequence of balanced substrings
    \)  # Finally, the outer closing parenthesis.
    # must end with a semi-colon to match:
    \s*;\s*
    

    This works perfectly for all the above cases, but it breaks as soon as you try and make the third part of the for loop contain a function, like so:

    for (int i = 0; i < 10; doSomethingTo(i));
    

    I think it breaks because as soon as you put some text between the opening and closing parenthesis, the "balanced" group matches that contained text, and thus the (?P=balanced) part doesn't work any more since it won't match (due to the fact that the text inside the parenthesis is different).

    In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:

    REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
    ^\s*(for|while)\s*
    \(  # match the initial opening parenthesis
        # Now make a named group 'balanced' which matches
        # a balanced substring.
        (?P<balanced>
            # A balanced substring is either something that is not a parenthesis:
            [^()]
            | # …or a parenthesised string:
            \( # A parenthesised string begins with an opening parenthesis
                (?P=balanced)* # …followed by a sequence of balanced substrings
            \) # …and ends with a closing parenthesis
        )*  # Look for a sequence of balanced substrings
    \)  # Finally, the outer closing parenthesis.
    # must end with a semi-colon to match:
    \s*;\s*"""
    
    REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE| re.VERBOSE)
    

    Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.

  • JaredPar
    JaredPar about 15 years
    +1, Strictly speaking, regex's don't handle nested expressions at all. Regular expressions which handle nested expressions have transcended into context free grammars.
  • Frank
    Frank about 15 years
    I agree with using flex/yacc or similar. But is a C++ grammar really easy to find? Does anyone have a link? I remember the folks from CDT/Eclipse had a hard to actually parsing the C++ input correctly and fast.
  • Greg Hewgill
    Greg Hewgill about 15 years
    Perhaps not; C++ is of course notoriously difficult to parse. Since the original question doesn't require full semantic analysis of the input source, a simpler, incomplete parser could probably do the job just as well.
  • Frank
    Frank about 15 years
    That's probably not sufficient because people do split for() statements over multiple lines.
  • Bill Perkins
    Bill Perkins about 15 years
    Actually, with boost:xpressive and possibly python, you can have regexp that perform balanced paren matching.
  • Bill Perkins
    Bill Perkins about 15 years
    Actually, with boost:xpressive, you can have regexp that perform balanced paren matching.
  • Amit Patil
    Amit Patil about 15 years
    +1. Of course we're talking Python here so the syntax is trivially different. But if you're not actually parsing the C properly, there's no reason to look for anything other than ‘);’ at the end of a ‘for’ line.
  • Thomi
    Thomi about 15 years
    dehmann is correct - the idea is that the pattern matches examples from a real code-base, so it must be able to handle all valid for loop constructs, including multi-line ones.
  • Martin York
    Martin York about 15 years
    You also need to take into account comments and strings, both of which will throw of this algorithm.
  • Frank
    Frank about 15 years
    You can remove the comments and strings beforehand with a regular expression. :) Or introduce more variables like openBr, that indicate if you're inside a comment (and what type of comment, so you know what character closes it) or a string.
  • est
    est about 13 years
    Troll line: for (int i = 0; i < 10; doSomethingTo("("));
  • onetwopunch
    onetwopunch over 10 years
    I implemented a similar algorithm to parse functions from C, but i'm running into issues with brackets inside preprocessor directives like #ifdef. Any ideas on how to solve this?
  • ocodo
    ocodo over 10 years
    You have to check for each special case. This starts to become a more hefty parsing problem when you throw specials at it.
  • pilau
    pilau over 9 years
    Here's a lightweight Javascript implementation of Frank's algorithm, if anyone's interested
  • MokiNex
    MokiNex about 6 years
    @Frank could help about that I have the same problem with the Bracket stackoverflow.com/questions/49381553/…
  • oo92
    oo92 over 2 years
    Your approach would and does fail on this: (}(}