Remove C and C++ comments using Python?

c++ python c regex comments

42,203

Solution 1

I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be invoked with the following code:

import subprocess

# source_code is a string holding the C/C++ source code.
# "-f" tells sed to read its commands from the script file.
result = subprocess.run(
    ['sed', '-f', '/path/to/remccoms3.sed'],
    input=source_code,    # fed to sed on standard input
    capture_output=True,  # collect sed's standard output
    text=True,            # exchange str rather than bytes
)
return_code = result.returncode

stripped_code = result.stdout

In this program, source_code is the variable holding the C/C++ source code, and stripped_code will hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could instead pass open file handles through the stdin and stdout arguments of subprocess.run, in place of input and capture_output (the input file opened in read mode, the output file in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.

This will probably be better than a pure Python solution; no need to reinvent the wheel.

Solution 2

This handles C++-style comments, C-style comments, strings and simple nesting thereof.

import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            # The match is a comment.
            return " " # note: a space and not an empty string
        else:
            # The match is a string or character literal; keep it as-is.
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

Strings need to be included, because comment markers inside them do not start a comment.

Edit: re.sub didn't take any flags, so had to compile the pattern first.

Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.

Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.
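
As a quick check of the behaviour described in the edits above, here is a minimal sketch that runs the function on a small input. The sample string is made up for illustration; the function is the one from this answer (with the replacer condensed), repeated only so the snippet runs on its own.

```python
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        # Comments start with '/'; string and character literals do not.
        return " " if s.startswith('/') else s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

# Hypothetical sample input, not taken from the original question.
sample = 'int/**/x=5; // set x\nchar *p = "//not a comment"; /* gone */\n'
result = comment_remover(sample)
print(result)
```

The empty comment between int and x becomes a single space, both real comments disappear, and the string literal containing // is left alone.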

Solution 3

C (and C++) comments cannot be nested. Regular expressions work well:

//.*?\n|/\*.*?\*/

This requires the “Single line” flag (re.S, also spelled re.DOTALL) because a C comment can span multiple lines.

import re

def stripcomments(text):
    # Raw string so the backslashes reach the regex engine unchanged.
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)

This code should work.

/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a classic Mac OS text file, which ends lines with a bare \r. However, this can be amended relatively easily:

//.*?(\r\n?|\n)|/\*.*?\*/

This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).

/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
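
To make both points concrete, here is a small sketch using the line-ending-aware regex from this answer. The two input strings are invented for illustration (the second borrows the uncPath example from Brian's comment).

```python
import re

def stripcomments(text):
    # Regex from this answer, with the line-ending fix applied.
    return re.sub(r'//.*?(\r\n?|\n)|/\*.*?\*/', '', text, flags=re.S)

# Hypothetical inputs: one mixes Windows, classic Mac and Unix line
# endings, the other hides a comment marker inside a string literal.
mixed_endings = 'a // c\r\nb // d\rc // e\nd'
with_string = 'string uncPath = "//some_path";\n'

print(repr(stripcomments(mixed_endings)))
print(repr(stripcomments(with_string)))
```

The first call removes every comment whatever the line ending, but note that the line ending is consumed together with the comment. The second call cuts the string literal in half, which is the failure the comments point out.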

Solution 4

Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...

" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"

"And escaped double quotes at the end of a string\""

aa '\\
n' OK
aa "\""
aa "\
\n"

This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C flag)
The C++/C99 comment number 1 has finished.

This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.

This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++  comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line; the line splicing doesn't care how many there are, but the subsequent processing might. Writing a single regex to handle all these cases will be non-trivial (though not impossible).
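
One way to approximate the ordering described above in Python is to splice lines first and only then strip comments. The following is a rough sketch, not SCC's actual method: the function names are hypothetical, the comment regex is borrowed from Solution 2, trigraphs are ignored, and splicing changes the line numbering of the output.

```python
import re

# Comment/literal pattern borrowed from Solution 2.
COMMENT_OR_LITERAL = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def splice_lines(text):
    # Delete each backslash that is immediately followed by a line ending,
    # joining the two physical lines into one logical line.
    return re.sub(r'\\(?:\r\n|\r|\n)', '', text)

def strip_comments_after_splicing(text):
    # Comments become a single space; string and char literals are kept.
    return COMMENT_OR_LITERAL.sub(
        lambda m: ' ' if m.group(0).startswith('/') else m.group(0),
        splice_lines(text),
    )

# Hypothetical input: comment delimiters split by backslash-newline.
sample = '/\\\n/ split C++ comment\nint x; /\\\n* split C comment *\\\n/\n'
spliced = splice_lines(sample)
result = strip_comments_after_splicing(sample)
print(repr(spliced))
print(repr(result))
```

After splicing, the split delimiters are ordinary // and /* */ markers again, so the regex can find them. Because the spliced lines are merged, this approach is a poor fit when the output must keep the original line numbers.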

Solution 5

This posting provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat, in a comment to Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)

To describe the improvement somewhat more fully: The improvement keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings by which the C/C++ comments are replaced.)

This version of the C/C++ comment removal function is suitable when you want to generate error messages to your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).

import re

def removeCCppComment( text ) :

    def blotOutNonNewlines( strIn ) :  # Return a string containing only the newline chars contained in strIn
        return "" + ("\n" * strIn.count('\n'))

    def replacer( match ) :
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/  ==> Blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..."  ==> Keep unchanged
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )

    return re.sub(pattern, replacer, text)
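
To illustrate the line-numbering claim, here is a compact sketch of the same idea with a made-up input. The function name and sample text are hypothetical; the regex and the newline-preserving replacement are the ones shown above.

```python
import re

def remove_comments_keep_lines(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            # Comment: keep only its newline characters.
            return "\n" * s.count('\n')
        return s  # String or character literal: keep unchanged.

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

# Hypothetical input with a three-line block comment.
sample = 'int a; /* one\ntwo\nthree */ int b;\n// tail\nint c;\n'
result = remove_comments_keep_lines(sample)

# "int c;" sits on the same line number before and after stripping.
print(sample.split('\n').index('int c;'), result.split('\n').index('int c;'))
```

Both indices printed are the same, so an error reported against the stripped text still points at the right line of the original file.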


Author by

TomZ

Game Developer with a background in Mathematics

Updated on July 09, 2022

Comments

TomZ almost 2 years

I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)

I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.

Ideally, I would prefer a non-naive implementation that properly handles awkward cases.
jfs over 15 years

1. use $ and re.MULTILINE instead of '\n', '\r\n', etc.
Adam Rosenfield over 15 years

This doesn't handle the case of a line ending in a backslash, which indicates a continued line, but that case is extremely rare
Brian over 15 years

You've missed the replacement blank string in the re.sub. Also, this won't work for strings. Eg. consider 'string uncPath = "//some_path";' or 'char operators[]="/*+-";' For language parsing, I think you're best off using a real parser.
nobody over 15 years

Also, as Alex Coventry mentions, simple regexes will hose string literals that happen to contain comment markers (which is perfectly legal).
Jonathan Leffler over 15 years

Your code doesn't handle abuse of comments, such as a backslash-newline in between the two start-of-comment symbols, or between the star-slash that ends a classic C-style comment. There's a strong sense in which it "doesn't matter; no-one in their right mind writes comments like that". YMMV.
Konrad Rudolph over 15 years

@Jonathan: Wow, I didn't think this would compile. Redefines the meaning of “lexeme”. By the way, are there syntax highlighters (IDEs, code editors) that support this? Neither VIM nor Visual Studio do.
Brian over 15 years

This doesn't handle escaped " chars in strings. eg: char some_punctuation_chars=".\"/"; /* comment */
sedavidw over 15 years

Yes it does. \\. will match any escaped char, including \".
Pure Jobs Inc. over 14 years

"C (and C++) comments cannot be nested." Some compilers (well, at least Borland's (free) version 5.5.1) allow nested C-style comments via a command line switch.
atikat about 14 years

Also you can preserve line numbering relative to the input file by changing the first return to: return "" + "\n" * s.count('\n') I needed to do this in my situation.
Jonathan Leffler almost 14 years

I would also add that if anyone wrote a comment with the comment start or end symbols split over lines, I'd persuade them of the error of their ways. And extending a single-line comment with a trailing backslash is also similarly evil. So, the problems here are more imaginary than real - unless you're a C compiler writer.
frankster almost 12 years

Good thinking, although it's a shame it does more than just remove comments!
Stephen Niedzielski about 11 years

Don't introduce an additional script and tool dependency to your Python script by using Sed. Choose Sed or Python, not both.
robocat about 11 years

That surely fails if there is a // or /* within a string, or within a / delimited regular expression.
robocat about 11 years

So I think it would fail on various RegExp strings (e.g. /\// or /\/*/ or /'/; //blah) and multiline strings (davidwalsh.name/multiline-javascript-strings). i.e. usable for simple code, but probably not for larger production codebases. If I had to use Python I would look for solutions using pynoceros or pynarcissus. If you can use node.js then UglifyJS2 is a good base for munging JavaScript code.
sedavidw about 11 years

@robocat True. But Regex literals are not part of the C language. If you wish to parse code with Regex literals, you could add this at the end of the Regex: |/(?:\\.|[^\\/])+/. The condition in the replacer() function would also have to be tweaked.
robocat almost 11 years

@markus-jarderot - Good point! I forgot it was C because I was looking for an ECMAScript solution! With C the regex can also fail on preprocessor statements (removing lines beginning with # is probably an easy fix for that issue though) so as it stands it doesn't solve "properly handles awkward cases". Also doesn't C have multiline strings using \ and does this handle those?
sedavidw almost 11 years

@robocat It does handle escapes, but not pre-processor statements. Neither does it handle Digraphs and trigraphs, but that is normally not a problem. For pre-processor statements, you could add |#[^\r\n]*(?:\\\r?\n[^\r\n]*)* at the end of the regex.
slottermoser almost 11 years

No it doesn't. It's looking for /** */ style java block comments, as stated in the description. It doesn't handle // or /* or even /... it isn't perfect, but it doesn't "fail", just ignores the cases you stated. It was just a reference for anyone looking for something similar.
sim642 about 8 years

This is the only response which doesn't involve an ugly hack.
Mark Smith about 6 years

This fails for me (python2 and python3) on the simple string blah "blah" with error TypeError: sequence item 1: expected string, module found.
rfportilla about 5 years

Opening up another process is not good. It is expensive and risky. I suggest sticking with pure python.
tripleee almost 5 years

But it also doesn't really answer the question.
Samuel Chen over 4 years

It's not Python. It's shell. What about on Windows?
Aman Deep almost 4 years

It leaves a newline after removing a multi line comment. any fix for this?
sedavidw almost 4 years

@AmanDeep You could add [^\S\r\n]*\r?\n? after \*/ to include whitespace up until and including the following newline, if any.
Michael Donahue almost 3 years

It's been some time since this answer was posted, but I just wanted to say that I found it extremely useful. I've been experimenting with Thiago's solution above, but wanted to note that if you're parsing C code you may want to use the following import instead of the one leveraging pygments.lexers.c_like: from pygments.lexers.c_cpp import CLexer. I'm still experimenting with this, but using the former discarded pre-processor definitions for me.
Michael Donahue almost 3 years

Here's a link to the lexers available