Remove C and C++ comments using Python?


Solution 1

I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be run with the following code:

import subprocess

# source_code is a string with the source code.
# capture_output and text require Python 3.7 or later.
process = subprocess.run(
    ['sed', '-f', '/path/to/remccoms3.sed'],  # -f reads the sed script from a file
    input=source_code, capture_output=True, text=True, check=True)

stripped_code = process.stdout

In this program, source_code is the variable holding the C/C++ source code, and stripped_code will hold the C/C++ code with the comments removed. Of course, if you have the file on disk, you could instead pass file handles pointing to those files as stdin and stdout (input in read mode, output in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.

This will probably be better than a pure Python solution; no need to reinvent the wheel.

Solution 2

This handles C++-style comments, C-style comments, strings and simple nesting thereof.

import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " " # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

Strings need to be included in the pattern, because comment markers inside them do not start a comment.

Edit: re.sub didn't take any flags, so had to compile the pattern first.

Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.

Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.
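A quick check of the cases mentioned in the edits, as a minimal sketch. The function is repeated so the snippet runs on its own, and the sample inputs are made up for illustration.

```python
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        # Comments start with '/'; string and character literals are kept as-is.
        return " " if s.startswith('/') else s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

# Hypothetical sample inputs, chosen to exercise the awkward cases.
samples = [
    'char *p = "//not a comment"; // real comment',  # marker inside a string
    "char q = '\"'; /* block */ int y;",              # quote inside a char literal
    'int/**/x=5;',                                    # comment used as a separator
]
results = [comment_remover(s) for s in samples]
for r in results:
    print(repr(r))
```

The string and character literals come back unchanged, and each comment is replaced by a single space, so the last sample becomes int x=5;.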

Solution 3

C (and C++) comments cannot be nested. Regular expressions work well:

//.*?\n|/\*.*?\*/

This requires the “single line” flag (re.S, also spelled re.DOTALL in Python) because a C comment can span multiple lines.

import re

def stripcomments(text):
    # Raw string, so the backslashes reach the regex engine unchanged.
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)

This code should work.

/EDIT: Notice that my above code actually makes an assumption about line endings! This code won't work on a Mac text file. However, this can be amended relatively easily:

//.*?(\r\n?|\n)|/\*.*?\*/

This regular expression should work on all text files, regardless of their line endings (covers Windows, Unix and Mac line endings).

/EDIT: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well but it only handles double-quoted strings.
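To make both edits concrete, here is a small sketch that runs the line-ending-aware pattern on the same snippet with three line endings, and then on one of the string examples from the comments. The sample inputs are illustrative only.

```python
import re

# Pattern from the text above: // comments ending in \r\n, \r or \n, plus /* */ comments.
PATTERN = re.compile(r'//.*?(\r\n?|\n)|/\*.*?\*/', re.S)

def stripcomments(text):
    return PATTERN.sub('', text)

# The same two-line snippet with Unix, Windows and classic Mac line endings.
for eol in ('\n', '\r\n', '\r'):
    source = 'int a; // first' + eol + 'int b; /* second */' + eol
    print(repr(stripcomments(source)))

# The string-literal pitfall: the "//" inside the string is treated as a comment.
broken = stripcomments('string uncPath = "//some_path";\n')
print(repr(broken))
```

Note that this pattern also consumes the line ending that terminates a // comment, so the following line is joined onto the current one. The last call shows why strings need separate handling: everything from the // inside the literal to the end of the line is removed.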

Solution 4

Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...

" */ /* SCC has been trained to know about strings /* */ */"!
"\"Double quotes embedded in strings, \\\" too\'!"
"And \
newlines in them"

"And escaped double quotes at the end of a string\""

aa '\\
n' OK
aa "\""
aa "\
\n"

This is followed by C++/C99 comment number 1.
// C++/C99 comment with \
continuation character \
on three source lines (this should not be seen with the -C flag)
The C++/C99 comment number 1 has finished.

This is followed by C++/C99 comment number 2.
/\
/\
C++/C99 comment (this should not be seen with the -C flag)
The C++/C99 comment number 2 has finished.

This is followed by regular C comment number 1.
/\
*\
Regular
comment
*\
/
The regular C comment number 1 has finished.

/\
\/ This is not a C++/C99 comment!

This is followed by C++/C99 comment number 3.
/\
\
\
/ But this is a C++/C99 comment!
The C++/C99 comment number 3 has finished.

/\
\* This is not a C or C++  comment!

This is followed by regular C comment number 2.
/\
*/ This is a regular C comment *\
but this is just a routine continuation *\
and that was not the end either - but this is *\
\
/
The regular C comment number 2 has finished.

This is followed by regular C comment number 3.
/\
\
\
\
* C comment */

This does not illustrate trigraphs. Note that a line can end with several backslashes: line splicing only removes the final backslash together with the newline, however many precede it, but the subsequent processing may care about the ones that remain. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).
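One way to approximate this in Python, as a rough sketch rather than a port of SCC: splice backslash-newline pairs first, then apply the regex from Solution 2. The function names are hypothetical. The sketch ignores trigraphs, and splicing changes both the line numbering and the layout of the remaining code.

```python
import re

def splice_lines(text):
    # Mimic translation phase 2: delete each backslash that is
    # immediately followed by a newline (LF or CRLF).
    return re.sub(r'\\\r?\n', '', text)

# Comment and literal pattern borrowed from Solution 2.
COMMENT_OR_LITERAL = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE
)

def strip_comments_after_splicing(text):
    def replacer(match):
        s = match.group(0)
        # Replace comments with a space; keep string and char literals.
        return " " if s.startswith('/') else s
    return COMMENT_OR_LITERAL.sub(replacer, splice_lines(text))

# Comment markers split across lines, as in the test data above.
tricky = 'a = 1; /\\\n/ split comment marker\nb = 2; /\\\n* block *\\\n/ c = 3;\n'
result = strip_comments_after_splicing(tricky)
print(repr(result))
```

If the original line structure matters, the splicing step would need to record where the removed newlines were, which this sketch does not attempt.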

Solution 5

This posting provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat, in a comment to Markus Jarderot's posting. (Thanks to both for providing the original code, which saved me a lot of work.)

To describe the improvement somewhat more fully: The improvement keeps the line numbering intact. (This is done by keeping the newline characters intact in the strings by which the C/C++ comments are replaced.)

This version of the C/C++ comment removal function is suitable when you want to generate error messages to your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).

import re

def removeCCppComment( text ) :

    def blotOutNonNewlines( strIn ) :  # Return a string containing only the newline chars contained in strIn
        return "" + ("\n" * strIn.count('\n'))

    def replacer( match ) :
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/  ==> Blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..."  ==> Keep unchanged
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )

    return re.sub(pattern, replacer, text)
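
A short check that the line numbering really survives, as a sketch. The function is restated under a hypothetical name so the snippet is self-contained, and the sample source is made up.

```python
import re

def remove_comments_keep_lines(text):  # hypothetical name, same technique as above
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            # Keep only the newlines that were inside the comment.
            return "\n" * s.count('\n')
        return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return pattern.sub(replacer, text)

source = 'int a; /* spans\n   three\n   lines */\nint b; // trailing\nint c;\n'
stripped = remove_comments_keep_lines(source)
print(repr(stripped))

# "int c;" is on line 5 both before and after stripping.
print(source.splitlines().index('int c;') + 1)
print(stripped.splitlines().index('int c;') + 1)
```

One trade-off: a comment without newlines is replaced by an empty string here, so int/**/x=5; would again become intx=5;. Returning a space when the comment contains no newline would avoid that.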
Author by

TomZ

Game Developer with a background in Mathematics

Updated on July 09, 2022

Comments

  • TomZ
    TomZ almost 2 years

    I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)

    I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.

    Ideally, I would prefer a non-naive implementation that properly handles awkward cases.

  • jfs
    jfs over 15 years
    1. use $ and re.MULTILINE instead of '\n', '\r\n', etc.
  • Adam Rosenfield
    Adam Rosenfield over 15 years
    This doesn't handle the case of a line ending in a backslash, which indicates a continued line, but that case is extremely rare
  • Brian
    Brian over 15 years
    You've missed the replacement blank string in the re.sub. Also, this won't work for strings. Eg. consider 'string uncPath = "//some_path";' or 'char operators[]="/*+-";' For language parsing, I think you're best off using a real parser.
  • nobody
    nobody over 15 years
    Also, as Alex Coventry mentions, simple regexes will hose string literals that happen to contain comment markers (which is perfectly legal).
  • Jonathan Leffler
    Jonathan Leffler over 15 years
    Your code doesn't handle abuse of comments, such as a backslash-newline in between the two start-of-comment symbols, or between the star-slash that ends a classic C-style comment. There's a strong sense in which it "doesn't matter; no-one in their right mind writes comments like that". YMMV.
  • Konrad Rudolph
    Konrad Rudolph over 15 years
    @Jonathan: Wow, I didn't think this would compile. Redefines the meaning of “lexeme”. By the way, are there syntax highlighters (IDEs, code editors) that support this? Neither VIM nor Visual Studio do.
  • Brian
    Brian over 15 years
    This doesn't handle escaped " chars in strings. eg: char some_punctuation_chars=".\"/"; /* comment */
  • sedavidw
    sedavidw over 15 years
    Yes it does. \\. will match any escaped char, including \".
  • Pure Jobs Inc.
    Pure Jobs Inc. over 14 years
    "C (and C++) comments cannot be nested." Some compilers (well, at least Borland's (free) version 5.5.1) allow nested C-style comments via a command line switch.
  • atikat
    atikat about 14 years
    Also you can preserve line numbering relative to the input file by changing the first return to: return "" + "\n" * s.count('\n') I needed to do this in my situation.
  • Jonathan Leffler
    Jonathan Leffler almost 14 years
    I would also add that if anyone wrote a comment with the comment start or end symbols split over lines, I'd persuade them of the error of their ways. And extending a single-line comment with a trailing backslash is also similarly evil. So, the problems here are more imaginary than real - unless you're a C compiler writer.
  • frankster
    frankster almost 12 years
    Good thinking, although it's a shame it does more than just remove comments!
  • Stephen Niedzielski
    Stephen Niedzielski about 11 years
    Don't introduce an additional script and tool dependency to your Python script by using Sed. Choose Sed or Python, not both.
  • robocat
    robocat about 11 years
    That surely fails if there is a // or /* within a string, or within a / delimited regular expression.
  • robocat
    robocat about 11 years
    So I think it would fail on various RegExp strings (e.g. /\// or /\/*/ or /'/; //blah) and multiline strings (davidwalsh.name/multiline-javascript-strings). i.e. usable for simple code, but probably not for larger production codebases. If I had to use Python I would look for solutions using pynoceros or pynarcissus. If you can use node.js then UglifyJS2 is a good base for munging JavaScript code.
  • sedavidw
    sedavidw about 11 years
    @robocat True. But Regex literals are not part of the C language. If you wish to parse code with Regex literals, you could add this at the end of the Regex: |/(?:\\.|[^\\/])+/. The condition in the replacer() function would also have to be tweaked.
  • robocat
    robocat almost 11 years
    @markus-jarderot - Good point! I forgot it was C because I was looking for an ECMAScript solution! With C the regex can also fail on preprocessor statements (removing lines beginning with # is probably an easy fix for that issue though) so as it stands it doesn't solve "properly handles awkward cases". Also doesn't C have multiline strings using \ and does this handle those?
  • sedavidw
    sedavidw almost 11 years
    @robocat It does handle escapes, but not pre-processor statements. Neither does it handle Digraphs and trigraphs, but that is normally not a problem. For pre-processor statements, you could add |#[^\r\n]*(?:\\\r?\n[^\r\n]*)* at the end of the regex.
  • slottermoser
    slottermoser almost 11 years
    No it doesn't. It's looking for /** */ style java block comments, as stated in the description. It doesn't handle // or /* or even /... it isn't perfect, but it doesn't "fail", just ignores the cases you stated. It was just a reference for anyone looking for something similar.
  • sim642
    sim642 about 8 years
    This is the only response which doesn't involve an ugly hack.
  • Mark Smith
    Mark Smith about 6 years
    This fails for me (python2 and python3) on the simple string blah "blah" with error TypeError: sequence item 1: expected string, module found.
  • rfportilla
    rfportilla about 5 years
    Opening up another process is not good. It is expensive and risky. I suggest sticking with pure python.
  • tripleee
    tripleee almost 5 years
    But it also doesn't really answer the question.
  • Samuel Chen
    Samuel Chen over 4 years
    It's not Python. It's shell. What about on Windows?
  • Aman Deep
    Aman Deep almost 4 years
    It leaves a newline after removing a multi-line comment. Any fix for this?
  • sedavidw
    sedavidw almost 4 years
    @AmanDeep You could add [^\S\r\n]*\r?\n? after \*/ to include whitespace up until and including the following newline, if any.
  • Michael Donahue
    Michael Donahue almost 3 years
    It's been some time since this answer was posted, but I just wanted to say that I found it extremely useful. I've been experimenting with Thiago's solution above, but wanted to note that if you're parsing C code you may want to use the following import instead of the one leveraging pygments.lexers.c_like: from pygments.lexers.c_cpp import CLexer. I'm still experimenting with this, but using the former discarded pre-processor definitions for me.
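
The suggestion in the comments to add an alternative for preprocessor lines can be sketched as follows. This is a variant under my own assumptions, not the exact pattern quoted above: the backslash-newline alternative is tried before the single-character one, because a greedy [^\r\n]* would consume the trailing backslash and the match would then stop at the end of the first line. Directives are kept verbatim, including any comments inside them.

```python
import re

PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/'                         # comments
    r'|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"'  # character and string literals
    r'|#(?:\\\r?\n|[^\r\n])*',                  # preprocessor lines, with continuations
    re.DOTALL | re.MULTILINE
)

def comment_remover_pp(text):  # hypothetical name
    def replacer(match):
        s = match.group(0)
        # Only comments start with '/'; literals and directives pass through.
        return " " if s.startswith('/') else s
    return PATTERN.sub(replacer, text)

source = (
    "#error don't include this header directly\n"
    "#define MAX(a, b) \\\n"
    "    ((a) > (b) ? (a) : (b))\n"
    "int x; // it's gone\n"
)
result = comment_remover_pp(source)
print(result)
```

Without the extra alternative, the apostrophe in the #error line would be read as the start of a character literal that runs up to the apostrophe in the later comment, and that comment would survive.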