Regular Expression to Match All Comments in a T-SQL Script

Solution 1

This should work:

(--.*)|(((/\*)+?[\w\W]+?(\*/)+))
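
As a quick check, here is a minimal sketch that runs the expression over a short script. It uses Python's `re` module instead of the .NET Regex class (an assumption of this example; the pattern itself is unchanged), and the sample strings are made up for illustration. It also shows a known weak spot: a `--` inside a string literal is matched as if it were a comment.

```python
import re

# Pattern from Solution 1, unchanged.
PATTERN = re.compile(r"(--.*)|(((/\*)+?[\w\W]+?(\*/)+))")

sample = (
    "-- This is Comment 1\n"
    "SELECT Foo FROM Bar\n"
    "/* This is a\n"
    "multi-line comment */\n"
    "DROP TABLE Bar\n"
)

# Every comment the pattern finds, in order of appearance.
comments = [m.group(0) for m in PATTERN.finditer(sample)]

# The script with those comments stripped out.
stripped = PATTERN.sub("", sample)

# Limitation: the pattern knows nothing about string literals,
# so the dashes inside the quotes start a "comment".
false_positive = PATTERN.search("SELECT 'a -- b' FROM t").group(0)
```

If string literals can contain comment markers in your scripts, the literal has to be matched as well, so that it can be skipped.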

Solution 2

In PHP, I'm using this code to strip comments from SQL (this is the commented version, which relies on the x modifier):

trim( preg_replace( '@
(([\'"]).*?[^\\\]\2) # $1 : Skip single & double quoted expressions
|(                   # $3 : Match comments
    (?:\#|--).*?$    # - Single line comment
    |                # - Multi line (nested) comments
     /\*             #   . comment open marker
        (?: [^/*]    #   . non comment-marker characters
            |/(?!\*) #   . not a comment open
            |\*(?!/) #   . not a comment close
            |(?R)    #   . recursive case
        )*           #   . repeat eventually
    \*\/             #   . comment close marker
)\s*                 # Trim after comments
|(?<=;)\s+           # Trim after semi-colon
@msx', '$1', $sql ) );

Short version:

trim( preg_replace( '@(([\'"]).*?[^\\\]\2)|((?:\#|--).*?$|/\*(?:[^/*]|/(?!\*)|\*(?!/)|(?R))*\*\/)\s*|(?<=;)\s+@ms', '$1', $sql ) );

Solution 3

Using this code:

StringCollection resultList = new StringCollection();
try {
    Regex regexObj = new Regex(@"/\*(?>(?:(?!\*/|/\*).)*)(?>(?:/\*(?>(?:(?!\*/|/\*).)*)\*/(?>(?:(?!\*/|/\*).)*))*).*?\*/|--.*?\r?[\n]", RegexOptions.Singleline);
    Match matchResult = regexObj.Match(subjectString);
    while (matchResult.Success) {
        resultList.Add(matchResult.Value);
        matchResult = matchResult.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

With the following input:

-- This is Comment 1
SELECT Foo FROM Bar
GO

-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO

/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'

/* This is a
multi-line comment */
DROP TABLE Bar

/* comment /* nesting */ of /* two */ levels supported */
foo...

Produces these matches:

-- This is Comment 1
-- This is
-- Comment 2
/* This is Comment 3 */
/* This is a
multi-line comment */
/* comment /* nesting */ of /* two */ levels supported */

Note that this will only match 2 levels of nested comments, although in my life I have never seen more than one level being used. Ever.
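
If deeper nesting ever does show up, a fixed-depth pattern is not enough. One alternative, swapped in here in place of a regex, is a plain depth counter. The sketch below is in Python; the function name `block_comment_end` and the sample string are hypothetical and only serve the illustration.

```python
def block_comment_end(text: str, start: int) -> int:
    """Return the index just past the block comment that opens at `start`.

    Returns -1 when the comment is never closed.
    """
    if not text.startswith("/*", start):
        raise ValueError("no block comment starts at this index")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # another level opens
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # one level closes
            i += 2
            if depth == 0:
                return i        # the outermost comment is closed
        else:
            i += 1
    return -1


# Three levels of nesting, one more than the regex above handles.
sample = "/* a /* b /* c */ d */ e */ SELECT 1"
end = block_comment_end(sample, 0)
comment, rest = sample[:end], sample[end:]
```

The counter only deals with block comments. Finding where a comment may start (that is, outside string literals and identifiers) still has to be handled separately.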

Solution 4

I made this function that removes all SQL comments, using plain regular expressions. It removes both line comments (even when there is no line break after them) and block comments (even if there are nested block comments). This function can also replace literals (useful if you are searching for something inside SQL procedures but you want to ignore strings).

My code was based on this answer (which is about C# comments), so I had to change line comments from "//" to "--", but more importantly I had to rewrite the block comments regex (using balancing groups) because SQL allows nested block comments, while C# doesn't.

There is also a "preservePositions" argument: instead of stripping out the comments, it fills them with whitespace. That's useful if you want to preserve the original position of each SQL command, in case you need to manipulate the original script while keeping its original comments.

Regex everythingExceptNewLines = new Regex("[^\r\n]");
public string RemoveComments(string input, bool preservePositions, bool removeLiterals=false)
{
    //based on https://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689

    var lineComments = @"--(.*?)\r?\n";
    var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
    // literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
    // there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
    var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
    var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
    var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
    //var blockComments = @"/\*(.*?)\*/";  // the original code was for C#, but Microsoft SQL allows nested block comments: https://msdn.microsoft.com/en-us/library/ms178623.aspx
    // so we should use balancing groups: http://weblogs.asp.net/whaggard/377025
    var nestedBlockComments = @"/\*
                                (?>
                                /\*  (?<LEVEL>)      # On opening push level
                                | 
                                \*/ (?<-LEVEL>)     # On closing pop level
                                |
                                (?! /\* | \*/ ) .          # Match any char that does not start an opening or closing marker
                                )*                         # Zero or more times, so an empty /**/ comment also matches
                                (?(LEVEL)(?!))             # If level exists then fail
                                \*/";

    string noComments = Regex.Replace(input,
            nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
        me => {
            if (me.Value.StartsWith("/*") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
            else if (me.Value.StartsWith("/*") && !preservePositions)
                return "";
            else if (me.Value.StartsWith("--") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
            else if (me.Value.StartsWith("--") && !preservePositions)
                return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
            else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
                return me.Value; // do not remove object identifiers ever
            else if (!removeLiterals) // Keep the literal strings
                return me.Value;
            else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
            {
                var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
                return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
            }
            else if (removeLiterals && !preservePositions) // wrap completely all literals
                return "''";
            else
                throw new NotImplementedException();
        },
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    return noComments;
}

Test 1 (first original, then removing comments, last removing comments/literals)

[select /* block comment */ top 1 'a' /* block comment /* nested block comment */*/ from  sys.tables --LineComment
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables --FinalLineComment]

[select                     top 1 'a'                                               from  sys.tables              
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables                   ]

[select                     top 1 ' '                                               from  sys.tables              
union
select top 1 '                                             ' from sys.tables                   ]

Test 2 (first original, then removing comments, last removing comments/literals)

Original:
[create table [/*] /* 
  -- huh? */
(
    "--
     --" integer identity, -- /*
    [*/] varchar(20) /* -- */
         default '*/ /* -- */' /* /* /* */ */ */
);
            go]


[create table [/*]    

(
    "--
     --" integer identity,      
    [*/] varchar(20)         
         default '*/ /* -- */'                  
);
            go]


[create table [/*]    

(
    "--
     --" integer identity,      
    [*/] varchar(20)         
         default '           '                  
);
            go]
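
The core idea of this answer (put literals, identifiers and comments into one alternation so that whichever construct starts first wins, then decide in a callback what to keep) carries over to other regex engines. Below is a rough Python sketch of it, not a port of the full function. Python's `re` has no balancing groups, so the block-comment branch here is deliberately NOT nested, and the names `remove_comments` and `preserve_positions` are only illustrative.

```python
import re

# One alternation: at any position, the construct that starts there is consumed whole.
TOKEN = re.compile(
    r"""
      (?P<comment>
          /\* .*? \*/             # block comment (NOT nested, unlike the C# version)
        | -- [^\r\n]*             # line comment, with or without a trailing newline
      )
    | ' (?: '' | [^'] )* '        # string literal, doubled quote as escape
    | \[ (?: \]\] | [^\]] )* \]   # bracketed identifier such as [my table]
    | " (?: "" | [^"] )* "        # quoted identifier (double quotes)
    """,
    re.VERBOSE | re.DOTALL,
)

NOT_NEWLINE = re.compile(r"[^\r\n]")


def remove_comments(sql: str, preserve_positions: bool = False) -> str:
    def replace(match: re.Match) -> str:
        if match.group("comment") is None:
            return match.group(0)  # literal or identifier: keep untouched
        if preserve_positions:
            # Blank the comment out but keep its line breaks.
            return NOT_NEWLINE.sub(" ", match.group(0))
        return ""

    return TOKEN.sub(replace, sql)


sql = (
    "select 'a -- not a comment' as x, [/*] as y -- real comment\n"
    "from t /* block\ncomment */ where 1 = 1"
)
clean = remove_comments(sql)
padded = remove_comments(sql, preserve_positions=True)
```

Because the literal and identifier branches consume their whole token, the `--` inside the string and the `/*` inside the brackets never get a chance to start a comment match.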

Solution 5

This works for me:

(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))

It matches all comments starting with -- or enclosed within /* .. */ blocks.
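
A small sketch of what that means in practice, again using Python's `re` as a stand-in for the .NET Regex class (an assumption of this example; the pattern is unchanged and the sample strings are invented). It finds a multi-line block comment and a final line comment without a trailing newline, but on nested block comments it stops at the first `*/`.

```python
import re

# Pattern from Solution 5, unchanged.
PATTERN = re.compile(r"(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))")

script = (
    "/* This is a\n"
    "multi-line comment */\n"
    "DROP TABLE Bar -- trailing comment, no newline after it"
)
found = [m.group(0) for m in PATTERN.finditer(script)]

# Nested comments: the lazy match ends at the first closing marker,
# so the tail of the outer comment is left behind.
nested = "/* outer /* inner */ still outer */ SELECT 1"
leftover = PATTERN.sub("", nested)
```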

Author: bopapa_1979

Updated on June 18, 2022

Comments

  • bopapa_1979
    bopapa_1979 about 2 years

    I need a Regular Expression to capture ALL comments in a block of T-SQL. The Expression will need to work with the .Net Regex Class.

    Let's say I have the following T-SQL:

    -- This is Comment 1
    SELECT Foo FROM Bar
    GO
    
    -- This is
    -- Comment 2
    UPDATE Bar SET Foo == 'Foo'
    GO
    
    /* This is Comment 3 */
    DELETE FROM Bar WHERE Foo = 'Foo'
    
    /* This is a
    multi-line comment */
    DROP TABLE Bar
    

    I need to capture all of the comments, including the multi-line ones, so that I can strip them out.

    EDIT: It would serve the same purpose to have an expression that takes everything BUT the comments.

  • FailedDev
    FailedDev over 12 years
    No it doesn't. It does not support nested comments as the OP stated.
  • bopapa_1979
    bopapa_1979 over 12 years
    @FailedDev - That statement was in regards to a question. It would be a nice-to-have, not really a requirement.
  • Martin Smith
    Martin Smith over 12 years
    As in your question regular expressions are not suitable for this. e.g. SELECT '/* This is not a comment */' FROM (SELECT 1 AS C) [/* nor is this */] is perfectly valid TSQL.
  • jcvegan
    jcvegan over 9 years
    This doesn't support line breaks
  • drizin
    drizin over 8 years
    See my answer below. It works with nested block comments. stackoverflow.com/a/33947706/3606250
  • drizin
    drizin over 8 years
    I tried this method and it works pretty well, however the performance suffers. I compared my regex (see my answer below) to this Microsoft parser, and I achieved the same results (on more than 3 thousand scripts) in a fraction of the time.
  • David
    David over 8 years
    Taking this from your code: Regex rex = new Regex(@"/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?! /\* | \*/ ).)+(?(LEVEL)(?!))\*/", RegexOptions.Singleline | RegexOptions.CultureInvariant); MatchCollection mcoll = rex.Matches(thestring, 0); Strips all nested block comments. Thank you. It's the only one I found that actually works as intended.
  • drizin
    drizin over 8 years
    David, I'm glad it worked. But please note that using all 6 regexes at the same time is important, because a block comment could be located inside a string literal (e.g.: select '/* this is a literal */' from sys.tables), or even inside a line comment. So my suggestion is to change the lambda function to extract exactly what you want.
  • David
    David over 8 years
    Doesn't seem to work with this string: /* Stuff with nested comments /* wooooo */ Wonder whut comes next /* woot 2*/ create proc in dis with encryption */ CREATE procedure dbo.Woiotninntu with encryption as begin select * from master end
  • David
    David over 8 years
    This works: Regex rex = new Regex(@"/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?! /\*|\*/).)*(?(LEVEL)(?!))\*/", RegexOptions.Singleline | RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace);
  • Greg
    Greg about 8 years
    Trying to understand this regex and I'm like ¯\_(ツ)_/¯
  • Gayan
    Gayan almost 8 years
    This is a very good one but when you have a comment like this /* exec Some_Proc '123456789c', 'A86A23CE-F068-450B-9356-A21197B7A715|CD7EF1CA-3247-492B-804B-658C2473867D|B118E687-6CBB-4438-A407-80DE93995885' */ it will remove everything but |
  • drizin
    drizin almost 8 years
    David, I've just tested now and it's working for me. Are you using the complete code?
  • drizin
    drizin almost 8 years
    @GayanRanasinghe, you are right. There was an error in my everythingExceptNewLines regex, it was actually NOT replacing the pipe character. I've just fixed the regex, and it's working now. Thanks for the alert.
  • joebeeson
    joebeeson about 7 years
    If you have strings that contain what would otherwise be a SQL comment, this will match those as well unfortunately, e.g.: SELECT * FROM foo WHERE a = 'This has -- dashes'
  • Arin Taylor
    Arin Taylor almost 7 years
    @drizin can you consolidate the two regexes for single line comments by adding the RegexOptions.Multiline option as well and then only using the latter?
  • rbsdca
    rbsdca over 6 years
    I realize this is an old thread, but for anyone who got here via a search engine, I wanted to point out that I needed to escape the forward slash / to make this work, so this worked for me: (--.*)|(((\/\*)+?[\w\W]+?(\*\/)+))
  • Jan Doggen
    Jan Doggen over 3 years
    I don't get the quotedidentifiers part here. Using RegExBuddy with either PCRE or C# syntax, (\""((\""\"")|[^""])*\"") does not match CREATE TABLE "select" ("identity" INT IDENTITY NOT NULL, "order" INT NOT NULL);
  • drizin
    drizin over 3 years
    @JanDoggen I'm not familiar with RegExBuddy but I assume it's just about quoting correctly (since my code above is C#). I've tried the following regex with Ultrapico Expresso: ("(("")|[^"])*") and it worked fine - it could identify all 3 of your quoted identifiers ("select", "identity", "order"). But it doesn't match the whole DDL statement - maybe you're missing the purpose here? The purpose of the code above is to remove SQL comments, and for that it's important to identify the boundaries of literals and identifiers so that their contents do not get removed incorrectly as comments.
  • Jan Doggen
    Jan Doggen over 3 years
    Ah, of course. In any RegEx tool or in my Delphi implementation I don't need to write "". Then it works fine ;-)
  • Dai
    Dai about 3 years
    This regex doesn't match a -- comment if it appears on the last-line of a string without a trailing \n.