Remove text in-between delimiters in a string (using a regex?)


Solution 1

A simple regex would be:

string input = "Give [Me Some] Purple (And More) Elephants";
string regex = "(\\[.*\\])|(\".*\")|('.*')|(\\(.*\\))";
string output = Regex.Replace(input, regex, "");

If you want to build the regex up from a custom set of delimiters, write one part per delimiter pair:

('.*')  // example of the single quote check

Then concatenate the individual regex parts with an OR (the | in regex), as in my original example. Once you have your regex string built, just run it once. The key is to get the regex into a single check, because running many separate regex matches on each item and then iterating through a lot of items will probably cause a significant decrease in performance.

In my first example, the built-up regex would take the place of the hard-coded pattern string:

string input = "Give [Me Some] Purple (And More) Elephants";
string regex = "Your built up regex here";
string sOutput = Regex.Replace(input, regex, "");

I am sure someone will post a cool LINQ expression to build the regex from an array of delimiter objects, or something similar.
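As a minimal sketch of that idea (not the original answerer's code), the pattern could be assembled from a list of delimiter pairs with LINQ. The names DelimiterPatterns, Pairs, BuildPattern and RemoveDelimited are hypothetical, and the sketch assumes a lazy quantifier (.*?) so that two pairs of the same delimiter in one string are matched separately:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimiterPatterns
{
    // Hypothetical delimiter list; edit this array to add or remove pairs.
    public static readonly (char Open, char Close)[] Pairs =
    {
        ('[', ']'), ('(', ')'), ('"', '"'), ('\'', '\'')
    };

    // Builds one alternation, one part per delimiter pair.
    // Regex.Escape handles characters that are special in a regex.
    public static string BuildPattern((char Open, char Close)[] pairs)
    {
        return string.Join("|", pairs.Select(p =>
            Regex.Escape(p.Open.ToString()) + ".*?" + Regex.Escape(p.Close.ToString())));
    }

    // Removes every delimited fragment in a single Regex.Replace call.
    public static string RemoveDelimited(string input)
    {
        return Regex.Replace(input, BuildPattern(Pairs), string.Empty);
    }
}
```

Note that this only removes the delimited fragments; the spaces on either side of a removed fragment stay in place, so "Give [Me Some] Purple" becomes "Give  Purple" with two spaces unless whitespace is collapsed afterwards.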

Solution 2

A simple way would be to do this:

string RemoveBetween(string s, char begin, char end)
{
    Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
    return regex.Replace(s, string.Empty);
}

string s = "Give [Me Some] Purple (And More) \\Elephants/ and .hats^";
s = RemoveBetween(s, '(', ')');
s = RemoveBetween(s, '[', ']');
s = RemoveBetween(s, '\\', '/');
s = RemoveBetween(s, '.', '^');

Changing the return statement to the following will avoid duplicate empty spaces:

return new Regex(" +").Replace(regex.Replace(s, string.Empty), " ");

The final result for this would be:

"Give Purple and "

Disclaimer: A single regex would probably be faster than this.
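One caveat with the "\\{0}" escaping used in RemoveBetween: putting a backslash in front of an arbitrary character works for punctuation, but a letter or digit delimiter could turn into a regex escape such as \d or \b. A hedged variant, assuming Regex.Escape is an acceptable substitute (the name RemoveBetweenEscaped is hypothetical), might look like this:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterRemover
{
    // Hypothetical variant of RemoveBetween: Regex.Escape only escapes
    // characters that are actually special in a regex.
    public static string RemoveBetweenEscaped(string s, char begin, char end)
    {
        string pattern = Regex.Escape(begin.ToString()) + ".*?" + Regex.Escape(end.ToString());
        string stripped = Regex.Replace(s, pattern, string.Empty);

        // Collapse runs of spaces left behind by the removal.
        return Regex.Replace(stripped, " +", " ");
    }
}
```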

Solution 3

I have to add the old adage, "You have a problem and you want to use regular expressions. Now you have two problems."

I've come up with a quick regex that will hopefully help you in the direction you are looking:

[.]*(\(|\[|\"|').*(\]|\)|\"|')[.]*

The parentheses, brackets, and double quotes are escaped, while the single quote can be left alone.

To put the above expression into English, I'm allowing for any number of characters before and any number after, matching the expression in between matching delimiters.

The open delimiter phrase is (\(|\[|\"|'), and it has a matching closing phrase. To make this a bit more extensible in the future, you could remove the actual delimiters from the expression and keep them in a config file, a database, or wherever you choose.

Solution 4

Building on Bryan Menard's regular expression, I made an extension method which will also work for nested replacements like "[Test 1 [[Test2] Test3]] Hello World":

    /// <summary>
    /// Method used to remove the characters between certain letters in a string.
    /// </summary>
    /// <param name="rawString"></param>
    /// <param name="enter"></param>
    /// <param name="exit"></param>
    /// <returns></returns>
    public static string RemoveFragmentsBetween(this string rawString, char enter, char exit) 
    {
        if (rawString.Contains(enter) && rawString.Contains(exit))
        {
            int substringStartIndex = rawString.IndexOf(enter) + 1;
            int substringLength = rawString.LastIndexOf(exit) - substringStartIndex;

            if (substringLength > 0 && substringStartIndex > 0)
            {
                string substring = rawString.Substring(substringStartIndex, substringLength).RemoveFragmentsBetween(enter, exit);
                if (substring.Length != substringLength) // This would mean that letters have been removed
                {
                    rawString = rawString.Remove(substringStartIndex, substringLength).Insert(substringStartIndex, substring).Trim();
                }
            }

            //Source: https://stackoverflow.com/a/1359521/3407324
            Regex regex = new Regex(String.Format("\\{0}.*?\\{1}", enter, exit));
            return new Regex(" +").Replace(regex.Replace(rawString, string.Empty), " ").Trim(); // Removing duplicate and trailing/leading spaces
        }
        else
        {
            return rawString;
        }
    }

In the suggested case, usage of this method would look like this:

string testString = "[Test 1 [[Test2] Test3]] Hello World";
string result = testString.RemoveFragmentsBetween('[', ']');

This returns the string "Hello World".
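The comments report inputs with several sibling groups where this method leaves a stray delimiter behind. As an alternative sketch (not the approach above, and not necessarily the one linked in the comments), a single pass with a depth counter handles nesting without a regex. The name RemoveNestedBetween is hypothetical, and the sketch assumes the opening and closing delimiters are different characters:

```csharp
using System;
using System.Text;

public static class NestedDelimiterExtensions
{
    // Hypothetical helper: drops everything between balanced enter/exit
    // delimiters, including nested groups, in one pass over the string.
    public static string RemoveNestedBetween(this string raw, char enter, char exit)
    {
        if (enter == exit)
        {
            throw new ArgumentException("Nesting needs distinct open and close delimiters.");
        }

        var result = new StringBuilder(raw.Length);
        int depth = 0; // number of currently open groups

        foreach (char c in raw)
        {
            if (c == enter)
            {
                depth++;
            }
            else if (c == exit && depth > 0)
            {
                depth--;
            }
            else if (depth == 0)
            {
                result.Append(c); // outside every group: keep the character
            }
        }

        // An opener that was never closed means the input is unbalanced;
        // in that case the string is returned unmodified.
        return depth == 0 ? result.ToString() : raw;
    }
}
```

Unlike RemoveFragmentsBetween, this sketch does not trim or collapse spaces, so the whitespace around removed groups is left in the result.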


Author: p.campbell

Updated on November 24, 2020

Comments

  • p.campbell
    p.campbell over 3 years

    Consider the requirement to find a matched pair of delimiter characters and remove any characters between them, as well as the delimiters themselves.

    Here are the sets of delimiters:

     []    square brackets
     ()    parentheses
     ""    double quotes
     ''    single quotes
    

    Here are some examples of strings that should match:

     Given:                       Results In:
    -------------------------------------------
     Hello "some" World           Hello World
     Give [Me Some] Purple        Give Purple
     Have Fifteen (Lunch Today)   Have Fifteen
     Have 'a good'day             Have day
    

    And some examples of strings that should not match:

     Does Not Match:
    ------------------
     Hello "world
     Brown]co[w
     Cheese'factory
    

    If the given string doesn't contain a matching set of delimiters, it isn't modified. The input string may have many matching pairs of delimiters. If a set of 2 delimiters are overlapping (i.e. he[llo "worl]d"), that'd be an edge case that we can ignore here.

    The algorithm would look something like this:

    string myInput = "Give [Me Some] Purple (And More) Elephants";
    string pattern; //some pattern
    string output = Regex.Replace(myInput, pattern, string.Empty);
    

    Question: How would you achieve this with C#? I am leaning towards a regex.

    Bonus: Are there easy ways of keeping those start and end delimiters in constants or in a list of some kind? Ideally, the delimiters would be easy to change in case the business analysts come up with new sets of delimiters.

  • James
    James over 14 years
    +1 regex seems to do what he needs. Just a simple regex.Replace is needed to round it off.
  • csharptest.net
    csharptest.net over 14 years
    bump for the "... Now you have two problems.", LOL
  • tymtam
    tymtam over 11 years
    This would not work as (most likely) expected for "Give [Me Some] Purple (And More) [Big] Elephants". This can be solved by using '.*?' instead of '.*' in the expression provided above.
  • Admin
    Admin over 11 years
    The OP included no mention of 'and hats.' "Give me purple and more elephants" was what OP explicitly requested. Why have you twisted his words and added hats to the equation?
  • Admin
    Admin over 10 years
    +1. Found myself back at this thread and didn't realize I'd posted the above comment! Poor attempt at humor. Thanks for your answer.
  • Bryan Menard
    Bryan Menard over 10 years
    Why hats?! I guess it's my own poor attempt at humor ;). Glad to see this is still useful.
  • Håkon Seljåsen
    Håkon Seljåsen over 7 years
    I like this approach, but it does not work if you have multiple layers of betweens, like this: "[[One string] another string]" which becomes " another string]"
  • Brent Oliver
    Brent Oliver over 5 years
    When I place this method into a page I get a warning that string does not contain definition for RemoveFragmentsBetween.
  • Håkon Seljåsen
    Håkon Seljåsen over 5 years
    I guess you have placed it in a not included namespace. Try googling "string does not contain definition for extension C#"
  • jing
    jing about 5 years
    It's not 100%. "[Test 1] [Test 2 [Test3]] Hello World".RemoveFragmentsBetween('[', ']') returns "] Hello World".
  • jing
    jing about 5 years
    Following solution seems to be more robust for nested parentheses: stackoverflow.com/a/14407908/86047