Clean the string: is there a better way of doing it?


Solution 1

OK, consider the following test:

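using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public class CleanString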
{
    //by MSDN http://msdn.microsoft.com/en-us/library/844skk0h(v=vs.71).aspx
    public static string UseRegex(string strIn)
    {
        // Replace invalid characters with empty strings.
        return Regex.Replace(strIn, @"[^\w\.@-]", "");
    }

    // by Paolo Tedesco
    public static String UseStringBuilder(string strIn)
    {
        const string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
        // specify capacity of StringBuilder to avoid resizing
        StringBuilder sb = new StringBuilder(strIn.Length);
        foreach (char x in strIn.Where(c => !removeChars.Contains(c)))
        {
            sb.Append(x);
        }
        return sb.ToString();
    }

    // by Paolo Tedesco, but using a HashSet
    public static String UseStringBuilderWithHashSet(string strIn)
    {
        var hashSet = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
        // specify capacity of StringBuilder to avoid resizing
        StringBuilder sb = new StringBuilder(strIn.Length);
        foreach (char x in strIn.Where(c => !hashSet.Contains(c)))
        {
            sb.Append(x);
        }
        return sb.ToString();
    }

    // by SteveDog
    public static string UseStringBuilderWithHashSet2(string dirtyString)
    {
        HashSet<char> removeChars = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
        StringBuilder result = new StringBuilder(dirtyString.Length);
        foreach (char c in dirtyString)
            if (!removeChars.Contains(c)) // keep only characters that are not in the removal set
                result.Append(c);
        return result.ToString();
    }

    // original by patel.milanb
    public static string UseReplace(string dirtyString)
    {
        string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
        string result = dirtyString;

        foreach (char c in removeChars)
        {
            result = result.Replace(c.ToString(), string.Empty);
        }

        return result;
    }

    // by L.B
    public static string UseWhere(string dirtyString)
    {
        return new String(dirtyString.Where(Char.IsLetterOrDigit).ToArray());
    }
}

static class Program
{
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    [STAThread]
    static void Main()
    {
        var dirtyString = "sdfdf.dsf8908()=(=(sadfJJLef@ssyd€sdöf////fj()=/§(§&/(\"&sdfdf.dsf8908()=(=(sadfJJLef@ssyd€sdöf////fj()=/§(§&/(\"&sdfdf.dsf8908()=(=(sadfJJLef@ssyd€sdöf";
        var sw = new Stopwatch();

        var iterations = 50000;

        sw.Start();
        for (var i = 0; i < iterations; i++)
            CleanString.<SomeMethod>(dirtyString);
        sw.Stop();
        Debug.WriteLine("CleanString.<SomeMethod>: " + sw.ElapsedMilliseconds.ToString());
        sw.Reset();

        ....
        <repeat>
        ....       
    }
}
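
The <SomeMethod> placeholder and the repeated timing block could be folded into a small helper that takes the method as a delegate. This is only a sketch: CleanBenchmark, Time and Run are hypothetical names, and the warm-up call is an addition that the original test did not have.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class CleanBenchmark
{
    // Times `iterations` calls of `clean` on `input` and returns the elapsed milliseconds.
    public static long Time(Func<string, string> clean, string input, int iterations)
    {
        clean(input); // warm-up call so that JIT compilation is not part of the measurement

        var sw = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
            clean(input);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Runs every candidate and returns a map of name -> elapsed milliseconds.
    public static Dictionary<string, long> Run(
        IDictionary<string, Func<string, string>> candidates, string input, int iterations)
    {
        return candidates.ToDictionary(kv => kv.Key, kv => Time(kv.Value, input, iterations));
    }
}
```

It could be called with a dictionary such as { "UseWhere", CleanString.UseWhere }, { "UseRegex", CleanString.UseRegex }, and so on. Timings from a debug build or with a debugger attached are likely to differ from a release build.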

Output (elapsed milliseconds for 50,000 iterations)

CleanString.UseReplace: 791
CleanString.UseStringBuilder: 2805
CleanString.UseStringBuilderWithHashSet: 521
CleanString.UseStringBuilderWithHashSet2: 331
CleanString.UseRegex: 1700
CleanString.UseWhere: 233

Conclusion

It probably does not matter which method you use.

The difference between the fastest method (UseWhere: 233 ms) and the slowest (UseStringBuilder: 2805 ms) is 2572 ms over 50,000 (!) consecutive calls, or roughly 0.05 ms per call. You probably do not need to care about it if you don't run the method that often.

But if you do, use the UseWhere method (written by L.B). Note that it behaves slightly differently: it keeps only letters and digits, so it also removes characters such as '.', '=' and '/' that the removeChars-based methods leave in place.
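
To make the difference concrete, here is a small side-by-side sketch. CleanComparison, RemoveListed and KeepLettersAndDigits are illustrative names, not part of the original test; the first removes only the listed characters, the second keeps only letters and digits, as UseWhere does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CleanComparison
{
    // Same character list as removeChars in the test above.
    private static readonly HashSet<char> RemoveChars =
        new HashSet<char>(" ?&^$#@!()+-,:;<>’'-_*");

    // Blacklist: drops only the characters listed in RemoveChars.
    public static string RemoveListed(string s)
    {
        return new string(s.Where(c => !RemoveChars.Contains(c)).ToArray());
    }

    // Whitelist: keeps only letters and digits (the UseWhere approach).
    public static string KeepLettersAndDigits(string s)
    {
        return new string(s.Where(c => char.IsLetterOrDigit(c)).ToArray());
    }
}
```

For the input "a.b=c/d (1)", RemoveListed returns "a.b=c/d1" while KeepLettersAndDigits returns "abcd1", so the two are interchangeable only if every non-alphanumeric character should go.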

Solution 2

If it's purely speed and efficiency you are after, I would recommend doing something like this:

public static string CleanString(string dirtyString)
{
    HashSet<char> removeChars = new HashSet<char>(" ?&^$#@!()+-,:;<>’\'-_*");
    StringBuilder result = new StringBuilder(dirtyString.Length);
    foreach (char c in dirtyString)
        if (!removeChars.Contains(c)) // prevent dirty chars
            result.Append(c);
    return result.ToString();
}

RegEx is certainly an elegant solution, but it adds extra overhead. By specifying the starting length of the string builder, it will only need to allocate the memory once (and a second time for the ToString at the end). This will cut down on memory usage and increase the speed, especially on longer strings.
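
One further refinement, offered as a sketch rather than a measured improvement: the method above builds a new HashSet on every call, which could be avoided by keeping the set in a static readonly field. StringCleaner and the null/empty guard are assumptions added for illustration.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StringCleaner
{
    // Built once and reused by every call; it is never modified after construction.
    private static readonly HashSet<char> RemoveChars =
        new HashSet<char>(" ?&^$#@!()+-,:;<>’'-_*");

    public static string Clean(string dirtyString)
    {
        // Nothing to clean for null or empty input.
        if (string.IsNullOrEmpty(dirtyString))
            return dirtyString;

        // Capacity equals the input length, so the buffer never has to grow.
        var result = new StringBuilder(dirtyString.Length);
        foreach (char c in dirtyString)
        {
            if (!RemoveChars.Contains(c))
                result.Append(c);
        }
        return result.ToString();
    }
}
```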

However, as L.B. said, if you are using this to properly encode text that is bound for HTML output, you should be using HttpUtility.HtmlEncode instead of doing it yourself.
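
As a rough illustration of encoding instead of cleaning, the sketch below builds an <a href> tag from a query-string value. It uses WebUtility.HtmlEncode from System.Net in place of HttpUtility.HtmlEncode so that no reference to System.Web is needed; LinkBuilder, BuildLink and its parameters are hypothetical names.

```csharp
using System;
using System.Net;

public static class LinkBuilder
{
    // Builds an anchor tag; baseUrl, key, value and text are illustrative parameters.
    public static string BuildLink(string baseUrl, string key, string value, string text)
    {
        // Percent-encode the query-string parts so '&', '=' and spaces cannot break the URL.
        string url = baseUrl + "?" + Uri.EscapeDataString(key) + "=" + Uri.EscapeDataString(value);

        // HTML-encode everything that ends up in the markup.
        return "<a href=\"" + WebUtility.HtmlEncode(url) + "\">" + WebUtility.HtmlEncode(text) + "</a>";
    }
}
```

With this approach the characters are preserved in encoded form rather than silently dropped, so the original value can still be recovered on the receiving side.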

Solution 3

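Use the regex [ ?&^$#@!()+,:;<>’'_*-] and replace every match with an empty string. The hyphen is placed last so that it is matched literally; written as '-_ in the middle of the class, it would define a range from ' to _ that also matches digits and uppercase letters.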
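
A minimal sketch of the regex approach, assuming the same character set as removeChars in the question. RegexCleaner and RemovePattern are illustrative names, and RegexOptions.Compiled is an optional choice that trades start-up cost for faster matching.

```csharp
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // The hyphen comes last in the class so it is a literal '-' and not a range operator.
    private static readonly Regex RemovePattern =
        new Regex(@"[ ?&^$#@!()+,:;<>’'_*-]", RegexOptions.Compiled);

    // Removes every character matched by the class in a single pass.
    public static string Clean(string dirtyString)
    {
        return RemovePattern.Replace(dirtyString, string.Empty);
    }
}
```

Creating the Regex once and reusing it avoids re-parsing the pattern on each call.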

Solution 4

This one is even faster!
use:

string dirty=@"tfgtf$@$%gttg%$% 664%$";
string clean = dirty.Clean();


    public static string Clean(this String name)
    {
        var namearray = new Char[name.Length];

        var newIndex = 0;
        for (var index = 0; index < namearray.Length; index++)
        {
            var letter = (Int32)name[index];

            if (!((letter > 96 && letter < 123) || (letter > 64 && letter < 91) || (letter > 47 && letter < 58)))
                continue;

            namearray[newIndex] = (Char)letter;
            ++newIndex;
        }

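        // Copy only the characters that were kept; the unused slots would otherwise remain as '\0',
        // which TrimEnd() does not remove.
        return new String(namearray, 0, newIndex);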
    }

Solution 5

I don't know if, performance-wise, using a Regex or LINQ would be an improvement.
What could help is to build the new string with a StringBuilder instead of calling string.Replace each time:

using System.Linq;
using System.Text;

static class Program {
    static void Main(string[] args) {
        const string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
        string result = "x&y(z)";
        // specify capacity of StringBuilder to avoid resizing
        StringBuilder sb = new StringBuilder(result.Length);
        foreach (char x in result.Where(c => !removeChars.Contains(c))) {
            sb.Append(x);
        }
        result = sb.ToString();
    }
}

Author: patel.milanb

Updated on August 21, 2020

Comments

  • patel.milanb
    patel.milanb over 3 years

    I am using this method to clean the string

    public static string CleanString(string dirtyString)
    {
        string removeChars = " ?&^$#@!()+-,:;<>’\'-_*";
        string result = dirtyString;
    
        foreach (char c in removeChars)
        {
            result = result.Replace(c.ToString(), string.Empty);
        }
    
        return result;
    }
    

    This method works fine, but it has a performance problem: every time I pass in a string, it loops once for each character in removeChars, so a large string takes too long to process.

    Is there a better way of doing the same thing, for example with LINQ or jQuery/JavaScript?

    Any suggestions would be appreciated.

    • Russ Cam
      Russ Cam almost 12 years
      For what purpose are you "cleaning" a string?
    • patel.milanb
      patel.milanb almost 12 years
      I am basically dealing with a lot of query string values...
    • akhil
      akhil almost 12 years
      you just want to make a string null or what?
    • nhahtdh
      nhahtdh almost 12 years
      Put all characters in a character class of regex, then replace all at once.
    • Furqan Hameedi
      Furqan Hameedi almost 12 years
      explore System.Text.RegularExpression namespace for this
    • Stuart.Sklinar
      Stuart.Sklinar almost 12 years
      Could this be done with RegEx?
    • hatchet - done with SOverflow
      hatchet - done with SOverflow almost 12 years
      Define "better". Any solution will have a loop over the characters. The drawback in your code is excess creation of string objects, not the loop over every character.
    • Mark Peters
      Mark Peters almost 12 years
      I'm a little concerned about you "cleaning" a query string. Can you describe what you are doing with the cleaned string?
    • patel.milanb
      patel.milanb almost 12 years
      so what do you suggest, which string objects i can remove?
    • patel.milanb
      patel.milanb almost 12 years
      There are values in the query string from which I have to build an <a href> tag. In some cases I have values coming from the database with HTML tags included, and I want to show them on pages.
    • Security Hound
      Security Hound almost 12 years
      @patel.milanb - If you are using this to connect to a SQL database then you're doing it wrong.
    • L.B
      L.B almost 12 years
      @patel.milanb Then what you are looking for is HttpUtility.HtmlEncode not string cleaning
  • patel.milanb
    patel.milanb almost 12 years
    this certainly helps. opens up a new idea for me using the StringBuilder class
  • L.B
    L.B almost 12 years
    removeChars.Contains is O(n). A HashSet would be better.
  • L.B
    L.B almost 12 years
    removeChars.IndexOf is O(n) operation . A HashSet would be better.
  • sloth
    sloth almost 12 years
    output should be result. Also you can omit .ToCharArray(), since a string implements IEnumerable<char>.
  • Steven Doggart
    Steven Doggart almost 12 years
    Grrr.. Thanks @BigYellowCactus. Don't know how I missed that.
  • L.B
    L.B almost 12 years
    You can also use a one-liner return new String(dirtyString.Where(c => !removeChars.Contains(c)).ToArray());
  • L.B
    L.B almost 12 years
    What would this give return new String(dirtyString.Where(Char.IsLetterOrDigit).ToArray()) at your machine?
  • sloth
    sloth almost 12 years
    It's fast. 50000 iterations: 182ms (next one is UseStringBuilderWithHashSet2 with 266ms)
  • Guillaume Beauvois
    Guillaume Beauvois almost 9 years
    Just for the record, for UseStringBuilderWithHashSet and UseStringBuilderWithHashSet2 the test should be if (!removeChars.Contains(c))
  • Evaldas Raisutis
    Evaldas Raisutis over 8 years
    How would you add white space to the removeChars HashSet?
  • Steven Doggart
    Steven Doggart over 8 years
    @Qweick well, the space character is already included, but if there were any other white space characters that you wanted to include, you could just concatenate them to the string (e.g. "..." & vbTab).
  • Evaldas Raisutis
    Evaldas Raisutis over 8 years
    @StevenDoggart grrh, yes, thanks :) For some reason I assumed there had to be a symbol for that :))
  • ATutorMe
    ATutorMe over 7 years
    Can L.B's UseWhere method be extended to allow additional characters? Like this:

        public static string UseWhereExtended(string dirtyString)
        {
            IEnumerable<char> stringQuery =
                from ch in dirtyString
                where char.IsLetterOrDigit(ch) || ch == '.' || ch == ',' || ch == '\'' || ch == '\"' || ch == '?' || ch == '!'
                select ch;
            return new string(stringQuery.ToArray());
        }
  • Daxtron2
    Daxtron2 almost 6 years
    I think there's an error in UseStringBuilderWithHashSet2, shouldn't if(removeChars.Contains(c)) be if(!removeChars.Contains(c))?