Remove all "invisible" chars from a string?


Solution 1

The requirements are too fuzzy. Consider:

"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key?"

These ambiguities will result in code filled with one-offs and a poor user experience. This is why we have language rules and grammars.

Define a simple grammar and take out most of the guesswork.

"{key}":"{value}",

Here you have a key/value pair in which the key and the value are each contained within quotes, and pairs are separated by a delimiter (,). All extraneous characters outside the quotes can be ignored. You could use XML, but this may scare off less techy users.

Note, the quotes are arbitrary. Feel free to replace them with any pair of container characters that will not need much escaping (just beware the complexity).
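As a rough illustration of how little code the quoted grammar needs, here is a minimal sketch. The class name QuotedPairParser and the regular expression are assumptions made for this example, not part of the original answer, and the sketch assumes keys and values never contain a double quote (there is no escaping support).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for the "{key}":"{value}", grammar described above.
public static class QuotedPairParser
{
    // Matches one "key":"value" pair; whitespace around the colon is allowed.
    // Assumption: keys and values contain no double quotes.
    private static readonly Regex Pair =
        new Regex("\"([^\"]*)\"\\s*:\\s*\"([^\"]*)\"");

    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pair.Matches(text))
        {
            // A later duplicate key overwrites the earlier one instead of throwing.
            result[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return result;
    }
}
```

Anything outside the quoted pairs (spaces, tabs, blank lines, stray text) is simply never matched, so it needs no special handling. Whitespace inside the quotes is preserved.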

Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.

Solution 2

I did this one recently when I finally got pissed off at all the undocumented garbage that was coming through in a feed and producing bad XML. It effectively strips out anything that doesn't fall between a space and the ~ in the ASCII table:

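// Extension method: declare it inside a static class; needs using System.Text.RegularExpressions.
public static string StripControlChars(this string s)
{
    // Keep only printable ASCII, from space (\x20) through ~ (\x7E); \x7F is DEL, a control character.
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}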

Combined with the other RegEx examples already posted it should get you where you want to go.

Solution 3

If you use Regex (Regular Expressions) you can filter out all of that with one function.

string newVariable = Regex.Replace(variable, @"\s", "");

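That will remove all whitespace, including spaces, tabs, \n, and \r. Note that \s matches whitespace only, so invisible characters that are not whitespace (for example the left-to-right mark, U+200E) are left in place.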
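If invisible characters beyond whitespace also need to go, one possible extension is to add Unicode categories to the character class. This is a sketch, not part of the original answer: the class name InvisibleChars is made up for the example, and the choice of categories (control characters, Cc, and format characters, Cf) is an assumption about what "invisible" should cover.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: removes whitespace plus control and format characters.
public static class InvisibleChars
{
    // \s      : whitespace (spaces, tabs, line breaks, non-breaking space, ...)
    // \p{Cc}  : control characters
    // \p{Cf}  : format characters such as U+200E (left-to-right mark)
    private static readonly Regex Invisible = new Regex(@"[\s\p{Cc}\p{Cf}]");

    public static string Strip(string s)
    {
        return Invisible.Replace(s, "");
    }
}
```

For example, InvisibleChars.Strip("k\u200Eey 1\t:\u00A0value\r\n") returns "key1:value", whereas the plain \s pattern would leave the U+200E character in the key.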

Solution 4

One of the "white" spaces that regularly bites us is the non-breaking space. Also, our system must be compatible with MS-Dynamics, which is much more restrictive. First, I created a function that maps 8-bit characters to their approximate 7-bit counterparts; then I removed anything that was not in the x20 to x7f range, further limited by the Dynamics interface.

Regex.Replace(s, @"[^\x20-\x7F]", "")

should do that job.
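The mapping function itself is not shown in the answer. The following is only a guess at what such a step could look like, using Unicode decomposition to fold accented letters to their base letters. The class and method names are hypothetical, the Dynamics-specific restrictions are not modelled, and the final range stops at ~ (\x7E) so that DEL (\x7F) is also dropped.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sketch of "map to a 7-bit approximation, then strip the rest".
public static class AsciiFolding
{
    public static string ToPrintableAscii(string s)
    {
        // Treat the non-breaking space (U+00A0) as an ordinary space.
        s = s.Replace('\u00A0', ' ');

        // Decompose accented letters (e.g. "é" becomes "e" plus a combining accent),
        // then drop the combining marks so only the base letter remains.
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Finally remove whatever is still outside printable ASCII.
        return Regex.Replace(sb.ToString(), @"[^\x20-\x7E]", "");
    }
}
```

Characters with no decomposition (for example "ß" or "ø") are simply removed by the last step; a real mapping table would need explicit entries for those.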

Solution 5

var split = textLine.Split(':').Select(s => s.Trim()).ToArray();

The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.
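Put together for a whole file, the split-and-trim idea might look like the sketch below. The class name KeyValueFile is invented for the example, and it makes two assumptions that the answer does not state: it splits on the first ':' only (so a value may itself contain a colon, as suggested in the comments), and a later duplicate key overwrites an earlier one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: parses "key:value" lines into a dictionary.
public static class KeyValueFile
{
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            // Skip blank lines left behind by manual editing.
            if (string.IsNullOrWhiteSpace(line)) continue;

            // Split on the first ':' only, so a value may itself contain ':'.
            string[] parts = line.Split(new[] { ':' }, 2);
            if (parts.Length != 2)
                throw new FormatException("Invalid format: " + line);

            // Trim removes leading and trailing whitespace, not whitespace in the middle.
            result[parts[0].Trim()] = parts[1].Trim();
        }
        return result;
    }
}
```

It could be fed with File.ReadAllLines(path). If a second colon should count as an error instead, drop the count argument from Split and keep the length check.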

Author by

Juan

I'm a software developer, currently working on my personal project http://www.heliumscraper.com.

Updated on July 21, 2022

Comments

  • Juan
    Juan almost 2 years

    I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:

    key1:value1
    key2:value2
    key3:value3
    ...
    

    This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespace, tabs, extra line breaks and stuff like that? I can probably use Replace to remove spaces and tabs, but are there any other "invisible" characters I'm missing?

    Or maybe I can remove all characters that are not alphanumeric, ":" or line breaks (since line breaks are what separate one pair from another), and then remove all extra line breaks. If so, I don't know how to remove "all-except-some" characters.

    Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show an "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.

    Note: I do NOT need any whitespaces at all, even inside a key or a value.

  • Richard Marskell - Drackir
    Richard Marskell - Drackir about 13 years
    Trim only removes whitespace at the beginning or end of strings, not ALL whitespace.
  • Dan Bryant
    Dan Bryant about 13 years
    @Drackir, yeah, just caught that; that raises the question of whether you really want to remove whitespace inserted in the middle of a key.
  • Paul Alexander
    Paul Alexander about 13 years
    This will remove spaces from the keys and values as well. You might just want to remove control characters like \t, \n, \r and double spaces.
  • Kyle Uithoven
    Kyle Uithoven about 13 years
    I believe he specifically said that he would like to deal with whitespace, tabs, as well as invisible characters, which includes control characters.
  • Richard Marskell - Drackir
    Richard Marskell - Drackir about 13 years
    Seems to me, the OP wants to remove all whitespace.
  • mgronber
    mgronber about 13 years
    Split(new[] { ':' }, 2) could be better.
  • Paul Alexander
    Paul Alexander about 13 years
    Control characters, yes, but a space may be a valid character in the value portion of the key/value pair. The OP doesn't specify, that's why it's just a comment to point out alternatives.
  • Juan
    Juan about 13 years
    I think that will work AFTER splitting each line. Can you give me a link to any documentation related to that particular regex?
  • Juan
    Juan about 13 years
    Never mind, I already found it here: mikesdotnetting.com/Article/46/…. And yes, that's what I needed.
  • Juan
    Juan about 13 years
    Actually you are right. This would be my grammar: keys can be: "A-Za-z0-9", values can be: "A-Za-z0-9", key/value separator: ":", line separator: "\n". I think with that I can easily figure out some regular expressions to remove all unnecessary characters, perhaps by using the negation operator.
  • Benjamin Toueg
    Benjamin Toueg about 11 years
    It doesn't work with the "left-to-right mark" which is an invisible character fileformat.info/info/unicode/char/200e/index.htm
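The whitelist grammar Juan describes in the comments (keys and values limited to A-Za-z0-9, ":" as the separator, "\n" between pairs) can be sketched with a negated character class, which also takes care of characters like the left-to-right mark. This is an illustrative sketch only: the class name StrictKeyValueParser is made up, and it assumes lines end in "\n" or "\r\n".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper implementing the "all-except-some" cleanup from the comments.
public static class StrictKeyValueParser
{
    public static Dictionary<string, string> Parse(string text)
    {
        // Negated class: delete everything that is not a letter, digit, ':' or '\n'.
        // This also drops '\r', tabs, spaces, and invisible marks such as U+200E.
        string cleaned = Regex.Replace(text, @"[^A-Za-z0-9:\n]", "");

        var result = new Dictionary<string, string>();
        foreach (string line in cleaned.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Each remaining line must be exactly key:value.
            Match m = Regex.Match(line, @"^([A-Za-z0-9]+):([A-Za-z0-9]+)$");
            if (!m.Success)
                throw new FormatException("Invalid format: " + line);
            result[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return result;
    }
}
```

With this approach "key1:value1:somethingelse" fails the per-line check and raises the "Invalid format" error, which matches the behaviour asked for in the question.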