How to encode the ampersand if it is not already encoded?

Solution 1

This should do a pretty good job:

text = Regex.Replace(text, @"
    # Match & that is not part of an HTML entity.
    &                  # Match literal &.
    (?!                # But only if it is NOT...
      \w+;             # an alphanumeric entity,
    | \#[0-9]+;        # or a decimal entity,
    | \#x[0-9A-F]+;    # or a hexadecimal entity.
    )                  # End negative lookahead.",
    "&amp;",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
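
To see how this behaves on the inputs from the question, here is a minimal sketch. The class and method names (`AmpersandDemo`, `EncodeBareAmpersands`) are made up for illustration, and the pattern is the same one as above collapsed onto a single line, so `IgnorePatternWhitespace` is no longer needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandDemo
{
    // Same lookahead as above, without the free-spacing layout.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!\w+;|#[0-9]+;|#x[0-9A-F]+;)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: encodes only ampersands that do not start an entity.
    public static string EncodeBareAmpersands(string text) =>
        BareAmpersand.Replace(text, "&amp;");

    public static void Main()
    {
        string[] inputs =
        {
            "tom & jill",
            "tom &amp; jill",
            "tom &euro; jill",
            "tom <&> jill",
            "tom &quot;&&quot; jill",
        };

        foreach (string input in inputs)
            Console.WriteLine($"{input}  ->  {EncodeBareAmpersands(input)}");
    }
}
```

Note that `\w+;` accepts any word-like name, so an unknown entity such as `&stackoverflow;` would be left untouched as well.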

Solution 2

What you actually want to do is first decode the string and then encode it again. Don't bother trying to patch an encoded string.

Any encoding is only worth its salt if it can be decoded easily, so reuse that logic to make your life easier and your software less bug-prone.

Now, if you are unsure of whether the string is encoded or not - the problem will most certainly not be the string itself, but the ecosystem that produced the string. Where did you get it from? Who did it pass through before it got to you? Do you trust it?

If you really have to resort to creating a magic-fix-weird-data function, then consider building a table of "encodings" and their corresponding characters:

&amp; -> &
&euro; -> €
&lt; -> <
// etc.

Then, first decode all encountered encodings according to the table and later re-encode the whole string. Sure, you might get more efficient methods when fumbling without decoding first. But you won't be sane next year. And this is your career, right? You need to stay right in the head! You'll lose your mind if you try to be too clever. And you'll lose your job when you go mad. Sad things happen to people who let maintaining their hacks destroy their minds...

EDIT: Using the .NET library, of course, will save you from madness:

I just tested it, and it seems to have no problems with decoding strings with just ampersands in them. So, go ahead:

string magic(string encodedOrNot)
{
    var decoded = HttpUtility.HtmlDecode(encodedOrNot);
    return HttpUtility.HtmlEncode(decoded);
}

EDIT#2: It turns out that the decoder HttpUtility.HtmlDecode will work for your purpose, but the encoder will not, since you don't want angle brackets (<, >) to be encoded. But writing an encoder is really easy:

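// Requires: using System.Collections.Generic; using System.Text;
string Encode(string decoded, IDictionary<char, string> encodingTable)
{
    var result = new StringBuilder();
    foreach (char character in decoded)
    {
        // Emit the entity when the table has one; otherwise copy the character.
        if (encodingTable.TryGetValue(character, out string entity))
            result.Append(entity);
        else
            result.Append(character);
    }
    return result.ToString();
}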

Solution 3

With regex, it can be done with a negative lookahead:

&(?![^& ]+;)
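
A minimal sketch of how this pattern could be applied in C#; the names `LookaheadDemo` and `EncodeLoose` are illustrative only. Because the lookahead accepts any run of characters other than `&` and space followed by `;`, it is looser than a check for real entity names, and the last input below shows an ampersand in a URL that it leaves unencoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // '&' not followed by a run of non-'&', non-space characters ending in ';'.
    public static string EncodeLoose(string text) =>
        Regex.Replace(text, @"&(?![^& ]+;)", "&amp;");

    public static void Main()
    {
        Console.WriteLine(EncodeLoose("tom <&> jill"));
        Console.WriteLine(EncodeLoose("tom &quot;&&quot; jill"));

        // The '&' before "var2" is skipped: "var2=val2_with_a_;" looks
        // like an entity to this pattern, so the URL comes back unchanged.
        Console.WriteLine(EncodeLoose(
            "http://example.com/path?var1=val1&var2=val2_with_a_;_semicolon_in_it"));
    }
}
```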


Author: Petras

Updated on June 04, 2022

Comments

  • Petras
    Petras almost 2 years

    I need a C# method to encode ampersands if they are not already encoded or part of another encoded expression

    eg

    "tom & jill" should become "tom &amp; jill"
    
    
    "tom &amp; jill" should remain "tom &amp; jill"
    
    
    "tom &euro; jill" should remain "tom &euro; jill"
    
    
    "tom <&> jill" should become "tom <&amp;> jill"
    
    
    "tom &quot;&&quot; jill" should become "tom &quot;&amp;&quot; jill"
    
    • tripleee
      tripleee over 12 years
      Do you have a finite set of entity codes you want to avoid double-encoding? If not, how should &stackoverflow; be treated, should it be preserved or changed to &amp;stackoverflow;?
    • user772401
      user772401
      If this is for HTTP/HTML, why not use the OOTB stuff in the BCL instead of regular expressions?
  • Stephan
    Stephan over 12 years
    This approach would fail for the 4th of the examples above.
  • Stephan
    Stephan over 12 years
    @Jodrell: Because re-encoding the string would transform < and > into &lt; and &gt;, which is what the OP does not want.
  • Jodrell
    Jodrell over 12 years
    @Stephan, you are right, I hadn't seen the edit. The OP doesn't want HTML Encoding, it must be some other encoding where < is ignored. With the right decoder\encoder the method would work.
  • darkAsPitch
    darkAsPitch over 12 years
    ah. yes, that is a bit of a bummer with the <. Well, then OP will just have to write an encoder - that should be easy. The decoder can still be used (I checked that).
  • ridgerunner
    ridgerunner over 12 years
      Very clean and works 99.99%. Nice. But will miss the & in strings such as: "http://example.com/path?var1=val1&var2=val2_with_a_;_semicolon_in_it".
  • Petras
    Petras over 12 years
    Exactly what we needed - this does the job.
  • Andrew
    Andrew over 7 years
    All these years and nobody noted the typo? You used HtmlUtility in three places instead of HttpUtility. :)