ISO-8859-1 to UTF8 in ASP.NET 2


Solution 1

You have this line of code:-

String sSentSearchText = postedValues[i];

The decoding of the octets in the post has already happened by the time this line runs.

The problem is that META http-equiv doesn't tell the server about the encoding.
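To illustrate why the value arrives damaged, here is a minimal console sketch (the class name LossyDecodeDemo is made up for the example). It assumes the browser posted the ISO-8859-1 octets for Merkblätter and that the server decoded them as UTF-8, which is what the symptoms in the question suggest.

```csharp
using System;
using System.Text;

class LossyDecodeDemo
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // The octets a browser sends for "Merkblätter" from an ISO-8859-1 page:
        // 'ä' is the single byte 0xE4 (posted as %E4).
        byte[] posted = latin1.GetBytes("Merkbl\u00E4tter");

        // Decoded as UTF-8, 0xE4 followed by 't' is not a valid sequence,
        // so the decoder substitutes the replacement character U+FFFD.
        string wrong = Encoding.UTF8.GetString(posted);

        // Decoded with the encoding the browser actually used.
        string right = latin1.GetString(posted);

        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);  // was a replacement character inserted?
        Console.WriteLine(right == "Merkbl\u00E4tter");   // did the text survive intact?

        // Re-encoding the damaged string cannot recover the original byte:
        // index 6 now holds whatever U+FFFD maps to, not 0xE4 (228).
        byte[] reencoded = latin1.GetBytes(wrong);
        Console.WriteLine(reencoded[6]);
    }
}
```

Once the replacement character is in the string, the original byte is gone, which is why converting between encodings after reading Request.Form cannot repair the value.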

You might try adding RequestEncoding="ISO-8859-1" to the @Page directive and stop fiddling with the decoding yourself (since it has already happened).

However, that doesn't help either. It seems you can only specify the request encoding in web.config.

Better would be to stop using ISO-8859-1 altogether and leave it with the default UTF-8 encoding. I can see no gain and only pain with using a restrictive encoding.

Edit

If changing the posting form's encoding is not a possibility, then we seem to be left with no alternative but to handle the decoding ourselves. To that end, include these two static methods in your receiving code-behind:-

// Requires: using System.Collections.Specialized; using System.Text; using System.Web;

// Reads the raw request body and decodes it with the given encoding.
private static NameValueCollection GetEncodedForm(System.IO.Stream stream, Encoding encoding)
{
    // A url-encoded body is plain ASCII; the %xx escapes are resolved below.
    System.IO.StreamReader reader = new System.IO.StreamReader(stream, Encoding.ASCII);
    return GetEncodedForm(reader.ReadToEnd(), encoding);
}

// Splits a url-encoded string into name/value pairs, decoding each with the given encoding.
private static NameValueCollection GetEncodedForm(string urlEncoded, Encoding encoding)
{
    NameValueCollection form = new NameValueCollection();
    string[] pairs = urlEncoded.Split("&".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

    foreach (string pair in pairs)
    {
        // Keep empty entries here so that "=value" and "name=" are not misread.
        string[] pairItems = pair.Split("=".ToCharArray(), 2, StringSplitOptions.None);
        string name = HttpUtility.UrlDecode(pairItems[0], encoding);
        string value = (pairItems.Length > 1) ? HttpUtility.UrlDecode(pairItems[1], encoding) : null;
        form.Add(name, value);
    }
    return form;
}

Now instead of assigning:-

postedValues = Request.Form;

use:-

postedValues = GetEncodedForm(Request.InputStream, Encoding.GetEncoding("ISO-8859-1"));

You can now remove the encoding malarkey from the rest of the code.

Solution 2

I think adding your encoding to web.config like this will probably solve your problem:

<configuration>
   <system.web>
      <globalization
           fileEncoding="iso-8859-1"
           requestEncoding="iso-8859-1"
           responseEncoding="iso-8859-1"
           culture="en-US"
           uiCulture="en-US"
        />
   </system.web>
</configuration>
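If only the one receiving page gets ISO-8859-1 posts, the setting can probably be scoped with a location element so the rest of the application keeps its default encoding. This is a sketch rather than a tested configuration; the path SearchResults.aspx is taken from the form's action attribute in the question, and the culture attributes are left out on the assumption that they are not needed for decoding.

```xml
<configuration>
   <location path="SearchResults.aspx">
      <system.web>
         <!-- Applies only to requests for SearchResults.aspx -->
         <globalization requestEncoding="iso-8859-1" />
      </system.web>
   </location>
</configuration>
```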

Solution 3

We had the same problem that you have. The topic is not straightforward at all.

The first tip is to set the Response encoding of the page that posts the data (usually the same page as the one that receives the data in .NET) to the desired form post encoding.

However, this is just a hint to the user's browser on how to interpret the characters sent from the server. The user might choose to override the encoding manually. And, if the user overrides the encoding of the page, the encoding of the data sent in the form is also changed (to whatever the user has set the encoding to).

There is a small trick, though. If you add a hidden field with the name _charset_ (notice the underscores) in your form, most browsers will fill out this form field with the name of the charset used when posting the form. This form field is also a part of the HTML5 specification.
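As a sketch, the form from the question with such a field added might look like this. Only the hidden input is new; the value is left empty on the assumption that the browser fills it in on submit, and the exact charset label it writes may differ between browsers.

```html
<form name="advancedform" method="post" action="SearchResults.aspx">
    <!-- Filled in by the browser with the charset used to encode the post -->
    <input type="hidden" name="_charset_" />
    <input class="field" name="SearchTextBox" type="text" />
    <input class="button" name="search" type="submit" value="Search &gt;" />
</form>
```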

So, you might think you're good to go. However, by the time your page code runs, ASP.NET has already URL-decoded all the parameters sent in the form. So when you finally have the value of the _charset_ field, the value of the field containing Merkblätter has already been decoded incorrectly by .NET.

You have two options:

  1. In the ASP.NET page in question, perform the parsing of the request string manually
  2. In Application_BeginRequest, in Global.asax, parse the request parameters manually, extracting the _charset_ field. When you get the value, set Request.ContentEncoding to System.Text.Encoding.GetEncoding(<value of _charset_ field>). If you do this, you can read the value of the field containing Merkblätter as usual, no matter what charset the client sends the value in.

In either of the cases above, you need to read Request.InputStream manually to fetch the form data. I would recommend setting the response encoding to UTF-8, which gives the widest range of accepted characters, and then handling the special case where the user has overridden the charset as described above.
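The second option could be sketched roughly as follows. This is an untested outline, not a verified implementation. It assumes the posting form contains a _charset_ field, that the body is application/x-www-form-urlencoded, and that setting Request.ContentEncoding this early is honoured when Request.Form is read later, as described above. The class name Global is the conventional one for a Global.asax code-behind.

```csharp
using System;
using System.IO;
using System.Text;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        HttpRequest request = Request;
        if (request.HttpMethod != "POST" || request.ContentType == null ||
            !request.ContentType.StartsWith("application/x-www-form-urlencoded",
                                            StringComparison.OrdinalIgnoreCase))
        {
            return;
        }

        // A url-encoded body is plain ASCII. The reader is deliberately not
        // disposed, because disposing it would close the request stream.
        Stream body = request.InputStream;
        StreamReader reader = new StreamReader(body, Encoding.ASCII);
        string raw = reader.ReadToEnd();
        body.Position = 0; // rewind so later readers see the whole body

        foreach (string pair in raw.Split('&'))
        {
            if (!pair.StartsWith("_charset_=", StringComparison.Ordinal))
            {
                continue;
            }

            string charset = HttpUtility.UrlDecode(
                pair.Substring("_charset_=".Length), Encoding.ASCII);
            try
            {
                request.ContentEncoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown or empty charset name: keep the configured default.
            }
            break;
        }
    }
}
```

After this runs, page code would read Request.Form as usual. It is worth checking the behaviour with a real post from each browser you need to support.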

Solution 4

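' Value of the MSXML option constant; VBScript does not define it.
Const SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS = 13056

Function urlDecode(input)
 Dim inp, conn
 inp = Replace(input, "/", "%2F")
 Set conn = Server.CreateObject("MSXML2.ServerXMLHTTP")
 ' Option 2 = SXH_OPTION_IGNORE_SERVER_SSL_CERT_ERROR_FLAGS
 conn.setOption 2, SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
 conn.open "GET", "http://www.neoturk.net/urldecode.asp?url=" & inp, False
 conn.send ""
 urlDecode = conn.ResponseText
 Set conn = Nothing
End Function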

To speed this up, create a table in your database for decoded and encoded URLs, load it in the Application_OnStart section of global.asa, and store it on the Application object. Then add a lookup to the function above: if the decoded URL is not yet in the Application array, request it once from the remote page (tip: urldecode.asp should be on a different server, see: http://support.microsoft.com/default.aspx?scid=kb;en-us;Q316451), insert it into your database, and append it to the Application array; otherwise return the cached value from the Application object.

This is the best method I have found so far.

You can see above method successfully working at: lastiktestleri.com/Home

I also used HeliconTech's ISAPI_Rewrite (Lite version). Usage is simple: url = Request.ServerVariables("HTTP_X_REWRITE_URL") returns the exact URL that was redirected to /404.asp.

Author by

Gordon Thompson


Updated on June 04, 2022

Comments

  • Gordon Thompson
    Gordon Thompson almost 2 years

    We've got a page which posts data to our ASP.NET app in ISO-8859-1

    <head>
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
        <title>Sample Search Invoker</title>
    </head>
    <body>
    
    <form name="advancedform" method="post" action="SearchResults.aspx">
        <input class="field" name="SearchTextBox" type="text" />
        <input class="button" name="search" type="submit" value="Search &gt;" />
    </form>
    

    and in the code behind (SearchResults.aspx.cs)

    System.Collections.Specialized.NameValueCollection postedValues = Request.Form;
    String nextKey;
    for (int i = 0; i < postedValues.AllKeys.Length; i++)
    {
        nextKey = postedValues.AllKeys[i];
    
        if (nextKey.Substring(0, 2) != "__")
        {
            // Get basic search text
            if (nextKey.EndsWith(XAEConstants.CONTROL_SearchTextBox))
            {
                // Get search text value
                String sSentSearchText = postedValues[i];
    
                System.Text.Encoding iso88591 = System.Text.Encoding.GetEncoding("iso-8859-1");
                System.Text.Encoding utf8 = System.Text.Encoding.UTF8;
    
                byte[] abInput = iso88591.GetBytes(sSentSearchText);
    
                sSentSearchText = utf8.GetString(System.Text.Encoding.Convert(iso88591, utf8, abInput));
    
                this.SearchText = sSentSearchText.Replace('<', ' ').Replace('>',' ');
                this.PreviousSearchText.Value = this.SearchText;
            }
        }
    }
    

    When we pass through Merkblätter, it gets pulled out of postedValues[i] as Merkbl�tter. The raw string is Merkbl%ufffdtter.

    Any ideas?

  • AnthonyWJones
    AnthonyWJones almost 15 years
    "The form isn't posting the data as ISO-8859-1 at all." I don't think that is true; browsers use the Content-Type header of the received HTML to determine what encoding to use when posting the content of a form.
  • Gordon Thompson
    Gordon Thompson almost 15 years
    Hmm, how do I post the form as ISO-8859-1? Thanks for the comment on the Request.Form stuff, this is inherited code and it worked so I never looked into fixing it..
  • Gordon Thompson
    Gordon Thompson almost 15 years
    yeh, that is an option i had considered but there are other issues with doing that unfortunately...
  • Gordon Thompson
    Gordon Thompson almost 15 years
    setting the input page to be UTF-8 would be my ideal option; the form is embedded in a customer site however and they don't seem to want to change the encoding to UTF-8 so I'm investigating alternatives. Why is encoding such a ballache, i'd happily hunt down and have stern words with the people who came up with this mess if i had the resources :-)
  • AnthonyWJones
    AnthonyWJones almost 15 years
    Encoding isn't a problem in ASP.NET; it's very simple. Leave encoding alone, don't touch it, the default UTF-8 works fine.
  • Gordon Thompson
    Gordon Thompson almost 15 years
    in an ideal world i would be using UTF-8 but alas it's not that easy in this app....
  • Guffa
    Guffa almost 15 years
    Use accept-charset="ISO-8859-1" in the form tag to specify the encoding.
  • AnthonyWJones
    AnthonyWJones almost 15 years
    @Guffa: The problem is that the post is already going as ISO-8859-1; even with this explicit accept-charset attribute the server still doesn't know what the encoding of the incoming request is. The data is sent as application/x-www-form-urlencoded, which a) doesn't carry a charset (because it's application/* data) and b) the only sensible value would be US-ASCII because that's the encoding used in URL encoding.
  • AnthonyWJones
    AnthonyWJones almost 15 years
    It's what happens to the character octets during URL decoding where things are getting messed up. The server assumes that once the %xx byte values are resolved, the complete set of bytes for each name and value should be treated as UTF-8. The only place that this particular server behaviour can be modified is web.config (according to Canavar; I haven't checked that myself).
  • Guffa
    Guffa almost 15 years
    If the server is decoding the data as UTF-8, you should use that in your form: accept-charset="UTF-8".
  • David Ching
    David Ching about 10 years
    Why doesn't ASP.NET handle the _charset_ field automatically and not make us write Application_BeginRequest code? Here's a link for _charset_.
  • David Ching
    David Ching about 10 years
    @Guffa - Thanks, setting accept-charset="UTF-8" solved the issue for me!