application/x-www-form-urlencoded and charset="utf-8"?

forms http post encoding utf-8

77,112

Solution 1

There is no charset parameter defined for this media type.
For the encoding guidelines, see https://url.spec.whatwg.org/#application/x-www-form-urlencoded .

The application/x-www-form-urlencoded standard implies UTF-8 and percent-encoding.

However, the same spec adds this caveat:

A legacy server-oriented implementation might have to support encodings other than UTF-8 as well as have special logic for tuples of which the name is _charset_. Such logic is not described here as only UTF-8 is conforming.
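As a rough illustration of what "implies UTF-8 and percent-encoding" means in practice, here is a minimal Python sketch using the standard library's urllib.parse. The field value "José" is a made-up example, and the latin-1 decode at the end only simulates what a legacy server that guesses the wrong encoding might produce; it is not taken from the spec.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form data; the value is chosen only because it contains a non-ASCII character.
fields = {"name": "José"}

# urlencode() encodes each value as UTF-8 by default, then percent-escapes the bytes.
body = urlencode(fields)
print(body)

# parse_qs() reverses the process, also assuming UTF-8 by default.
decoded = parse_qs(body)
print(decoded)

# Simulated legacy server: decoding the same bytes as Latin-1 garbles the value.
legacy = parse_qs(body, encoding="latin-1")
print(legacy)
```

Because no charset parameter travels with the request, the sender and receiver simply have to agree on UTF-8; the last step shows what happens when they do not.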

Solution 2

Note that step 2 of the encoding algorithm says: "Otherwise, let the selected character encoding be UTF-8." (see: http://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm)

I also believe the following HTML 4 note indicates that using UTF-8 is a best practice for user agents:

http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars

Here's what it says: B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:

...

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

This procedure results in a syntactically legal URI (as defined in [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of the character encoding to which the HTML document carrying the URI may have been transcoded.

Note. Some older user agents trivially process URIs in HTML using the bytes of the character encoding in which the document was received. Some older HTML documents rely on this practice and break when transcoded. User agents that want to handle these older documents should, on receiving a URI containing characters outside the legal set, first use the conversion based on UTF-8. Only if the resulting URI does not resolve should they try constructing a URI based on the bytes of the character encoding in which the document was received.

Note. The same conversion based on UTF-8 should be applied to values of the name attribute for the A element.
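The two-step convention quoted above (represent each non-ASCII character as UTF-8 bytes, then escape each byte as %HH) can be sketched in a few lines of Python. The function name escape_non_ascii and the example.com URL are hypothetical, and this sketch deliberately leaves ASCII characters untouched, which is one reading of the note rather than a full URI encoder.

```python
def escape_non_ascii(uri: str) -> str:
    """Escape only the non-ASCII characters of a URI-like string."""
    out = []
    # Step 1: represent the text as UTF-8 bytes.
    for byte in uri.encode("utf-8"):
        if byte < 0x80:
            # ASCII bytes are passed through unchanged in this sketch.
            out.append(chr(byte))
        else:
            # Step 2: escape every non-ASCII byte as %HH (uppercase hex).
            out.append(f"%{byte:02X}")
    return "".join(out)


# Hypothetical href value containing one non-ASCII character.
print(escape_non_ascii("http://example.com/caf\u00e9"))
```

The result depends only on the characters themselves, not on the encoding the HTML document was served in, which is the point the quoted passage makes.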

Author by

ErikR

Updated on September 04, 2020

Is it customary to omit ;charset="utf-8" when the Content-type is application/x-www-form-urlencoded?

In particular, when using accept-charset="utf-8" in a form tag, I would expect some indication that utf-8 is being used in the headers, but I'm not seeing any.

Here is my simple test in Chrome. The form page is:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
</head>
<body>
<form method="POST" action="printenv.cgi" accept-charset="utf-8">
Your name:
<input name="name" type="text" size="30">
</form>
</body>
</html>

And the headers for the generated request are:

POST /printenv.cgi HTTP/1.1
Host: ...:8000
Connection: keep-alive
Content-Length: 19
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Origin: http://...:8000
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Referer: http://...:8000/utf8-test.html
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8

What's the convention for specifying how the form parameter values are encoded?