Do I really need to encode '&' as '&amp;'?

validation html utf-8 character-encoding

392,501

Solution 1

Yes. Just as the error said, in HTML, attributes are #PCDATA, meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and, if not for lenient browsers and the fact that this is HTML rather than XHTML, would break the parsing. Just escape it as &amp; and everything will be fine.

HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.

Keep this point in mind: if you're not escaping & to &amp;, that is bad enough for data that you create (where the code could very well be invalid), but you might also not be escaping tag delimiters. That is a huge problem for user-submitted data and could very well lead to HTML and script injection, cookie stealing and other exploits.

Please just escape your code. It will save you a lot of trouble in the future.
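
As a minimal sketch of "just escape everything", assuming the page is generated from Python (the question does not say what the site is built with), the standard library's html.escape replaces &, < and > with character references, and quotes too by default. The helper name render_title and the sample strings are invented for illustration.

```python
import html


def render_title(text: str) -> str:
    """Return a <title> element with its text escaped for HTML.

    Hypothetical helper for illustration; not from the original answer.
    """
    # html.escape turns & into &amp;, < into &lt; and > into &gt;.
    # With quote=True (the default) it also escapes " and '.
    return "<title>" + html.escape(text) + "</title>"


safe = render_title("Dolce & Gabbana")
hostile = render_title("<script>alert(1)</script> & more")
print(safe)
print(hostile)
```

The same call covers both concerns raised above: the lone ampersand in hand-written text and the tag delimiters in user-submitted text.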

Solution 2

Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.

Encoding & as &amp; under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.

Compare the following: which is easier? Which is easier to bugger up?

Methodology 1

1. Write some content which includes ampersand characters.
2. Encode them all.

Methodology 2

(with a grain of salt, please ;) )

1. Write some content which includes ampersand characters.
2. On a case-by-case basis, look at each ampersand. Determine if:
   - It is isolated, and as such unambiguously an ampersand, e.g. volt & amp. In that case, don't bother encoding it.
   - It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve, e.g. amp&volt. In that case, don't bother encoding it.
   - It is not isolated, and ambiguous, e.g. volt&amp. Encode it.
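
The three cases can also be checked mechanically. As a rough sketch, assuming Python (the answer names no language): html.unescape is documented to apply the HTML5 rules for character references, as they apply to text content rather than attribute values, so it shows which unescaped ampersands survive decoding. The fourth sample is the fully escaped form of the third, added for comparison.

```python
import html

samples = ["volt & amp", "amp&volt", "volt&amp", "volt&amp;amp"]

# Decode each sample the way an HTML5 parser would decode text content.
decoded = {s: html.unescape(s) for s in samples}

for source, text in decoded.items():
    status = "unchanged" if source == text else "CHANGED"
    print(f"{source!r} -> {text!r} ({status})")
```

The first two samples come through unchanged, while the unescaped volt&amp loses its "amp" to the legacy semicolon-less reference. Only the escaped form decodes to the text that was intended.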

Solution 3

HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "&copy=2" is still a problem, for example, since &copy; is the copyright symbol (©).

However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.
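
For the query-string case, one way to avoid deciding per ampersand is to build the URL first and escape the whole attribute value afterwards. This is only a sketch, assuming Python; the path /search and the parameter names are invented for the example.

```python
import html
from urllib.parse import urlencode

# Hypothetical parameters; "copy" is chosen because &copy looks like a
# character reference when it follows an ampersand.
params = {"q": "foo", "copy": "2"}

url = "/search?" + urlencode(params)   # raw URL, containing a bare &
href = html.escape(url, quote=True)    # safe inside a quoted attribute
link = f'<a href="{href}">results</a>'

print(url)
print(link)

# What HTML5 text decoding makes of the raw URL when it is not escaped:
print(html.unescape(url))
```

Escaping after the URL is assembled keeps the two encodings separate: urlencode handles URL syntax, html.escape handles HTML syntax, and decoding the escaped value gives back the original URL.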

Solution 4

I think this has turned into more of a question of "why follow the spec when browsers don't care." Here is my generalized answer:

Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.

Specifically, if HTML5 does not require using &amp; in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.

Solution 5

Could you show us what your title actually is? When I submit

<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>

to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...

Haroldo

Updated on November 11, 2021

Comments

Haroldo over 2 years

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.

http://validator.w3.org is giving me this:

& did not start a character reference. (& probably should have been escaped as &amp;.)

Do I really need to do &amp;?

I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.
- Matthew Wilson over 13 years
  
  The specs do not say so. The poster refers to HTML5 which does not require escaping of the ampersand in all scenarios.
- Richard JP Le Guen over 13 years
  
  This should be Community Wiki, as you're looking for opinions, and not being fussy about validation implies that there's no objective basis upon which to answer.
- Joachim Sauer over 13 years
  
  @Richard: really? While I don't agree that "validation doesn't matter", I see this as a very objective question: "does this break anything other than the spec?"
- Richard JP Le Guen over 13 years
  
  @Joachim Sauer - Your example is a good question... that's not what the question is though :P The exact words "I'm curious to hear people's opinions" even appear in the text!
- Joachim Sauer over 13 years
  
  @Richard: I disagree here. "Do I really need to do &amp;?" and "[...] I'm curious to hear people's opinions on this and if it's important and why." (emphasis mine). Those two indicate that he's interested in factual information, but knows that much of this is open to at least some interpretation, so he asks for multiple opinions.
- Richard JP Le Guen over 13 years
  
  @Joachim Sauer - This is true. I acknowledge the validity of your opinion... but stand by my own as well ;)
- unixman83 about 12 years
  
  @YiJiang Current web browsers go to great lengths to understand the user. And so does Google. It's part of the Spec. Future web-browsers may be less forgiving. So it's always a good idea to check how Wikipedia does it, and copy them.
- jontro almost 12 years
  
  When transforming XML to HTML with XSLT, it will not escape & as &amp; in attribute values.
- Kzqai over 10 years
  
  @unixman83 That is a good approach: see how wikipedia does it
- User about 10 years
  
  Google itself uses & in href urls. View source on google.com or plus.google.com I tend to like to follow the example of major players on these questionable subjects
- rnevius almost 10 years
  
  Here's the w3 spec
- Yash about 8 years
  
  Reserved characters in HTML must be replaced with character entities. Test Example on this URL: var element = document.evaluate('//table[@class="w3-table-all notranslate"]/tbody/tr[5]/td', window.document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null ).singleNodeValue; console.log('HTML:', element.innerHTML); var JS = (element.innerHTML).replace('&amp;', '&'); console.log(JS);
- doug65536 over 7 years
  
  The HTML spec says to accept crap input. Does that mean your site is "allowed" to be crap now? Close tags that need to be closed and escape things! Come on people.
- StackSlave over 4 years
  
  I personally escape &, if assigned via JavaScript element.innerHTML = '&amp;' or assigned to HTML directly, but leaving it unescaped is not going to cause HTML to be parsed incorrectly. What causes a problem is quotes and > and <. Assigning element.value = "This isn't a problem. '<' & '>' is okay too!" works as is; in markup, however, you would want to do <input type='text' value='This isn&apos;t a problem. &apos;&lt;&apos; & &apos;&gt;&apos; is okay too!' />. You don't have to self close that or do the &amp;. .innerHTML should be escaped like raw HTML. With JavaScript element.value = there is no need.
- RBT about 4 years
  
  Related post - What is &amp used for
Andreas Bonini over 13 years

No browser will ever "misinterpret" a & by itself. Every existing browser displays it as "&". Considering he explicitly asked for practical reason to do it, and that he stated that he doesn't care about validation..
Delan Azabani over 13 years

Yes. But morally, should we be relying on the leniency and "nice" error handling of browsers? Or should we just write correct code?
Andreas Bonini over 13 years

@Delan: while I try to make every page I write validate, I understand from reading his question that he doesn't care about "morally". He just cares if it works or not. They are two different philosophies and both have their pros and cons, and there is not a "correct" one. For example this website doesn't validate, and yet it's a great website.
Andreas Bonini over 13 years

Also, even if it was XHTML it wouldn't "break the parsing" unless the content type was set to application/xhtml+xml, which no one does because it's dumb that instead of gracefully handling an error the browser must quit. (That's why XHTML is being discontinued in favor of HTML 5)
Jon Hanna over 13 years

@Andreas, but browsers have enough bugs in how they interpret correct code, depending on them getting the right results when you send them meaningless markup is chancy. It may work today with that example, and then fail with the next example (say if the next example has a semi-colon somewhere after the &)
Andreas Bonini over 13 years

@Jon: I agree that it's in all cases better if your pages validate. I'm obviously not contesting that. The gray area is this: is it worth spending X hours of development time to make them validate, or is it better to take the slight risk that in the future, somehow, things may break? I personally think it's worth it, but I don't blame people who think it's not (such as Jeff Atwood) since it's such a gray area. One thing is certain: making pages validate costs money, and it's something important to consider.
Delan Azabani over 13 years

In this case, you are wrong. It doesn't take X hours or Y dollars to make it validate for this particular case. It's a simple case of preg_replace('/&/','&amp;',$code);
Matthew Wilson over 13 years

Everyone seems to be talking about HTML4, but the original question states that HTML5 is in use. HTML5 explicitly allows an unescaped & in this situation, unless what follows & would normally expand to an entity (e.g. &copy=2 is problematic but &x=2 is fine).
Gumbo over 13 years

@Andreas Bonini: You’re wrong. At least Firefox and Opera follow the rules and will interpret the following correctly: <a href="http://www.google.com/search?q=foo&sect=bar">foo&sect=bar</a>.
Jon Hanna over 13 years

Until you've spent the X hours of development time making them validate (X should really be < 1 in most cases) then you don't know why they aren't validating. If you've been paying even reasonable attention to the code in the meantime, then why do you suddenly have nonsense output? You're going to have to investigate to make sure you don't have a serious bug, and then it's 5secs to fix it anyway. One of the big advantages of keeping things valid is that things suddenly being invalid can rapidly flag a subtle bug that would be missed if everything output was gibberish.
igor over 13 years

Making pages validate doesn’t really cost any money at all—at least not if you’re creating new ones. Maintaining invalid ones if things break costs money.
Jon Hanna over 13 years

Gosh-darn it. I missed the HTML 5 bit in the question!
Andreas Bonini over 13 years

@Gumbo: I explicitly said a & by itself. In your example it's not by itself is it?
AakashM over 13 years

That's the HTML 4 spec you link to; from my reading of the (draft) HTML 5 spec, only ambiguous ampersands are disallowed. An ampersand followed by a space, for example, isn't ambiguous, and so (again by my reading) should be permitted - see my answer for markup that the HTML 5 validator accepts.
Gumbo over 13 years

The second case of amp&volt is ambiguous: Is &volt now an entity reference or not?
Matt over 13 years

I didn't downvote but, if I had to guess, I'd say you were downvoted because your answer (while intelligent) is a little bit of a mismatch with the question. He's not asking about escaping user input. He has control over the characters and is basically asking "If it does what I want, is it really important to follow the language spec to the letter?" I.e., he knows that there's a & because he put it in.
Joachim Sauer over 13 years

@Matt: I see, and that would be reasonable. I was just assuming that no one writes entirely static HTML pages any more and that pretty much all content is at least somewhat dynamic (usually based on some database content). Maybe that assumption should have been made explicit.
Gumbo over 13 years

@AakashM: I’m not sure, it sounded like that.
Alex Jasmin over 13 years

@Delan You say that HTML5 allows it unless it looks like a valid character reference. What do you mean by "looks like" exactly? Surely the standard is more precise than this.
Delan Azabani over 13 years

&copy=3 'looks' like a valid entity as &copy; is defined. According to HTML5, this kind of thing definitely should be escaped. &asldfj=4 does not look like a defined reference, so it doesn't need to be, but should be escaped anyway for reasons I've stated above in my answer.
Joe Dargie over 13 years

It’s like quoting attribute values — you don’t have to, but you can’t go wrong if you do it all the time.
kevinji about 13 years

Yes, HTML5 has a different parser than previous HTML and XHTML parsers, and allows unescaped ampersands in certain situations.
Mathias Bynens over 12 years

@Gumbo The ampersand in amp&volt is not an ambiguous ampersand (as per the definition in the HTML spec). See mathiasbynens.be/notes/ambiguous-ampersands and mothereff.in/ampersands#amp%26volt.
Mathias Bynens over 12 years

As far as these examples go, this is nothing new in HTML5. Both <title>Dolce & Gabbana</title> and <p>Dolce & Gabbana</p> are valid HTML 2.0.
Oriol about 11 years

document.write should be avoided. See the warning box in w3.org/html/wg/drafts/html/master/dom.html#document.write%28%29
Patrick M over 10 years

Good point about document.write(). But the overall point Alex is making about writing to the document from script stands, imo. +1
Mathias Bynens over 10 years

&copy=2 is not as big of a problem as you may think. In attribute values (e.g. the href attribute), the &copy won’t be considered as a character reference for ©. Outside an attribute value, it would.
refaelio almost 10 years

With that being said, generally speaking, you must remember that most of the "standard" ways are still in draft mode and may change in the future.
Palec almost 8 years

Read the top-voted answer. Attributes are #PCDATA and therefore parsed. Entities are handled there. In your example, the & starts an entity reference. After reading &qux, the parser finds no final semicolon (;), but runs into an equals sign (=), which cannot be part of an entity name. This should be a parse error, if the parser tried to be really strict (according to HTML 4). In HTML 5, entity parsing is overall more relaxed.
Demi over 7 years

I suspect that in general it is best to use ; as a separator in query strings (when you control the link) for that reason.
Carl Smith about 7 years

Given that an ampersand is normally preceded and followed by a space in English text, it's not difficult to remember or think about the rule I follow: If the ampersand is not touching another visible character, which is almost always, then it doesn't need encoding. Otherwise, just encode for simplicity's sake.
Ferrybig almost 6 years

Could you add a reference to the HTML5 rules?
Jacob C. over 4 years

@MathiasBynens By now (2019), the definition of an ambiguous ampersand seems to have changed a bit from the definition you quoted back in 2011 in mathiasbynens.be/notes/ambiguous-ampersands .
iJungleBoy over 2 years

I believe &copy= is never a problem, because Xml entities always have the structure &...; - they must end with a ; - otherwise it's not an Xml entity. Still I agree it's better to be safe than to over-optimize with risks.
Alec Jacobson about 2 years

why would it be better?