Is a colon `:` safe for friendly-URL use?

104,068

Solution 1

I recently wrote a URL encoder, so this is pretty fresh in my mind.

http://site/gwturl#user:45/comments

All the characters in the fragment part (user:45/comments) are perfectly legal for RFC 3986 URIs.

The relevant parts of the ABNF:

fragment      = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="
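
As a rough illustration of the grammar above, here is a minimal Python sketch that checks a string against the `fragment` rule. The function name `is_valid_fragment` is made up for this example, and the regular expression is a hand translation of the ABNF, not something taken from the RFC.

```python
import re

# Hand-translated from the ABNF quoted above (an assumption of this sketch):
# unreserved / sub-delims / ":" / "@" / "/" / "?" as one character class,
# or a pct-encoded triplet "%" HEXDIG HEXDIG.
_FRAGMENT_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def is_valid_fragment(fragment: str) -> bool:
    """Return True if the text after '#' matches the RFC 3986 fragment rule."""
    return _FRAGMENT_RE.fullmatch(fragment) is not None


print(is_valid_fragment("user:45/comments"))    # True: ':' and '/' are allowed as-is
print(is_valid_fragment("user%3A45/comments"))  # True: the encoded form is also legal
print(is_valid_fragment("user 45"))             # False: a raw space is not allowed
```

Both the plain and the percent-encoded colon pass, which is why the grammar alone does not tell you what a given browser or library will emit.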

Apart from these restrictions, the fragment part has no defined structure beyond the one your application gives it. The scheme, http, only says that you don't send this part to the server.
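
To make "the structure your application gives it" concrete, here is a hedged Python sketch of the scheme from the question: split the fragment on `/`, then split each section on `:`. The name `parse_sections` is hypothetical, and decoding each section with `unquote` is an extra precaution added here so that a fragment that arrives as `user%3A45` still parses.

```python
from urllib.parse import urlsplit, unquote


def parse_sections(url):
    """Split the URL fragment into (section, attribute) pairs.

    Sections are separated by '/', and an optional attribute follows ':'.
    Each section is percent-decoded first, so 'user%3A45' and 'user:45'
    are treated the same way (an assumption of this sketch).
    """
    fragment = urlsplit(url).fragment
    sections = []
    for part in fragment.split("/"):
        if not part:
            continue  # ignore empty segments such as a trailing '/'
        name, sep, attr = unquote(part).partition(":")
        sections.append((name, attr if sep else None))
    return sections


print(parse_sections("http://site/gwturl#user:45/comments"))
# [('user', '45'), ('comments', None)]
print(parse_sections("http://site/gwturl#user%3A45/comments"))
# [('user', '45'), ('comments', None)]
```

Decoding before splitting on `:` means the application behaves the same whether or not some agent encoded the colon, at the cost of not being able to use a literal `%3A` inside a section name.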


EDIT:

D'oh!

Despite my assertions about the URI spec, irreputable provides the correct answer when he points out that the HTML 4 spec restricts element names/identifiers.

Note that identifier rules are changing in HTML 5. URI restrictions will still apply (at time of writing, there are some unresolved issues around HTML 5's use of URIs).

Solution 2

MediaWiki and other wiki engines use colons in their URLs to designate namespaces, with apparently no major problems.

e.g. http://en.wikipedia.org/wiki/Template:Welcome

Solution 3

In addition to McDowell's analysis of the URI standard, remember also that the fragment must be a valid HTML anchor name. According to http://www.w3.org/TR/html4/types.html#type-name:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

So you are in luck: ":" is explicitly allowed. And nobody should "%"-escape it, not only because "%" is an illegal character there, but also because the fragment must match the anchor name character by character, so no agent should tamper with it in any way.
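
The quoted HTML 4 rule can be written as a regular expression. This is a minimal Python sketch; the name `is_html4_name` is invented for the example and the pattern is only my reading of the sentence quoted above.

```python
import re

# A letter, then any number of letters, digits, '-', '_', ':' and '.'
# (transcribed from the HTML 4 text quoted above).
_HTML4_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")


def is_html4_name(token: str) -> bool:
    """Return True if token is a valid HTML 4 ID/NAME token."""
    return _HTML4_NAME_RE.fullmatch(token) is not None


print(is_html4_name("user:45"))           # True: ':' is in the allowed set
print(is_html4_name("user%3A45"))         # False: '%' is not allowed
print(is_html4_name("user:45/comments"))  # False: '/' is not in the quoted list
```

Note that under this reading the colon passes but the slash does not, so the rule only matters if the fragment actually has to match an element's ID or NAME.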

However, you have to test it. Web standards are not strictly followed, and sometimes the standards conflict. For example, HTTP/1.1 (RFC 2616) does not allow a query string in the request URL, while HTML constructs one when submitting a form with the GET method. Whatever is implemented in the real world wins at the end of the day.

Solution 4

I wouldn't count on it. It'll likely get URL-encoded as %3A by many user-agents.
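
Browsers aside, general-purpose encoders in libraries can do this too. As one example (Python's standard library, chosen only for illustration), `urllib.parse.quote` encodes the colon unless you list it as safe:

```python
from urllib.parse import quote, unquote

fragment = "user:45/comments"

# By default only '/' is treated as safe, so ':' becomes %3A.
encoded_default = quote(fragment)
# Passing ':' in the safe set leaves the fragment untouched.
encoded_keep_colon = quote(fragment, safe="/:")
# Decoding recovers the original either way.
decoded = unquote(encoded_default)

print(encoded_default)     # user%3A45/comments
print(encoded_keep_colon)  # user:45/comments
print(decoded)             # user:45/comments
```

So if any part of your own toolchain builds these URLs with a generic encoder, check which characters it considers safe, and decode before parsing on the receiving side.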

Solution 5

Google also uses colons.

In this specification, they use colons for the custom method names.

Author: Nicole

Updated on July 08, 2022

Comments

  • Nicole
    Nicole almost 2 years

    We are designing a URL system that will specify application sections as words separated by slashes. Specifically, this is in GWT, so the relevant parts of the URL will be in the hash (which will be interpreted by a controller layer on the client-side):

    http://site/gwturl#section1/section2
    

    Some sections may need additional attributes, which we'd like to specify with a :, so that the section parts of the URL are unambiguous. The code would split first on /, then on :, like this:

    http://site/gwturl#user:45/comments
    

    Of course, we are doing this for URL-friendliness, so we'd like to make sure that none of the characters that hold special meaning will be URL-encoded by browsers, or any other system, leaving us with a URL like this:

    http://site/gwturl#user%3A45/comments <--- BAD
    

    Is using the colon in this way safe (by which I mean it won't be automatically encoded) for browsers, bookmarking systems, even JavaScript or Java code?

  • Asaph
    Asaph over 14 years
    @arbales: Yes. Some less compliant user-agents will leave non-compliant urls unadorned.
  • Veger
    Veger over 14 years
    Opera also keeps the colon, but counting on such behavior is not a good thing to do.
  • Gumbo
    Gumbo over 14 years
    Renesis is talking about the URL fragment and not the URL path.
  • Nicole
    Nicole over 14 years
    Wikipedia was one of my thoughts when writing this question. Is its use of colons technically invalid/unsafe then? I commonly see ( and ) in Wikipedia URLs encoded, but never the colon, which left me a bit confused.
  • Nicole
    Nicole over 14 years
    I think you are on to something, can you explain this a little further? Not sending this to the server is not an issue, as we are using GWT. I'm just not sure I'm clear on the syntax specified by the section you quoted.
  • Amit Patil
    Amit Patil over 14 years
    But : is a gen-delim, not a sub-delim.
  • Veger
    Veger over 14 years
    The colon is legal for a pchar, so whether it is in sub-delims or gen-delims is not an issue.
  • McDowell
    McDowell over 14 years
    @bobince - : is in pchar, which is in fragment, so : is allowed. @Renesis - Wikipedia has an article on ABNF en.wikipedia.org/wiki/ABNF You are basically looking at a list of allowed characters, where / means OR. I haven't done any GWT programming, so I don't know how it uses the fragment part of URIs.
  • Nicole
    Nicole over 14 years
    One last question -- do you have any insight into the real-world application of this specification? Does this mean browsers should/will ignore (skip the encoding of) the : in the fragment?
  • barrowc
    barrowc over 14 years
    The Wayback Machine has a : in many of its links - e.g. web.archive.org/web/20080822150704/http://stackoverflow.com
  • Noon Silk
    Noon Silk over 14 years
    It's important that people realise this is the correct answer; everyone else is saying it isn't valid, but it is after the '#' symbol, so it is.
  • McDowell
    McDowell over 14 years
    @Renesis - I had forgotten about the HTML 4 limitations - see this answer: stackoverflow.com/questions/2053132/…
  • Adam Lindberg
    Adam Lindberg over 13 years
    That page doesn't motivate why they're not safe. The referenced RFC2396 does not say it should be escaped either. Also, the converter script provided does not encode it (in Chrome 9 anyway).
  • Steven Collins
    Steven Collins over 9 years
    Most relevant answer. We all know that what's in the specs has little to do with reality in web development. You're not going to get a much better guarantee of "safety" than "one of the top 10 websites in the world does it".
  • ktamlyn
    ktamlyn almost 6 years
    Adam, you are incorrect. It directly states what and why.
  • Martin James
    Martin James about 5 years
    @StevenCollins No more relevant than the answer given 3 years prior to this one that states exactly the same thing :)
  • CanadianGirl827x
    CanadianGirl827x almost 3 years
    This is an excellent answer. I upvoted it, but still wanted to stop in and let you know how much I like everything about it.
  • grmdgs
    grmdgs over 2 years
    Explanation from the article of why colons should be escaped (seems to be a style argument): "URLs use some characters for special use in defining their syntax. When these characters are not used in their special role inside a URL, they need to be encoded."