Why are URLs case-sensitive?


Solution 1

Why wouldn't the URL be case sensitive?

I understand that may look like a provocative (and "devil's advocate") type of rhetorical question, but I think it's useful to consider. The design of HTTP is that a "client", which we commonly call a "web browser", asks the "web server" for data.

Many different web servers have been released. Microsoft has shipped IIS with Windows Server operating systems (and others, including Windows XP Professional). Unix has heavyweights like nginx and Apache, not to mention smaller offerings like OpenBSD's internal httpd, or thttpd, or lighttpd. Additionally, many network-capable devices have built-in web servers that can be used to configure the device, including devices with purposes specific to networks, like routers (including many Wi-Fi access points and DSL modems) and other devices like printers or UPSs (battery-backed uninterruptible power supply units) which may have network connectivity.

So the question, "Why are URLs case-sensitive?", is really asking, "Why do web servers treat the URL as case sensitive?" And the actual answer is: they don't all do that. At least one fairly popular web server, Microsoft's IIS, is typically NOT case sensitive.

A key reason for the different behavior between web servers probably boils down to simplicity. The simple way to build a web server is to do things the same way the computer/device's operating system locates files, since web servers frequently locate a file in order to produce a response. Unix was designed around higher-end computers, and so Unix provided the desirable functionality of allowing both uppercase and lowercase letters. Unix decided to treat uppercase and lowercase as different because, well, they are different; that's the straightforward, natural thing to do. Windows has a history of being case-insensitive due to a desire to support already-created software, and this history goes back to DOS, which simply did not support lowercase letters, possibly in an effort to keep things simple on less powerful computers with less memory. Since these operating systems differ, the result is that simply-designed (early versions of) web servers reflect the same differences.
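
To see how a server can simply inherit the OS's behavior, here is a minimal Python sketch (my own illustration, not part of the original answer) that creates a file with a lowercase name and then looks it up in uppercase. A naive web server that hands URL paths straight to the OS behaves exactly the way this check does:

    import os
    import tempfile

    # Create "index.html", then probe it as "INDEX.HTML". On a
    # case-insensitive filesystem (Windows NTFS, macOS default) the
    # second lookup succeeds; on a case-sensitive one (typical Linux
    # ext4) it fails.
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "index.html"), "w") as f:
            f.write("hello")
        print("exact case:", os.path.exists(os.path.join(d, "index.html")))
        print("upper case:", os.path.exists(os.path.join(d, "INDEX.HTML")))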

Now, with all that background, here are some specific answers to the specific questions:

When URLs were first designed, why was case-sensitivity made a feature?

Why not? If all standard web servers were case-insensitive, that would indicate that the web servers were following a rule specified by the standard. There was simply no rule saying that case must be ignored, and there is no such rule because there was no reason for one. Why bother to make up unnecessary rules?

I ask this because it seems to me (i.e., a layperson) that case-insensitivity would be preferred to prevent needless errors and simplify an already complicated string of text.

URLs were designed for machines to process. Although a person can type a full URL into an address bar, that wasn't a major part of the intended design. The intended design is that people would follow ("click on") hyperlinks. If average laypeople are doing that, then they really don't care whether the invisible URL is simple or complicated.

Also, is there a real purpose/advantage to having a case-sensitive URL (as opposed to the vast majority of URLs that point to the same page no matter the capitalization)?

The fifth numbered point of William Hay's answer mentions one technical advantage: URLs can be an effective way for a web browser to send a bit of information to a web server, and more information can be included if there are fewer restrictions, so a case-sensitivity restriction would reduce how much information can be included.

However, in many cases there isn't an especially compelling benefit to case sensitivity, as demonstrated by the fact that IIS typically doesn't bother with it.

In summary, the most compelling reason is likely just simplicity for those who designed the web server software, particularly on a case-sensitive platform like Unix. (HTTP wasn't something that influenced the original design of Unix, since Unix is notably older than HTTP.)

Solution 2

URLs are not case-sensitive, only parts of them.
For example, nothing is case-sensitive in the URL https://google.com.

With reference to RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

First, from Wikipedia, a URL looks like:

 scheme:[//host[:port]][/]path[?query][#fragment]
 

(I've removed the user:password part because it is not interesting and rarely used)

schemes are case-insensitive

The host subcomponent is case-insensitive.

The path component contains data...

The query component contains non-hierarchical data...

Individual media types may define their own restrictions on or structures within the fragment identifier syntax for specifying different types of subsets, views, or external references

So, the scheme and host are case-insensitive.
The rest of the URL is case-sensitive.
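
To make that split concrete, here is a small Python sketch (my own, using a hypothetical normalize helper) that lowercases only the case-insensitive components and leaves the rest untouched:

    from urllib.parse import urlsplit, urlunsplit

    # Lowercase only the components RFC 3986 defines as case-insensitive
    # (scheme and host); keep path, query and fragment exactly as given.
    # Hypothetical helper; it ignores user:password, as above.
    def normalize(url: str) -> str:
        parts = urlsplit(url)
        return urlunsplit((
            parts.scheme.lower(),  # urlsplit already lowercases this
            parts.netloc.lower(),  # host (and :port, where lower() is a no-op)
            parts.path,            # case preserved
            parts.query,           # case preserved
            parts.fragment,        # case preserved
        ))

    print(normalize("HTTPS://Example.COM/Some/Path?Q=value#Frag"))
    # https://example.com/Some/Path?Q=value#Frag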

Why is the path case-sensitive?

This seems to be the main question.
It is difficult to answer "why" something was done if it was not documented, but we can make a very good guess.
I've picked very specific quotes from the spec, with emphasis on data.
Let's look at the URL again:

 scheme:[//host[:port]][/]path[?query][#fragment]
 \____________________/\________________________/
        Location                 Data
  • Location - The location has a canonical form, and is case-insensitive. Why? Probably so you could buy a domain name without having to buy thousands of variants.

  • Data - the data is used by the target server, and the application can choose what it means. It wouldn't make any sense to make data case-insensitive: the application should have more options, and defining case-insensitivity in the spec would limit those options.
    This is also a useful distinction for HTTPS: the data is encrypted, but the host is visible.

Is it useful?

Case-sensitivity has its pitfalls when it comes to caching and canonical URLs, but it is certainly useful. For example, it allows compact, case-sensitive identifiers (Base64-style tokens, short-URL codes) to be packed into the path or query, as the sketch below shows.
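
As a quick Python illustration (mine, not from the original answer): encode the same 30 bytes with a case-sensitive alphabet (base64) and a case-insensitive one (base32), and compare the lengths:

    import base64

    # base64 uses A-Z, a-z, 0-9, +, / (6 bits per character); base32
    # uses only A-Z, 2-7 (5 bits per character), so the same payload
    # needs more characters once case stops carrying information.
    payload = bytes(range(30))
    print(len(base64.b64encode(payload)))  # 40 characters
    print(len(base64.b32encode(payload)))  # 48 characters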

Solution 3

Simple. The OS is case sensitive. Web servers generally do not care unless they have to hit the file system at some point. That is where Linux and other Unix-based operating systems enforce the rules of the file system, of which case sensitivity is a major part. This is why IIS has never been case sensitive: Windows was never case sensitive.

[Update]

There have been some strong arguments in the comments (since deleted) about whether URLs have any relationship with the file system as I have stated. These arguments have become heated. It is extremely short-sighted to believe that there is not a relationship. There absolutely is! Let me explain further.

Application programmers are not generally systems-internals programmers. I am not being insulting; they are two separate disciplines, and systems-internals knowledge is not required to write applications when applications can simply make calls to the OS. Since application programmers are not systems-internals programmers, bypassing OS services is generally not an option. I say this because these are two separate camps and they rarely cross over. Applications are written to use OS services as a rule. There are some rare exceptions, of course.

Back when web servers began to appear, application developers did not attempt to bypass OS services. There were several reasons for this. One, it was not necessary. Two, application programmers generally did not know how to bypass OS services. Three, most OSes were either extremely stable and robust, or extremely simple and lightweight, making a bypass not worth the cost.

Keep in mind that early web servers ran either on expensive computers such as DEC VAX/VMS servers and the Unix systems of the day (Berkeley and Ultrix, among others) on mainframe or midrange computers, or, soon after, on lightweight computers such as PCs running Windows 3.1. When more modern search engines began to appear, such as Google in 1997/8, Windows had moved on to Windows NT, and other OSes such as Novell and Linux had also begun to run web servers. Apache was the dominant web server, though there were others, such as IIS and O'Reilly, which were also very popular. None of them at the time bypassed OS services. It is likely that none of the web servers do even today.

Early web servers were quite simple. They still are today. Any HTTP request for a resource that exists on a hard drive was, and still is, satisfied by the web server through the OS file system.

File systems are rather simple mechanisms. When a request is made for access to a file, if that file exists, the request is passed to the authorization sub-system; if granted, the original request is satisfied. If the resource does not exist or is not authorized, an exception is thrown by the system. When an application makes a request, a trigger is set and the application waits. When the request is answered, the trigger is thrown and the application processes the response. It still works that way today. If the application sees that the request has been satisfied, it continues; if it has failed, the application executes an error condition within its code, or dies if the error is not handled. Simple.

In the case of a web server, assuming a URL request for a path/file is made, the web server takes the path/file portion of the URL (the URI), makes a request to the file system, and the request either is satisfied or throws an exception. The web server then processes the response. If, for example, the requested path and file are found and access is granted by the authorization sub-system, then the web server processes that I/O request as normal. If the file system throws an exception, then the web server returns a 404 error if the file is Not Found, or a 403 Forbidden if the reason code is unauthorized.

Since some OSes are case sensitive and file systems of this type require exact matches, the path/file that is requested of the web server must match what exists on the hard drive exactly. The reason for this is simple. Web servers do not guess what you mean. No computer does so without being programmed to. Web servers simply process requests as they receive them. If the path/file portion of the URL request being passed directly to the file system does not match what is on the hard drive, then the file system throws an exception and the web server returns a 404 Not Found error.
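
Here is that whole loop as a deliberately naive Python sketch (my own toy, nothing like production code - it even permits path traversal), showing the exception-to-status-code translation described above:

    import http.server

    class NaiveHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                # Hand the path portion of the URL to the OS as-is; on a
                # case-sensitive filesystem, /Index.html and /index.html
                # are simply two different requests.
                with open("." + self.path, "rb") as f:
                    body = f.read()
            except FileNotFoundError:
                self.send_error(404)  # file system: no such file
            except PermissionError:
                self.send_error(403)  # authorization sub-system said no
            else:
                self.send_response(200)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

    # http.server.HTTPServer(("", 8000), NaiveHandler).serve_forever()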

It is really that simple, folks. It is not rocket science. There is an absolute relationship between the path/file portion of a URL and the file system.

Solution 4

  1. URLs claim to be a UNIFORM Resource Locator and can point to resources that predate the web. Some of these are case sensitive (e.g. many FTP servers), and URLs need to be able to represent those resources in a reasonably intuitive fashion.

  2. Case insensitivity requires more work when looking for a match (either in the OS or above it).

  3. If you define URLs as case sensitive individual servers can implement them as case insensitive if they want. The reverse is not true.

  4. Case insensitivity can be non-trivial in international contexts (see the sketch after this list): https://en.wikipedia.org/wiki/Dotted_and_dotless_I . Also, RFC 1738 allowed for the use of characters outside the ASCII range provided they were encoded, but didn't specify a charset. This is fairly important for something calling itself the WORLD wide web. Defining URLs as case insensitive would open up a lot of scope for bugs.

  5. If you are trying to pack a lot of data into a URI (e.g. a data URI), you can pack more in if upper and lower case are distinct.
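
To make point 4 concrete, here is a short Python sketch (my own) of what locale-blind Unicode case mapping does to the dotted and dotless I:

    # Default Unicode case mapping, with no locale information --
    # which is all a generic URL processor would have:
    print("I".lower())       # 'i'  (wrong in Turkish, where I pairs with ı)
    print("İ".lower())       # 'i' plus U+0307 (combining dot above)
    print(len("İ".lower()))  # 2 -- lowercasing changed the string length
    print("ı".upper())       # 'I' -- the round trip loses the distinction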

Solution 5

I stole from the blog The Old New Thing the habit of approaching questions of the form "why is something the case?" with the counter-question "what would the world be like if it were not the case?"

Say I set up a web server to serve my document files from a folder so I could read them on my phone when I was out of the office. Now, in my documents folder, I have three files: todo.txt, ToDo.txt and TODO.TXT (I know, but it made sense to me when I made the files).

What URL would I like to be able to use, to access these files? I would like to access them in an intuitive way, using http://www.example.com/docs/filename.

Say I have a script which lets me add a contact to my addressbook, which I can also do over the web. How should that take its parameters? Well, I'd like to use it like: http://www.example.com/addcontact.php?name=Tom McHenry von der O'Reilly. But if there were no way for me to specify the name by case, how would I do that?

How would I differentiate the wiki pages for Cat and CAT, Text and TEXT, latex and LaTeX? Disambig pages, I guess, but I prefer just getting the thing I asked for.

But all that feels like it's answering the wrong question, anyway.

The question I think you were really asking is "Why do web servers 404 you just for a case difference, when they are computers, designed to make life simpler, and they are perfectly capable of finding at least the most obvious case-variations in the URL I typed that would work?"

The answer to which is that while some sites have done this (and better, they check for other typos too), nobody's thought it worthwhile to change a webserver's default 404 error page to do that... but maybe they should?




Comments

  • Kyle
    Kyle almost 2 years

    My question: When URLs were first designed, why was case-sensitivity made a feature? I ask this because it seems to me (i.e., a layperson) that case-insensitivity would be preferred to prevent needless errors and simplify an already complicated string of text.

    Also, is there a real purpose/advantage to having a case-sensitive URL (as opposed to the vast majority of URLs that point to the same page no matter the capitalization)?

    Wikipedia, for example, is a website that is sensitive to letter case (except for the first character):

    https://en.wikipedia.org/wiki/StAck_Exchange is DOA.

    • John Conde
      John Conde over 8 years
      You obviously don't run IIS on Windows
    • paj28
      paj28 over 8 years
      There are workarounds, e.g. mod_speling
    • Eric Towers
      Eric Towers over 8 years
      I imagine that itscrap.com, expertsexchange, and whorepresents.com would prefer that more people used case-sensitive names. For more, see boredpanda.com/worst-domain-names .
    • CodesInChaos
      CodesInChaos over 8 years
      Since wikipedia isn't backed by a file system, the designers could have easily chosen to use case insensitive urls if they wanted.
    • user3109504
      user3109504 over 8 years
      URLs were designed when dinosaurs rendered on Unix systems roamed the Earth, and Unix is case sensitive.
    • MrWhite
      MrWhite over 8 years
      Wikipedia tries to use the correct capitalisation for the subject title and uses redirects for common differences. eg. html, htm and Html all redirect to HTML. But importantly, because of the enormous subject matter, it's possible to have more than one page where the URL only differs by case. For example: Latex and LaTeX
    • edc65
      edc65 over 8 years
      Your premise is wrong as url are not case sensitive - see @Koby answer
    • MrWhite
      MrWhite over 8 years
      @edc65 But Kobi states that parts of the URL (notably the path) are case-sensitive - so, doesn't that make the URL (as a whole) case-sensitive?
    • hBy2Py
      hBy2Py over 8 years
      @w3dk If you take the URL monolithically, then yes it's case-sensitive. But if you're typing a URL into a browser address bar, functionally, only parts of it are case sensitive. Navigating to HTTP://WWW.GOOGLE.COM will work just fine.
    • TripeHound
      TripeHound over 8 years
      @kasperd Or, perhaps worse, the Turkish I, whose lowercase is ı, and i, whose uppercase is İ (see Wiki).
    • Admin
      Admin over 8 years
      Just a thought... maybe in the Wild West days of the internet, some forward thinking fellow said to himself, "Wow, if this thing really takes off, there will be a need for lots of unique addresses one day. If we make it case sensitive it will accommodate more people." To date, there are around one billion unique websites.
    • Pharap
      Pharap over 8 years
      @w3dk Technically that last letter isn't a Latin letter X, it's a Greek letter Χ (chi), so it should be more than just a matter of case.
    • Hagen von Eitzen
      Hagen von Eitzen over 8 years
      Short answer: Because "U" stands for "universal"
    • TripeHound
      TripeHound over 8 years
      @HagenvonEitzen Except it stands for "Uniform" (wiki)
  • user
    user over 8 years
    "Windows server was released after 2000." The Windows NT 3.1 team would have disagreed with you in 1993. NT 3.51 in 1995 was probably when NT started becoming mature and well-established enough to support business-critical server applications.
  • user3109504
    user3109504 over 8 years
    NT 3.51 had the Win 3.1 interface. Windows did not really take off until Windows 95, and it took NT 4.0 to get the same interface.
  • Mani
    Mani over 8 years
    Michael Kjörling, agreed. Let me modify it.
  • user
    user over 8 years
    @ThorbjørnRavnAndersen In the server market, NT 3.51 was reasonably successful. In the consumer/prosumer market, it took until Windows 2000 (NT 5.0) before the NT line started gaining serious traction.
  • MrWhite
    MrWhite over 8 years
    "Yes. to avoid duplicate content issues." - But the opposite would seem to be true? The fact that URLs can be case-sensitive (and this is how search engines treat them) causes the duplicate content issues you mention. If URLs were universally case-insensitive then there would be no duplicate content issues with differing case. page-1 would be the same as PAGE-1.
  • MrWhite
    MrWhite over 8 years
    "URLs are not case-sensitive." / "The rest of the URL is case-sensitive." - This would seem to be a contradiction?
  • O. Jones
    O. Jones over 8 years
    In truth, the scheme defines what to expect in the rest of the URL. http: and related schemes mean that the URL refers to a DNS hostname. DNS was ASCII case-insensitive long before the invention of URLs. See page 55 of ietf.org/rfc/rfc883.txt
  • MrWhite
    MrWhite over 8 years
    @OllieJones But any http: URLs are also likely to have additional path/data which is potentially case-sensitive. (?)
  • closetnoc
    closetnoc over 8 years
    Nicely detailed! I was going from a historical point of view. It was originally the file path that was required to be case sensitive, and only if you were hitting the file system; otherwise, it was not. But today, things have changed. For example, parameters and CGI did not exist originally. Your answer takes a current-day perspective. I had to reward your efforts!! You really dug in on this one! Who knew this would blow up the way it did?? Cheers!!
  • MrWhite
    MrWhite over 8 years
    "URLs are not case-sensitive, only parts of them." - If "parts" of the element are case-sensitive then the whole must also be case-sensitive. (?)
  • Mike -- No longer here
    Mike -- No longer here over 8 years
    I think a poor server configuration is what can cause duplicate content when it comes to casing. For example, the statement RewriteRule ^request-uri$ /targetscript.php [NC] stored in .htaccess would match http://example.com/request-uri and http://example.com/ReQuEsT-Uri because the [NC] indicates that casing doesn't matter when evaluating that one regular expression.
  • TOOGAM
    TOOGAM over 8 years
    Updated. Reviewed every case of "browsers" and made multiple replacements. Thank you for pointing this out so some quality could be improved.
  • Steve Jessop
    Steve Jessop over 8 years
    @w3dk: it's a not-very-interesting quirk of terminology, but you could take "case-sensitive" to mean, "changing the case of a character can change the whole", or you could take it to mean, "changing the case of a character always changes the whole". Kobi seems to be asserting the latter, he prefers that case-sensitive should mean "any change in case is significant", which of course is not true of URLs. You prefer the former. It's just a matter of how sensitive they are to case.
  • closetnoc
    closetnoc over 8 years
    Some sites use some kind of mechanism to convert any query to all lowercase or something consistent. In a way, this is smart.
  • William Hay
    William Hay over 8 years
    While only a subset of ASCII can be used unencoded in a URL, RFC 1738 specifically states characters outside the ASCII range may be used encoded. Without specifying a charset, it isn't possible to know which octets represent the same character except for case. Updated.
  • rybo111
    rybo111 over 8 years
    @w3dk https://google.com/ is a URL and it is not case sensitive. Paths are optional, and they are case-sensitive. Much like water is not poisonous until poison is added :)
  • supercat
    supercat over 8 years
    @rybo111: If a user types example.com/fOObaR, the spec requires that the server at www.example.com receive a path "/fOObaR" as given; it is silent on the question of whether the server must treat that any differently from "/foOBaR".
  • SirNickity
    SirNickity over 8 years
    No, they shouldn't. This functionality can be, and often is, added in when it is desirable (e.g., by modules in apache.) To impose this kind of change as default behavior -- or worse, immutable behavior -- would be more disruptive than the relatively rare occasion where someone has to manually type in a URL beyond the host name. For a good example of why not to do this, recall the fiasco when Network Solutions "fixed" non-existent domain errors from public DNS queries.
  • Kevin
    Kevin over 8 years
    Re #4: It's actually worse than that. Dotted and dotless I are a demonstration of the more general principle that, even if everything is UTF-8 (or some other UTF), you cannot capitalize or lowercase correctly without knowing the locale to which the text belongs. In the default locale, a capital Latin letter I lowercases to a lowercase Latin letter i, which is wrong in Turkish because it adds a dot (there is no "Turkish capital dotless I" code point; you're meant to use the ASCII code point). Throw in encoding differences, and this goes from "really hard" to "completely intractable."
  • Dan Cieslak
    Dan Cieslak over 8 years
    @SirNickity Nobody was proposing immutability at any level and webserver error pages are configurable on every webserver I've ever used; nobody was suggesting replacing 404 with 30* codes, but rather adding a list of human-clickable suggestion links to the error page; domain names are a very different topic and issue being case-insensitive, and in a different security context; and IIS already automatically "fixes" (by ignoring) case-differences in the path or filename parts of URIs.
  • underscore_d
    underscore_d over 8 years
    actually explaining what is case-sensitive and why makes this superior to the higher-voted answer imho.
  • reinierpost
    reinierpost over 8 years
    Indeed, the WorldWideWeb was initially developed on Unix-based systems, which have case-sensitive file systems, and most URLs mapped directly to files on the file system.
  • reinierpost
    reinierpost over 8 years
    Since 1996, Apache has let you do this with mod_speling. It just doesn't seem to be a very popular thing to do. Unix/Linux people see case sensitivity as the rule, case insensitivity as the exception.
  • reinierpost
    reinierpost over 8 years
    End users were Unix users as well (not necessarily programmers, but high-energy physicists and the like), so they too were accustomed to case sensitivity.
  • William Hay
    William Hay over 8 years
    I think your argument is flawed. While Berners-Lee didn't have any choice about the case sensitivity of ftp URLs, he got to design http URLs. He could have specified them as US-ASCII only and case insensitive. If there ever were any web servers that just passed the URL path to the file system, then they were insecure, and the introduction of URL encoding broke compatibility with them. Given that the path is being processed before it is handed to the OS, smashing case would have been easy to implement. Therefore I think we have to regard this as a design decision, not an implementation quirk.
  • closetnoc
    closetnoc over 8 years
    @WilliamHay This has nothing to do with Berners-Lee or the design of the web. It is about limitations and requirements of the OS. I am a retired systems internals engineer. I worked on these systems at the time. I am telling you exactly why URLs are case sensitive. It is not a guess. It is not an opinion. It is a fact. My answer was intentionally simplified. Of course there are file checks and other processes that can be done prior to issuing any open statement. And Yes(!) web servers are partially insecure still to this day as a result.
  • William Hay
    William Hay over 8 years
    Whether URLs are case sensitive has nothing to do with the design of the web? Really? Argument from Authority followed by Argument by Assertion. That web servers pass the path component of a URL more or less directly to an open call is a consequence of the design of URLs not a cause of it. Servers (or smart clients in the case of FTP) could have hidden the case sensitivity of filesystems from the user. That they don't is a design decision.
  • closetnoc
    closetnoc over 8 years
    @WilliamHay You need to slow down, grasshopper, and reread what I have written. I am a retired systems internals engineer who wrote OS components, protocol stacks, and router code for the ARPANET, etc. I worked with Apache, O'Reilly, and IIS internals. Your FTP argument does not hold water, as at least the major FTP servers remain case sensitive for the same reason. At no time did I say anything about the design of URL/URI. At no time did I say web servers passed values without processing. I did say that OS services are commonly used and that the file system requires an exact match to succeed.
  • closetnoc
    closetnoc over 8 years
    @WilliamHay Please understand that you and I are thinking at cross-purposes. All I was saying in my answer is that for some OSes, file system calls are case sensitive by design. Applications that use system calls, and most do, are limited to the enforcement of the OS rules - in this case, case sensitivity. It is not impossible to bypass this rule. In fact this may be somewhat trivial in some cases though not practical. I used to routinely bypass the file system in my work to unscramble hard drives that went kablooie for one reason or another or to analyze database file internals, etc.
  • Kyle
    Kyle over 8 years
    I have received several excellent answers to my question, ranging from the historical to the technical. I am hesitant to go against the grain and accept a lower-rated answer, but @TOOGAM's answer was the most helpful to me. This answer is thorough and extensive yet it explains the concept in an uncomplicated, conversational fashion that I can understand. And I think this answer is a good introduction to the more in-depth explanations.
  • RichardP
    RichardP almost 4 years
    The reason Windows has a case-insensitive filesystem is due to it's DOS heritage. MS-DOS started life on computers like the Tandy TRS-80, which used a TV as the display, and did not originally support lower-case letters due to the lack of resolution. Since it couldn't display lower-case, mixed-case wasn't supported. MS-DOS was licensed by IBM to become the original PC-DOS. While the original PC could display lower-case, the filesystem was ported over as-is from MS-DOS.