When webservers send a page, why don't they send all required CSS, JS, and images without being asked?

Solution 1

The short answer is "Because HTTP wasn't designed for it".

Tim Berners-Lee did not design an efficient and extensible network protocol. His one design goal was simplicity. (The professor of my networking class in college said that he should have left the job to the professionals.) The problem that you outline is just one of the many problems with the HTTP protocol. In its original form:

  • There was no protocol version, just a request for a resource
  • There were no headers
  • Each request required a new TCP connection
  • There was no compression

The protocol was later revised to address many of these problems:

  • Requests were versioned; they now look like GET /foo.html HTTP/1.1
  • Headers were added for meta information with both the request and response
  • Connections were allowed to be reused with Connection: keep-alive
  • Chunked responses were introduced to allow connections to be reused even when the document size is not known ahead of time.
  • Gzip compression was added

At this point HTTP has been taken about as far as possible without breaking backwards compatibility.
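
For illustration, here is roughly what a revised HTTP/1.1 exchange looks like when it uses those features (a sketch; the host name and header values are made up):

GET /foo.html HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip
Connection: keep-alive

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Transfer-Encoding: chunked
Connection: keep-alive

(gzip-compressed body sent in chunks; the connection stays open for further requests)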

You are not the first person to suggest that a page and all its resources should be pushed to the client. In fact, Google designed a protocol that can do so, called SPDY.

Today both Chrome and Firefox can use SPDY instead of HTTP when talking to servers that support it. From the SPDY website, its main features compared to HTTP are:

  • SPDY allows client and server to compress request and response headers, which cuts down on bandwidth usage when the similar headers (e.g. cookies) are sent over and over for multiple requests.
  • SPDY allows multiple, simultaneously multiplexed requests over a single connection, saving on round trips between client and server, and preventing low-priority resources from blocking higher-priority requests.
  • SPDY allows the server to actively push resources to the client that it knows the client will need (e.g. JavaScript and CSS files) without waiting for the client to request them, allowing the server to make efficient use of unutilized bandwidth.

If you want to serve your website with SPDY to browsers that support it, you can do so. For example Apache has mod_spdy.

SPDY has become the basis for HTTP version 2 with server push technology.

Solution 2

Your web browser doesn't know about the additional resources until it downloads the web page (HTML) from the server, which contains the links to those resources.

You might be wondering, why doesn't the server just parse its own HTML and send all the additional resources to the web browser during the initial request for the web page? It's because the resources might be spread across multiple servers, and the web browser might not need all those resources since it already has some of them cached, or may not support them.

The web browser maintains a cache of resources so it does not have to download the same resources over and over from the servers that host them. When navigating different pages on a website that all use the same jQuery library, you don't want to download that library every time, just the first time.

So when the web browser gets a web page from the server, it checks what linked resources it DOESN'T already have in the cache, then makes additional HTTP requests for those resources. Pretty simple, very flexible and extensible.
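
As a rough sketch of how that plays out on the wire, the first response for a shared resource carries caching headers, and later pages can reuse the local copy without any request at all (the host name, header values, and ETag below are illustrative):

GET /jquery.js HTTP/1.1
Host: cdn.example.com

HTTP/1.1 200 OK
Content-Type: application/javascript
Cache-Control: public, max-age=31536000
ETag: "abc123"

...jQuery source...

On subsequent pages that reference /jquery.js, the browser serves the file straight from its cache until the copy expires, and sends nothing to the server.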

A web browser can usually make several HTTP requests in parallel (historically two per server, more in modern browsers). This is not unlike AJAX: both are asynchronous ways of loading a page, one loading files asynchronously and the other loading content asynchronously. With keep-alive we can make several requests over one connection, and with pipelining we can make several requests without waiting for the responses. Both techniques are very fast because most of the overhead usually comes from opening and closing TCP connections:

(The original answer included diagrams illustrating keep-alive and pipelining.)
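
To sketch what those two diagrams show: with keep-alive and pipelining, the browser can send the requests for a page's resources back-to-back over a single connection without waiting for each response (file names are illustrative):

GET /index.html HTTP/1.1
Host: www.example.com
Connection: keep-alive

GET /style.css HTTP/1.1
Host: www.example.com
Connection: keep-alive

GET /image.jpg HTTP/1.1
Host: www.example.com
Connection: keep-alive

The server then returns the three responses in order over the same TCP connection.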

A bit of web history...

Web pages started as plain text email, with computer systems being engineered around this idea, forming a somewhat free-for-all communication platform; web servers were still proprietary at the time. Later, more layers were added to the "email spec" in the form of additional MIME types, such as images, styles, scripts, etc. After all, MIME stands for Multipurpose Internet Mail Extensions. Sooner or later we had what is essentially multimedia email communication, standardized web servers, and web pages.

HTTP requires that data be transmitted in the context of email-like messages, although the data most often is not actually email.

As technology like this evolves, it needs to allow developers to progressively incorporate new features without breaking existing software. For example, when a new MIME type is added to the spec - let's say JPEG - it will take some time for web servers and web browsers to implement it. You don't just suddenly force JPEG into the spec and start sending it to all web browsers; you allow the web browser to request the resources that it supports, which keeps everyone happy and the technology moving forward. Does a screen reader need all the JPEGs on a web page? Probably not. Should you be forced to download a bunch of JavaScript files if your device doesn't support JavaScript? Probably not. Does Googlebot need to download all your JavaScript files in order to index your site properly? Nope.

Source: I've developed an event-based web server like Node.js. It's called Rapid Server.

Solution 3

Because they do not know what those resources are. The assets a web page requires are coded into the HTML. Only after a parser determines what those assets are can they be requested by the user-agent.

Additionally, once those assets are known, they need to be served individually so that the proper headers (e.g. Content-Type) can be sent and the user-agent knows how to handle each one.
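
For example, each asset comes back as its own response with its own headers, so the user-agent knows how to treat it (a sketch with illustrative values):

HTTP/1.1 200 OK
Content-Type: text/css
Cache-Control: max-age=86400

...contents of style.css...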

Solution 4

Because, in your example, the web server would always send the CSS and images regardless of whether the client already has them, greatly wasting bandwidth (and thus making the connection slower instead of faster by reducing latency, which was presumably your intention). Note that CSS, JavaScript and image files are usually sent with very long expiry times for exactly that reason (when you need to change them, you just change the file name to force a new copy, which will again get cached for a long time).

Now, you can try to work around that waste of bandwidth by saying "OK, but the client could indicate that it already has some of those resources, so the server would not send them again". Something like:

GET /index.html HTTP/1.1
Host: www.example.com
If-None-Match: "686897696a7c876b7e"
Connection: Keep-Alive

GET /style.css HTTP/1.1
Host: www.example.com
If-None-Match: "70b26618ce2c246c71"

GET /image.png HTTP/1.1
Host: www.example.com
If-None-Match: "16d5b7c2e50e571a46"

And then only the files that have changed get sent, over one TCP connection (using HTTP pipelining over a persistent connection). And guess what? That is how it already works (you could also use If-Modified-Since instead of If-None-Match).
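
To make that concrete, if only the image had changed, the responses to the three requests above might look like this (the new ETag is made up):

HTTP/1.1 304 Not Modified
ETag: "686897696a7c876b7e"

HTTP/1.1 304 Not Modified
ETag: "70b26618ce2c246c71"

HTTP/1.1 200 OK
Content-Type: image/png
ETag: "a91c25c57f3e6d8b22"

...new PNG data...

The two 304 responses carry no body, so almost no bandwidth is spent on the unchanged files.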


But if you really want to reduce latency by wasting lots of bandwidth (as in your original request), you can do that today using standard HTTP/1.1 when designing your website. The reason most people don't do it is because they don't think it is worth it.

To do that, you do not need to have CSS or JavaScript in separate files; you can include them in the main HTML file using <style> and <script> tags (you probably do not even need to do it manually, your template engine can probably do it automatically). You can even include images in the HTML file using data URIs, like this:

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />

Of course, base64 encoding increases the bandwidth usage slightly, but if you don't care about wasted bandwidth, that should not be an issue.
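
And inlining the CSS and JavaScript mentioned above would look something like this (a minimal sketch; the style rule and script are placeholders):

<!DOCTYPE html>
<html>
  <head>
    <style>
      /* contents of style.css pasted here */
      body { margin: 0; }
    </style>
    <script>
      // contents of script.js pasted here
      console.log("inlined");
    </script>
  </head>
  <body>
    <img src="data:image/png;base64,..." alt="Red dot" />
  </body>
</html>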

Now, if you really cared, you could even make your web scripts smart enough to get the best of both worlds: on the first request (when the user does not have a cookie), send everything (CSS, JavaScript, images) embedded in one single HTML file as described above, add <link rel="prefetch"> tags for external copies of the files, and set a cookie. If the user already has a cookie (e.g. they have visited before), then send just a normal HTML page with <img src="example.jpg">, <link rel="stylesheet" type="text/css" href="style.css">, etc.

So on the first visit the browser would request just a single HTML file and get and show everything. Then it would (when idle) preload the specified external CSS, JS, and images. The next time the user visits, the browser would request and get only the changed resources (probably just the new HTML).

The extra CSS+JS+images data would only ever be sent twice, even if you clicked hundreds of times on the website. Much better than the hundreds of times your proposed solution would require. And it would never (not on the first visit, nor on later visits) use more than one latency-increasing round-trip.
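
In HTTP terms, that scheme looks roughly like this (the cookie name, value, and file names are made up for illustration):

First visit (no Cookie header in the request):

HTTP/1.1 200 OK
Content-Type: text/html
Set-Cookie: assets_cached=1; Max-Age=31536000

<html>
  <head>
    <style>/* all CSS inlined */</style>
    <script>/* all JS inlined */</script>
    <link rel="prefetch" href="style.css">
    <link rel="prefetch" href="script.js">
  </head>
  <body><img src="data:image/jpeg;base64,..." alt="photo"></body>
</html>

Return visit (the request contains Cookie: assets_cached=1):

HTTP/1.1 200 OK
Content-Type: text/html

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="style.css">
    <script src="script.js"></script>
  </head>
  <body><img src="example.jpg" alt="photo"></body>
</html>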

Now, if that sounds like too much work and you don't want to go with another protocol like SPDY, there are already modules like mod_pagespeed for Apache, which can automatically do some of that work for you (merging multiple CSS/JS files into one, auto-inlining small CSS and minifying them, making small inlined placeholder images while the originals load, lazy-loading images, etc.) without requiring you to modify a single line of your webpage.
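
For reference, turning on a few of those optimizations with mod_pagespeed looks roughly like this in the Apache configuration (a sketch; the exact set of available filters depends on the module version):

# pagespeed.conf
ModPagespeed on
# combine stylesheets, minify them, and inline the small ones
ModPagespeedEnableFilters combine_css,rewrite_css,inline_css
# combine scripts and lazy-load images below the fold
ModPagespeedEnableFilters combine_javascript,lazyload_images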

Solution 5

HTTP/2 is based on SPDY and does exactly what you suggest:

At a high level, HTTP/2:

  • is binary, instead of textual
  • is fully multiplexed, instead of ordered and blocking
  • can therefore use one connection for parallelism
  • uses header compression to reduce overhead
  • allows servers to “push” responses proactively into client caches

More is available in the HTTP/2 FAQ.
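
In practice, many servers trigger HTTP/2 push from Link headers on the response, so a stylesheet can start arriving before the browser has even parsed the HTML. A sketch, shown in HTTP/1.1 text form for readability (HTTP/2 itself is binary); whether the push actually happens depends on the server and the client:

HTTP/1.1 200 OK
Content-Type: text/html
Link: </style.css>; rel=preload; as=style
Link: </script.js>; rel=preload; as=script

A push-capable front end (for example nginx with http2_push_preload on) can turn those Link headers into pushed responses for /style.css and /script.js on the same connection.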

Comments

  • Ahmed
    Ahmed almost 2 years

    When a web page contains a single CSS file and an image, why do browsers and servers waste time with this traditional time-consuming route:

    1. browser sends an initial GET request for the webpage and waits for server response.
    2. browser sends another GET request for the css file and waits for server response.
    3. browser sends another GET request for the image file and waits for server response.

    When instead they could use this short, direct, time-saving route?

    1. Browser sends a GET request for a web page.
    2. The web server responds with (index.html followed by style.css and image.jpg)
    • closetnoc
      closetnoc over 9 years
      Any request cannot be made until the web page is fetched of course. After that, requests are made in order as the HTML is read. But this does not mean that only one request is made at a time. In fact, several requests are made but sometimes there are dependencies between requests and some have to be resolved before the page can be properly painted. Browsers do sometimes pause as a request is satisfied before appearing to handle other responses making it appear that each request is handled one at a time. The reality is more on the browser side as they tend to be resource intensive.
    • Corey Ogburn
      Corey Ogburn over 9 years
      I'm surprised nobody mentioned caching. If I already have that file I don't need it sent to me.
    • Ahmed
      Ahmed over 9 years
      @Corey: Well, in this case, your web browser can easily send a list of the cached resources -in a custom header- to the server in the initial GET request so the web server doesn't include these resources in its response. (i.e. a header like this: Cached-Resources: style.css; /images/image.jpg)
    • Corey Ogburn
      Corey Ogburn over 9 years
      This list could be hundreds of things long. Although shorter than actually sending the files, it's still pretty far from an optimal solution.
    • Ahmed
      Ahmed over 9 years
      Actually, I have never visited a web page that has more than 100 unique resources..
    • daredev
      daredev over 9 years
      @AhmedElsoobky: the browser doesn't know what resources can be sent as cached-resources header without first retrieving the page itself. It also would be a privacy and security nightmare if retrieving a page tells the server that I have another page cached, which is possibly controlled by a different organization than the original page (a multi-tenants website).
    • Ahmed
      Ahmed over 9 years
      @Lie: We shouldn't consider the current behaviour, because it would be changed of course to fit the new situation.. So here is a workaround: The browser will mark that the resource "image.jpg" belongs to example.com/mypage.htm, so when I visit that particular URL, My web browser will be aware of the related resources..etc
    • Ahmed
      Ahmed over 9 years
      i.e. the web browser will only include "image.jpg" into the Cached-Resources header when I visit example.com/mypage.htm...
    • daredev
      daredev over 9 years
      @AhmedElsobky: If the browser has marked that image.jpg belongs to mypage.htm, then that means the browser had visited the page before and hadn't cleared its cache. At that point it's better to just make sure you set caching headers properly so the browser can use the cached version, rather than bothering with silly headers that add thousands of extra bytes to every request. Also, HTTP is agnostic of its content type, while HTML is largely agnostic of the specifics of its transport protocol; a cached-resources header would have to cross this layer to be of any use.
    • Ahmed
      Ahmed over 9 years
      @Lie: Well, Web developers can mark the resources that are deployed in many webpages as Global (i.e. <img src="image.jpg" type="global">), that way the web browser will know if a cached resource should be sent in the Cached-Resources header always or only with specific URL/s..
    • Ahmed
      Ahmed over 9 years
      Also, Consider the Dynamic Web pages that have fixed resources with dynamic content.. (These are just side-problems, There is always a workaround)
    • daredev
      daredev over 9 years
      If you're already crossing the layer anyway, you can already implement this with the Cookie header: just return the resources embedded as "data:" URIs or, once it's out, use HTTP2 Push (note: HTTP2 Push solves a totally different problem than Websocket Push), or, if you are fine with a non-standard but popular extension, SPDY Push. It wouldn't make your life simpler if gathering to-be-cached resources were done automatically by the browser (it's a security nightmare for those that don't want it, which is the majority of users).
    • njzk2
      njzk2 over 9 years
      @AhmedElsobky: this very page already contains 32 items. some of the caching here is related to elements specific to the question (image posted by @perry), other related to stackexchange (logos...), and some are even more global, such as JS hosted on other servers.
    • Rich Bradshaw
      Rich Bradshaw over 9 years
      No one has mentioned HTTP/2's server push; this does exactly what you suggest.
  • closetnoc
    closetnoc over 9 years
    Dang good and informed answer! Web browsers are serial by nature and requests can be made rather rapidly. One look at a log file will show that requests for resources are made rather quickly once the HTML has been parsed. It is what it is. Not a bad system, just not as code/resource efficient as it could be.
  • Aran Mulholland
    Aran Mulholland over 9 years
    Especially if you use something like require.js. The browser only asks for what it needs. Imagine having to load everything all at once...
  • Jost
    Jost over 9 years
    Just for the record, SPDY is not the holy grail. It does some things well, but introduces other problems. Here is one article containing some points speaking against SPDY.
  • msouth
    msouth over 9 years
    I highly recommend that anyone interested in this read the criticisms in @Jost 's link. It gives you a hint of the complexity involved in figuring out how to do a very commonly implemented thing not just incrementally better but rather so much better that everyone starts using it. It's easy to think of an improvement that makes things somewhat better for a relatively large subset of use cases. To make things better in such a way that everyone starts using your new protocol because it's so much better that it's worth the cost of changing is another matter entirely, and not easy to do.
  • epluribusunum
    epluribusunum over 9 years
    This is the right answer, and one that most of the commenters seem to be missing - in order for the server to pro-actively send the resources, it needs to know what they are, which means the server would have to parse the HTML.
  • Ahmed
    Ahmed over 9 years
    Well, Actually, We can take care of all those side-problems (things like: Cache, Content-Type header..etc), there are Workarounds to solve these problems. And as I suggested in the comments on the post above, We can use something like this header> Cached-Resources: image.jpg; style.css; to solve the caching problem.. (If you do have time, then you can take a look at the comments above..)
  • perry
    perry over 9 years
    Yes that idea had crossed my mind before, but it's simply too much overhead for HTTP and it doesn't solve the fact that resources may be spread across multiple servers. Furthermore, I don't think your proposed time-saving method would actually save time because data is going to be sent as a stream no matter how you look at it, and with keep-alive, 100 simultaneous HTTP requests essentially becomes 1 request. The technology and capability you propose seems to already exist in a way. See en.wikipedia.org/wiki/HTTP_persistent_connection
  • Ahmed
    Ahmed over 9 years
    If we are going to develop a new protocol or fix an already existing one, we can take care of all these problems one way or another! And the web server will parse files only once and can then classify them according to defined rules so that it can prioritize which files to send first, etc. The web server doesn't have to know what I intend to do with those files; it just has to know what to send, when to send it, and depending on which rules. (Web bots and spiders aren't a problem, the behavior will be different with them - they have unique user-agent headers.)
  • The Spooniest
    The Spooniest over 9 years
    @AhmedElsobky: What you're talking about sounds more like a specific implementation than a network protocol. But it really does have to know what you intend to do with the files before it can determine what to send: otherwise it will inevitably send files that many users don't want. You can't trust User-Agent strings, so you can't use them to determine what the user's intent is.
  • MSalters
    MSalters over 9 years
    @ShantnuTiwari: The usual comparison is with TCP/IP v4, which disproves your FUD about professionals. That's so good that people use NAT just so they can avoid TCP/IP v6. However, professionals prefer to be paid for their work. Nobody ordered HTTP. Mr Berners-Lee was paid out of the CERN budget and did it pretty much on the side (HTTP isn't exactly quantum physics)
  • perry
    perry over 9 years
    I sense a bit of bias against HTTP here. As you said, HTTP/1.1 has persistent connections (keep-alive) which helps to directly address the problem and is now in use by many/most web servers and REST APIs today. Not sure why anyone would consider Tim Berners-Lee as something less than a professional...
  • daredev
    daredev over 9 years
    To be frank, there were no qualified professionals back then. Nobody knew how to build a world wide web, because nobody had ever built one. The concept of hypermedia wasn't invented by Tim; he had experience with various local-area hypermedia systems ten years before he wrote the "Information Management" proposal to solve the problem of "losing information" at CERN.
  • David Meister
    David Meister over 9 years
    But the question asks why the web server does not send the resources, not why the client can't ask for them at the same time. It's very easy to imagine a world where servers have a package of related assets that are all sent together that does not rely on parsing HTML to build the package.
  • perry
    perry over 9 years
    No qualified professionals back then? That can't be right; you must mean no qualified WWW engineers? In a pioneering phase, nobody has a full spec or design doc sitting in front of them, telling them what to do, or a list of certifications that say you're going to be great at this job (actually they do, it's just not exactly WWW certification because that doesn't exist yet). Tim used his professional experience with past systems and tech to propose his vision and execute it for CERN. Quite the big deal.
  • supercat
    supercat over 9 years
    @perry: What would you think of the idea of an alternative to https:// for sending large publicly-distributed files that need to be authenticated but not kept confidential: include in the URL a hash of certain parts of a legitimate reply's header, which could in turn include either a signature or a hash of the data payload, and have browsers validate the received data against the header? Such a design would not only save some SSL handshake steps, but would more importantly allow caching proxies. Get the URL via an SSL link, and the data could be fed from anywhere.
  • el.pescado - нет войне
    el.pescado - нет войне over 9 years
    I think this is the correct answer.
  • dhaupin
    dhaupin over 9 years
    Just throwing out another thought, HTML was originally designed for inline flags/modifiers/things. This is part of the reason for the difference between back then and now. Today, its more and more shared+team+open united assets. And @ShantnuTiwari that is funny as heck, agree with you :) I still prefer the slow-go pace as it is, at least for now in this turbulent era. It's the only thing we can "stand by", even if it's far more simplistic than we all wish for.
  • Shantnu Tiwari
    Shantnu Tiwari over 9 years
    @dhaupin its funny because it's true. :) There are two types of people in the world: Those who build things, and those who sit in the background and criticise the builders. Why, I never would have done it like that.
  • daredev
    daredev over 9 years
    @dhaupin: the semantic HTML is actually closer in spirit to the original HTML before its standardization (ploughing through the Historical Archive is quite a fun wayback if you had the time). The original idea of WWW since its conception was and has always been focused on links, link relations, and intelligent agents that can derive knowledge from tagged texts to understand the web of information, as well as users (originally CERN researchers) participating to collaborate to build that knowledge base.
  • Nij
    Nij over 9 years
    @DavidMeister Because the server doesn't always know what the client wants - a webcrawler for a search engine might not care about the CSS/JS, and there are many other resources linked in a document beyond those - there's no need to send the entire RSS feed down in the package to a web browser (most of the content is probably in the HTML already), while a feed reader might just parse the <head> element looking for the RSS alternate links to find just that - the client could send a list of what it's interested in, but then it needs to know what's available and we're back at the beginning
  • David Meister
    David Meister over 9 years
    @Zhaph-BenDuguid I'm talking about an alternative world, to highlight that the answer has as much to do with how the protocol works as anything else. Additionally, it may well be faster for a server to send all data at once even if it is unnecessary. You're essentially trading off latency issues against bandwidth usage.
  • Nij
    Nij over 9 years
    @DavidMeister Again, that's a trade-off that should really be made by the client; they are more likely to know what their local constraints are. If what you really want is alternate-reality answers, then your question should be "How can we modify servers and clients to minimise requests?" However, you can already do this now: inline all your CSS and JS, base64-encode your images and add those to the page, and lose out on the parallelization of requests to load your additional resources...