WebRequest "HEAD" light weight alternative


Solution 1

You'll have to clarify what you mean by "lightweight". What are you trying to accomplish?

Whether or not you can use GET/POST/HEAD/DELETE/etc will depend on the URL and what's configured in the application that is running on the server at that URL.

If all you're trying to do is see whether you can make a connection without actually downloading the content, you could try opening a connection to port 80 using sockets, but there isn't really a reliable or universally supported way to do that just by changing the HTTP method.
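A minimal sketch of that socket check, assuming plain HTTP on port 80. `ConnectionProbe` and `CanConnect` are made-up names for illustration. Note that a successful connection only shows that something is listening on the port; it says nothing about whether a particular URL exists.

```csharp
using System;
using System.Net.Sockets;

static class ConnectionProbe
{
    // Returns true if a TCP connection to host:port opens within timeoutMs.
    public static bool CanConnect(string host, int port, int timeoutMs)
    {
        try
        {
            using (var client = new TcpClient())
            {
                // ConnectAsync plus Wait gives a connect timeout, which the
                // synchronous Connect call does not offer.
                return client.ConnectAsync(host, port).Wait(timeoutMs);
            }
        }
        catch (Exception)
        {
            // DNS failure, refused connection, and similar errors.
            return false;
        }
    }
}
```

For example, `ConnectionProbe.CanConnect("www.imdb.com", 80, 1000)` would tell you whether the host accepts connections at all before you spend time on per-link checks.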

Solution 2

Open the connection yourself with a socket (instead of an HttpWebRequest or WebClient), and close the stream as soon as you've read the status code. Fortunately the status code is on the very first line of the response :)
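Here is a rough sketch of that idea, under the assumption that the target speaks plain HTTP on port 80 (HTTPS would need an `SslStream` on top, which is not shown). `StatusProbe`, `ParseStatusCode` and `GetStatusCode` are illustrative names, not part of any library.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class StatusProbe
{
    // Parses "HTTP/1.1 405 Method Not Allowed" into 405; returns -1 if malformed.
    public static int ParseStatusCode(string statusLine)
    {
        if (statusLine == null) return -1;
        string[] parts = statusLine.Split(' ');
        int code;
        if (parts.Length >= 2
            && parts[0].StartsWith("HTTP/", StringComparison.Ordinal)
            && int.TryParse(parts[1], out code))
        {
            return code;
        }
        return -1;
    }

    // Sends a minimal GET and reads only the status line before closing.
    public static int GetStatusCode(string host, string path, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            client.SendTimeout = timeoutMs;
            client.ReceiveTimeout = timeoutMs;
            client.Connect(host, 80);
            using (NetworkStream stream = client.GetStream())
            {
                string request = "GET " + path + " HTTP/1.1\r\n" +
                                 "Host: " + host + "\r\n" +
                                 "Connection: close\r\n\r\n";
                byte[] bytes = Encoding.ASCII.GetBytes(request);
                stream.Write(bytes, 0, bytes.Length);

                // The first line of the response is the status line.
                var reader = new StreamReader(stream, Encoding.ASCII);
                return ParseStatusCode(reader.ReadLine());
            }
        }
    }
}
```

The reader may buffer a little more than the first line, and the server may already have sent part of the body by the time the connection closes, so this reduces traffic rather than eliminating it.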

Solution 3

If HEAD returns a 405, that means the server doesn't support HEAD (at least for that URL) and you'll have to fall back to GET instead. The majority of sites should support HEAD, so you probably want to do HEAD by default, but if it throws a 405, you could maybe fall back to GET for that domain. Or maybe you want to try HEAD first for each request; YMMV.
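One way the HEAD-then-GET fallback might look with `HttpWebRequest`. This is a sketch, not a tested link checker: `LinkChecker`, `Send` and `Check` are made-up names, and the sender is passed in as a delegate only so the fallback logic can be exercised without a network.

```csharp
using System;
using System.Net;

static class LinkChecker
{
    // Issues one request and returns its status code, including error codes.
    public static HttpStatusCode Send(string url, string method, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        request.Timeout = timeoutMs;
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // 4xx and 5xx responses surface as a WebException with a Response attached.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse == null) throw; // timeout, DNS failure, etc.
            using (errorResponse)
            {
                return errorResponse.StatusCode;
            }
        }
    }

    // Tries HEAD first; retries with GET only when the server answers 405.
    public static HttpStatusCode Check(string url, Func<string, string, HttpStatusCode> send)
    {
        HttpStatusCode status = send(url, "HEAD");
        return status == HttpStatusCode.MethodNotAllowed ? send(url, "GET") : status;
    }
}
```

A call could look like `LinkChecker.Check(url, (u, m) => LinkChecker.Send(u, m, 5000))`. Remembering which hosts answered 405 and skipping HEAD for them would save a round trip per link, at the cost of some bookkeeping.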

If the server requires GET and you want to reduce network traffic, you could try doing a conditional GET (the If-Modified-Since or If-None-Match headers) and/or a partial GET (the Range header); see e.g. RFC 2616. I've never tried doing those with WebRequest but I think it lets you add custom outgoing HTTP headers, so you should be able to do it.
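For the partial GET, `HttpWebRequest` has an `AddRange` method that sets the Range header for you. A small sketch, assuming you only want the first byte; `PartialGet` and `CreateRangeRequest` are illustrative names.

```csharp
using System;
using System.Net;

static class PartialGet
{
    // Builds a GET request that asks for only the first byte of the resource.
    // Nothing is sent until GetResponse() is called on the result.
    public static HttpWebRequest CreateRangeRequest(string url, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Timeout = timeoutMs;
        request.AddRange(0, 0); // sends "Range: bytes=0-0"
        return request;
    }
}
```

A server that honors the header answers 206 (Partial Content) with a one-byte body. A server is free to ignore Range and answer 200 with the full body, so this is an optimization rather than a guarantee.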

Also, don't forget that, if you're writing a spider (which you clearly are), you should respect the server's robots.txt, and it's also courteous to throttle your requests to something like one request every two seconds, so you don't slashdot the server.

Author by

Serapth

National man of non-mystery.

Updated on June 11, 2022

Comments

  • Serapth
    Serapth almost 2 years

    I recently discovered that the following does not work with certain sites, such as IMDB.com.

    using System;
    using System.Net;

    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                System.Net.WebRequest wc = System.Net.WebRequest.Create("http://www.imdb.com"); //args[0]);

                ((HttpWebRequest)wc).UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.2.153.1 Safari/525.19";
                wc.Timeout = 1000;
                wc.Method = "HEAD";
                WebResponse res = wc.GetResponse();
                var streamReader = new System.IO.StreamReader(res.GetResponseStream());

                Console.WriteLine(streamReader.ReadToEnd());
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
    

    It returns an HTTP 405 (Method Not Allowed). My problem is that I use code very similar to the above to check whether a link is valid, and the vast majority of the time it works correctly. I can switch the method to GET and it works (with an increased timeout), but this slows things down by an order of magnitude. I am assuming the 405 response comes from a configuration on IMDB's servers.

    Is there a way for me to do the same thing as above, in a lightweight manner in .NET? Or is there a way to fix the above code so that it works as a GET request against IMDB?

    • Joe White
      Joe White about 13 years
      I had to increase the timeout, but the code you posted above works for me. Changing it to POST would make no sense, because you don't have any data to post. And your title talks about HEAD, but you're not doing a HEAD request. Please clarify what the question is, since your "broken" code works fine.
    • Serapth
      Serapth about 13 years
      Ug, really stupid typo in the title. Fixed now... classic example of think one thing and type another. When you run the above code, you aren't getting a 405 response? EDIT: Ok, realized even my code was flawed. The above is what I meant to post, and is edited to give the 405 error ( and make sense..... )
  • Serapth
    Serapth about 13 years
    Well, essentially what I am using HEAD requests for now is a) to check if a site actually exists and b) if a site exists, to verify that each link within it actually exists (therefore each image, style sheet, etc.). Therefore, on some image-heavy pages, it could literally be called hundreds of times. So, by lightweight I mean mostly network traffic.
  • Daniel Schaffer
    Daniel Schaffer about 13 years
    Right... the only more lightweight method I could think of in regards to bandwidth would be to use sockets to manually construct your HTTP requests, get back enough of the response to determine the HTTP status code, and then close the connection.
  • Serapth
    Serapth about 13 years
    Would going the route of hand-crafted HTTP actually circumvent the 405 error results? EDIT: Er, status results I should have said, I suppose technically HTTP 405 isn't actually an error. It's only a handful of sites that are returning 405, and I don't actually know what part is causing that response. Right now, I am assuming it's the HEAD request, but I am not sure.
  • Daniel Schaffer
    Daniel Schaffer about 13 years
    The HEAD request is what would be causing the issue. What I mean by the hand-crafted HTTP request is that you'd use a GET, which is what the server would expect, but since you'd be able to control what you download, you'd be able to download just the response headers and then terminate the connection before downloading the body.
  • Serapth
    Serapth about 13 years
    Thank you for the response. I'm not actually writing a spider, the end product is closer in nature to a web browser than anything else. I did as you suggested earlier ( HEAD request, then on 405 a full GET ), which is my current way of doing things but it is sub-optimal. I will look into partial GETs, that would probably be perfect. Thanks.