How to get all webpages on a domain


Solution 1

If a site wants you to be able to do this, they will probably provide a Sitemap. Using a combination of a sitemap and following the links on pages, you should be able to traverse all the pages on a site - but this is really up to the owner of the site, and how accessible they make it.
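For example, a rough PHP sketch (assuming the sitemap sits at the conventional /sitemap.xml and is a plain <urlset> rather than a sitemap index) could look like this:

    <?php
    // Hypothetical domain; swap in the site you are interested in.
    $sitemap = 'https://www.example.com/sitemap.xml';

    // simplexml_load_file() returns false if the sitemap is missing or invalid.
    $xml = @simplexml_load_file($sitemap);
    if ($xml === false) {
        die("No usable sitemap at $sitemap\n");
    }

    // A standard sitemap is a <urlset> of <url> entries, each with a <loc>.
    foreach ($xml->url as $entry) {
        echo (string) $entry->loc, "\n";
    }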

If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.

Solution 2

As you have said, you must follow all the links.

To do this, you must start by retrieving stackoverflow.com, which is easy: file_get_contents("http://stackoverflow.com").

Then parse its contents, looking for links: <a href="question/ask">, not so easy.
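A minimal sketch of that parsing step, using PHP's built-in DOMDocument rather than regular expressions, might look like this:

    <?php
    // Fetch the page and collect every href on it.
    $html = file_get_contents('http://stackoverflow.com');

    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // suppress warnings about sloppy real-world HTML

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;        // may be relative, e.g. "question/ask"
        }
    }

    print_r($links);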

You store those new URLs in a database and then parse them afterwards, which will give you a whole new set of URLs; parse those too. Soon enough you'll have the vast majority of the site's content, including things like sub1.stackoverflow.com. This is called crawling, and it is quite simple to implement, although not so simple to retrieve useful information once you have all that data.

If you are only interested in one particular domain, be sure to dismiss links to external sites.
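Putting it together, a very stripped-down crawl loop (using an in-memory queue instead of a database, and only keeping links that stay on the one host you care about) could look roughly like this:

    <?php
    // Toy breadth-first crawler: stays on one host, remembers what it has seen.
    $host    = 'stackoverflow.com';
    $queue   = array('http://' . $host . '/');
    $visited = array();

    while (!empty($queue) && count($visited) < 100) {   // cap to keep the sketch small
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href  = $a->getAttribute('href');
            $parts = parse_url($href);
            // Only follow absolute links on the same host here; real code
            // would also resolve relative links before deciding.
            if (isset($parts['host']) && $parts['host'] === $host) {
                $queue[] = $href;      // dismiss links to external sites
            }
        }
    }

    print_r(array_keys($visited));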

Solution 3

You would need to hack the server, sorry.

What you can do, if you own the domain www.my-domain.com, is put a PHP file there that you use as an on-demand request script. In that PHP file you will need to write some code that can look through the folders FTP-wise. PHP can connect to an FTP server, so that's a way to go :)

http://dk1.php.net/manual/en/book.ftp.php

With PHP you can read the directories and folders and return them as an array. Best I can do.
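A rough sketch of such an on-demand PHP file, using the FTP functions from the manual page above (the host, credentials and directory are placeholders), could be:

    <?php
    // list-files.php - hypothetical on-demand script that lists a directory
    // over FTP and returns the result as an array (here, encoded as JSON).
    $conn = ftp_connect('ftp.my-domain.com');               // placeholder host
    if (!$conn || !ftp_login($conn, 'username', 'password')) {
        die('Could not connect/log in to the FTP server');
    }
    ftp_pasv($conn, true);                                   // passive mode is usually safer

    // ftp_nlist() returns the names in the given directory as an array.
    $files = ftp_nlist($conn, '/public_html');
    ftp_close($conn);

    header('Content-Type: application/json');
    echo json_encode($files);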

Author: William The Dev (C, PHP, SQL(MySQL))

Updated on June 04, 2022
