Tor Web Crawler

18,222

Solution 1

cURL also supports SOCKS connections; try this:

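<?php

$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, true); // include response headers in the output

// Tor's SOCKS5 listener. The _HOSTNAME variant makes the proxy resolve the
// host name, which is required for .onion addresses.
curl_setopt($ch, CURLOPT_PROXY, 'localhost:9050');
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);

curl_exec($ch); // prints the response
curl_close($ch);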
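
For a crawler that has to fetch .onion pages, the same idea can be wrapped in a small helper. This is only a sketch: torCurlOptions() and fetchViaTor() are hypothetical names, and it assumes Tor's SOCKS listener is at 127.0.0.1:9050 and a PHP build where CURLPROXY_SOCKS5_HOSTNAME is defined (PHP 5.5.23 / 5.6.7 or later, with the curl extension).

```php
<?php

/**
 * Build cURL options that send a request through Tor's SOCKS5 port.
 * Hypothetical helper; the default proxy address is an assumption.
 */
function torCurlOptions(string $proxy = '127.0.0.1:9050', int $timeout = 60): array
{
    return [
        CURLOPT_PROXY          => $proxy,
        // SOCKS5_HOSTNAME lets Tor resolve the host name, which .onion needs.
        CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5_HOSTNAME,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ];
}

/**
 * Fetch a URL through Tor. Returns the response body, or null on failure.
 */
function fetchViaTor(string $url, string $proxy = '127.0.0.1:9050'): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, torCurlOptions($proxy));
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    if ($body === false) {
        error_log("Tor request to $url failed: $error");
        return null;
    }
    return $body;
}
```

In a script such as crawl.php, a call like fetchViaTor($url) could then stand in for each place a page is downloaded; whether that fits the crawler's structure is a guess, since its source is only linked.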

Solution 2

Unless I'm missing something, the answer is yes, and here is some documentation on the Tor site. The instructions are pretty specific. I haven't set Tor up as a proxy myself, but it's something I've considered, and this is the place I would start.

EDIT: It is dead simple to set up Tor on Linux and use it as a proxy, as the documentation suggests.

sudo apt-get install tor
sudo /etc/init.d/tor start

netstat -ant | grep 9050 # verify Tor is running

Now, looking through the OP's code, we see calls to file_get_contents. It is the easiest method to use at first, but file_get_contents becomes cumbersome once you want to start parametrizing the request, because you have to use stream contexts.

My first suggestion is to move to cURL, though more reading on how SOCKS works with HTTP is probably in order to truly answer this question. But to answer the question technically (how to send an HTTP request to a Tor SOCKS proxy on localhost), that looks easy too:

<?php
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, true);
// Naive attempt: treat Tor's SOCKS port as if it were an HTTP proxy.
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, true);
curl_setopt($ch, CURLOPT_PROXY, 'http://127.0.0.1:9050/');
curl_exec($ch);
curl_close($ch);

But what does Tor tell us?

HTTP/1.0 501 Tor is not an HTTP Proxy

Content-Type: text/html; charset=iso-8859-1

Basically, learn more about SOCKS & HTTP. Another option is to google around for PHP SOCKS clients. A quick inspection reveals a library that claims it can send HTTP requests over SOCKS.

EDIT:

Alright, one more edit! Seconds after finishing my last post, I found a way to do it. This article shows us how to set up something called Privoxy, an HTTP proxy that can forward the requests it receives on to a SOCKS proxy. Put that in front of Tor and blamo, we're sending proxied HTTP requests through Tor!
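
With Privoxy in front of Tor, existing file_get_contents calls can stay and only need a stream context. A minimal sketch, assuming Privoxy listens on its default 127.0.0.1:8118 and that its config file contains a forwarding line such as "forward-socks5 / 127.0.0.1:9050 ." (including the trailing dot); privoxyContextOptions() and fetchViaPrivoxy() are hypothetical names.

```php
<?php

/**
 * Stream context options that route HTTP requests through an HTTP proxy.
 * Hypothetical helper; the default address assumes a stock Privoxy install.
 */
function privoxyContextOptions(string $proxy = 'tcp://127.0.0.1:8118', float $timeout = 60.0): array
{
    return [
        'http' => [
            'proxy'           => $proxy,
            // Send the full URL in the request line, as HTTP proxies expect.
            'request_fulluri' => true,
            'timeout'         => $timeout,
            'ignore_errors'   => true, // still return the body on 4xx/5xx
        ],
    ];
}

/**
 * Fetch a URL through Privoxy (and so through Tor). Returns null on failure.
 */
function fetchViaPrivoxy(string $url, string $proxy = 'tcp://127.0.0.1:8118'): ?string
{
    $context = stream_context_create(privoxyContextOptions($proxy));
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

The design choice here is that the target host name is handed to the proxy as part of the URL, so PHP itself never has to resolve a .onion name.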

Solution 3

You have to intercept the DNS lookup requests from the PHP script by configuring Tor with the DNSPort directive. Then you have to configure a TransPort for Tor and a VirtualAddrNetworkIPv4 range, with AutomapHostsOnResolve enabled. Now, when your PHP script does a DNS lookup through Tor, Tor sees a request for a .onion address and answers with an IP address from the VirtualAddrNetworkIPv4 range. You then have to redirect the traffic going to that address to the port defined with TransPort. Read the torrc manual on AutomapHostsOnResolve, VirtualAddrNetworkIPv4 (VirtualAddrNetwork in older releases), DNSPort and TransPort.
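
As a rough illustration, the directives described above could look like this in torrc, written with the spellings the torrc manual uses. The address range and port numbers are example values chosen for this sketch, not requirements.

```
# /etc/tor/torrc (sketch; range and ports are example values)
VirtualAddrNetworkIPv4 10.192.0.0/10   # pool that .onion names are mapped into
AutomapHostsOnResolve 1                # answer .onion lookups with an address from the pool
TransPort 9040                         # transparent proxy port
DNSPort 5353                           # Tor's DNS listener
```

This covers only Tor's side. Pointing the machine's resolver at the DNSPort and redirecting traffic for the virtual range to the TransPort still has to be done with the system's own resolver and firewall settings.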

Solution 4

I think it is as simple as running your command-line request with the usewithtor or torify command. For example:

$ usewithtor php crawl.php

And the script will be able to interact with .onion sites. Having built a crawler for Tor myself, I definitely would not go this route for production use; I use Python, PySocks, and other crawler libraries instead of cURL. Hopefully this answers your question and gives you some ideas for other implementation strategies moving forward.

Thanks

Author by

user1203301

Updated on July 17, 2022

Comments

  • user1203301
    user1203301 almost 2 years

    Ok, here's what I need. I have a PHP-based web crawler. It is accessible here: http://rz7ocnxxu7ka6ncv.onion/ Now, my problem is that the spider that actually crawls pages needs to do so through SOCKS port 9050. The thing is, I have to tunnel its connection through Tor so that it can resolve .onion domains, which is what I'm indexing. (Only domains ending in .onion.) I call this script from the command line using php crawl.php, and I add the appropriate parameters to crawl the page. Here is what I think: Is there any way to force it to use Tor? Or can I force my ENTIRE MACHINE to tunnel things through Tor, and how? (Like forcing all traffic through 127.0.0.1:9050.) Perhaps if I set up global proxy settings, PHP would respect them?

    If any of my solutions work, how would I do it? (Step by step instructions please, I am a noob.)

    I just want to create my own Tor search engine. (Don't recommend P2P search engines to me; that's not what I want for this. I know they exist, I did my homework.) Here is the crawler source, if you are interested in taking a look. Perhaps someone with a kind heart can modify it to use 127.0.0.1:9050 for all crawling requests? http://pastebin.com/kscGJCc5

  • user1203301
    user1203301 over 12 years
    I've read that article hundreds of times over the past week. It does not work, trust me.
  • quickshiftin
    quickshiftin over 12 years
    I updated my answer. It's super-easy to send requests to Tor on localhost, but the challenge is sending HTTP requests over a SOCKS connection. See the end of the revised answer that points to a library claiming it can do just that.
  • quickshiftin
    quickshiftin over 12 years
    OK, seconds later I found something called Privoxy, now sending proxied HTTP requests through Tor. Thanks for pushing me, this is something I'd wanted to figure out anyways.
  • user1203301
    user1203301 over 12 years
    Privoxy is the only thing I have yet to try. I am going to see if I can run that PHP crawler's requests through Tor. I'll report back if your method works. :)
  • uınbɐɥs
    uınbɐɥs almost 11 years
    This does not answer the question.
  • nKn
    nKn over 10 years
    Adding an example would be great; putting all that together may be harder for an inexperienced user than following an example.
  • JaseC
    JaseC over 9 years
    I actually like this because some may only have access to lots of places that run php instead of having access to one dedicated/VPS where they can install privoxy. If you have say a dozen hosting accounts with different ips you could set up your own small proxy network.