Web crawler using Perl
Several points: your URL parsing is fragile, so you certainly won't pick up relative links. Also, you don't test for 100 visited links but for 100 matches of the current URL, which almost certainly isn't what you mean. Finally, I'm not too familiar with LWP, so I'm going to show an example using the Mojolicious suite of tools.
This seems to work; perhaps it will give you some ideas.
#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

## open file to store links
open my $log, '>', 'extracted_links.txt' or die "Cannot open log: $!";

## starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');

my @urls = ($base);
my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    next if exists $visited{$url};

    print "$url\n";
    print $log "$url\n";

    $visited{$url} = 1;
    $url_count++;

    # find all <a> tags with an href and act on each
    $ua->get($url)->res->dom('a[href]')->each(sub {
        my $link = Mojo::URL->new($_->{href});
        if ($link->is_abs) {
            # skip links that leave the seed's domain
            return unless $link->host eq $base->host;
        }
        else {
            # resolve relative links against the base
            $link = $link->base($base)->to_abs;
        }
        push @urls, $link;
    });

    last if $url_count == 100;

    sleep 1;
}
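Since the question started from LWP, here is a hedged sketch of how the fragile regex link extraction could instead be done with HTML::LinkExtor and URI (both are installed alongside LWP). The page content and URLs below are made up purely for illustration; the point is that URI->new_abs resolves relative hrefs against the page's own URL, and a simple host comparison keeps the crawl on the seed's domain:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical page content and base URL, purely for illustration.
my $base = 'http://stackoverflow.com/questions/1';
my $html = <<'HTML';
<a href="/questions/2">relative link</a>
<a href="http://stackoverflow.com/questions/3">absolute, same host</a>
<a href="http://example.com/">absolute, other host</a>
HTML

my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' && defined $attr{href};

    # new_abs resolves relative hrefs against the page's own URL
    my $url = URI->new_abs($attr{href}, $base);

    # keep only http(s) links on the same host as the seed
    return unless $url->scheme && $url->scheme =~ /^https?$/;
    push @links, $url if $url->host eq URI->new($base)->host;
});
$extor->parse($html);

print "$_\n" for @links;
```

Run against the sample markup, this keeps the relative link (now absolutized) and the same-host absolute link while dropping the example.com one; in a real crawler you would feed it $response->content instead of the literal string.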
user2154731

Updated on September 16, 2022

Comments

- user2154731, over 1 year:
I want to develop a web crawler which starts from a seed URL, crawls 100 HTML pages belonging to the same domain as the seed URL, and keeps a record of the traversed URLs while avoiding duplicates. I have written the following, but the $url_count value does not seem to be incremented, and the retrieved URLs contain links even from other domains. How do I solve this? Here I have used stackoverflow.com as my starting URL.
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

##open file to store links
open my $file1, ">>", ("extracted_links.txt");
select($file1);

##starting URL
my @urls = 'http://stackoverflow.com/';

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    if (exists $visited{$url}) ##check if URL already exists
    {
        next;
    }
    else
    {
        $url_count++;
    }
    my $request = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);
    if ($response->is_error()) {
        printf "%s\n", $response->status_line;
    }
    else {
        my $contents = $response->content();
        $visited{$url} = 1;
        @lines = split(/\n/, $contents);
        foreach $line (@lines) {
            $line =~ m@(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g;
            print "$1\n";
            push @urls, $$line[2];
        }
        sleep 60;
        if ($visited{$url} == 100) {
            last;
        }
    }
}
close $file1;
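On the counter bug specifically: $visited{$url} is only ever set to 1, so the test $visited{$url} == 100 can never be true; the loop has to compare the counter itself. Here is a minimal sketch of just the dedupe-and-count logic, with no fetching; the URLs are fabricated to demonstrate duplicates being skipped:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Fabricated frontier with deliberate duplicates; names mirror the question's code.
my @urls = map { "http://stackoverflow.com/q/$_" } 1 .. 5, 1 .. 5;
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    next if $visited{$url}++;    # skip already-seen URLs
    $url_count++;                # count pages actually visited
    last if $url_count == 100;   # compare the counter, not $visited{$url}
}

print "visited $url_count unique URLs\n";    # 5 here: the duplicates are skipped
```

The post-increment in the next line doubles as both the seen-check and the marking step, which avoids the separate if/else branches in the original.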