Web crawler using Perl

Several points: your URL parsing is fragile, so you certainly won't pick up relative links. Also, you aren't testing for 100 links but for 100 matches of the current URL, which is almost certainly not what you mean.
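
If you do want to keep your LWP-based loop, the link extraction itself doesn't have to be a regex. Here is a rough sketch of my own (not something from your code) using HTML::LinkExtor and URI, which resolves relative hrefs against the page URL and drops links that point off-host:

#!/usr/bin/env perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $base = URI->new('http://stackoverflow.com/');
my $ua   = LWP::UserAgent->new(timeout => 10);

my $response = $ua->get($base);
die $response->status_line unless $response->is_success;

# collect every <a href>, resolve it to an absolute URL, keep same-host links only
my $extor = HTML::LinkExtor->new;
$extor->parse($response->decoded_content);

my @links;
for my $link ($extor->links) {
  my ($tag, %attr) = @$link;
  next unless $tag eq 'a' and defined $attr{href};
  my $abs = URI->new_abs($attr{href}, $base);    # relative hrefs become absolute here
  next unless $abs->scheme =~ /^https?$/ and $abs->host eq $base->host;
  push @links, $abs->as_string;
}
print "$_\n" for @links;

As for the stopping condition, test how many pages you have visited ($url_count, or keys(%visited)) rather than $visited{$url}, which is only ever 1 and so can never reach 100.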

Finally, I'm not too familiar with LWP, so the main example I'll show uses the Mojolicious suite of tools instead. This seems to work; perhaps it will give you some ideas.

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;

##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = $base;

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
  my $url = shift @urls;
  next if exists $visited{$url};

  print "$url\n";
  print $log "$url\n";

  $visited{$url} = 1;
  $url_count++;         

  # find all <a> tags and act on each
  $ua->get($url)->res->dom('a')->each(sub{
    my $href = $_->{href} // return;    # skip <a> tags without an href
    my $link = Mojo::URL->new($href);
    if ( $link->is_abs ) {
      return unless defined $link->host and $link->host eq $base->host;
    }
    else {
      $link = $link->to_abs($url);      # resolve relative links against the current page
    }
    push @urls, $link;
  });

  last if $url_count == 100;

  sleep 1;
}
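
A couple of optional hardening tweaks you could layer on top; these are suggestions of mine, not something the crawler above needs in order to run: follow redirects, give the client a timeout like your LWP version had, and only hand the body to the DOM parser when the response really is an HTML page.

#!/usr/bin/env perl
use strict;
use warnings;

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
$ua->max_redirects(5);       # follow redirects instead of stopping at a 30x response
$ua->connect_timeout(10);    # mirror the 10-second timeout from your LWP version

my $res = $ua->get('http://stackoverflow.com/')->res;

# only parse the body if the fetch succeeded and returned HTML
if (($res->code // 0) == 200
    and ($res->headers->content_type // '') =~ m{text/html}) {
  $res->dom('a')->each(sub {
    my $href = $_->{href} // return;
    print "$href\n";
  });
}

The same checks drop straight into the while loop above, just before the ->dom('a') call.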

Comments

  • user2154731

    I want to develop a web crawler that starts from a seed URL, crawls the first 100 HTML pages it finds belonging to the same domain as the seed URL, and keeps a record of the traversed URLs while avoiding duplicates. I have written the following, but the $url_count value does not seem to be incremented, and the retrieved URLs contain links even from other domains. How do I solve this? Here I have used stackoverflow.com as my starting URL.

    use strict;
    use warnings;
    
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    
    
    ##open file to store links
    open my $file1,">>", ("extracted_links.txt");
    select($file1); 
    
    ##starting URL
    my @urls = 'http://stackoverflow.com/';
    
    my $browser = LWP::UserAgent->new('IE 6');
    $browser->timeout(10);
    my %visited;
    my $url_count = 0;
    
    
    while (@urls) 
    {
         my $url = shift @urls;
         if (exists $visited{$url}) ##check if URL already exists
         {
             next;
         }
         else
         {
             $url_count++;
         }         
    
         my $request = HTTP::Request->new(GET => $url);
         my $response = $browser->request($request);
    
         if ($response->is_error()) 
         {
             printf "%s\n", $response->status_line;
         }
         else
         {
             my $contents = $response->content();
             $visited{$url} = 1;
             @lines = split(/\n/,$contents);
             foreach $line(@lines)
             {
                 $line =~ m@(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g;
                 print "$1\n";  
                 push @urls, $$line[2];
             }
    
             sleep 60;
    
             if ($visited{$url} == 100)
             {
                last;
             }
        }
    }
    
    close $file1;