How can I extract URL and link text from HTML in Perl?

Solution 1

Take a look at the WWW::Mechanize module for this. It fetches your web pages for you and then gives you easy-to-work-with lists of URLs.

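use WWW::Mechanize;

# $some_url holds the address of the page to fetch
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
# links() returns one WWW::Mechanize::Link object per link on the page
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}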

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.

Solution 2

Have a look at HTML::LinkExtractor and HTML::LinkExtor. The latter ships with the HTML::Parser distribution.

HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides the URL you also get the link text.
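
As a rough sketch of the HTML::LinkExtractor approach, the snippet below parses an in-memory HTML string and collects the text and URL of each <a> link. The helper name extract_links is made up for illustration and the sample HTML is the one from the question. It assumes that a true third constructor argument strips markup from the _TEXT field, as the module's documentation describes, so check that against the version you have installed.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical helper: returns [text, url] pairs for every <a href> in $html.
sub extract_links {
    my ($html) = @_;
    # Arguments: callback (none), base URL (none), strip flag (true, so
    # _TEXT is assumed to hold plain text instead of the raw markup).
    my $lx = HTML::LinkExtractor->new(undef, undef, 1);
    $lx->parse(\$html);
    my @pairs;
    for my $link (@{ $lx->links }) {
        # Links from other tags (img, link, ...) are reported too; keep anchors.
        next unless $link->{tag} eq 'a' && defined $link->{href};
        my $text = defined $link->{_TEXT} ? $link->{_TEXT} : '';
        $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @pairs, [ $text, "$link->{href}" ];    # stringify a possible URI object
    }
    return @pairs;
}

my $html = <<'HTML';
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
HTML

printf "%s, %s\n", @$_ for extract_links($html);
```

Each pair is printed in the "text, URL" format the question asks for. To process a live page, fetch the HTML first (for example with LWP::Simple) and pass the content to the helper.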

Solution 3

If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):

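#!/usr/bin/perl
use strict;
use warnings;

# Require the page URL as the first argument.
if (@ARGV < 1) {
  print "$0: Need URL argument.\n";
  exit 1;
}

# Fetch the page with wget. The list form of open avoids passing the
# URL through a shell.
open(my $wget, '-|', 'wget', '-qO-', $ARGV[0])
  or die "$0: cannot run wget: $!\n";
my @content = <$wget>;
close($wget);

# Keep only the lines that contain an anchor tag with an href attribute.
my @links = grep(/<a\b[^>]*href=/i, @content);

foreach my $c (@links) {
  # Capture the double-quoted href value and the text up to the closing </a>.
  # Only the first link on each line is handled.
  next unless $c =~ /<a\b[^>]*href="([^"]+)"[^>]*>(.*?)<\/a>/i;
  my ($link, $title) = ($1, $2);
  print "$title, $link\n";
}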

There are likely a few things I did wrong here, but it works in the handful of test cases I tried after writing it. It doesn't account for things like <img> tags inside links, links that span several lines, or more than one link per line.

Solution 4

I like using pQuery for things like this...

use feature 'say';   # needed for say()
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);

Also check out the earlier Stack Overflow question "Emulation of lex like functionality in Perl or Python" for similar answers.

Solution 5

Another way to do this is to use XPath to query the parsed HTML. This is useful in more complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.

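  use HTML::TreeBuilder::XPath;

  # $c holds the HTML source of the page
  my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
  # Select every <area> inside the image map named "map1"
  my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
  while (my $node = $nodes->shift) {
    my $t = $node->attr('title');
  }
  $tree->delete;  # free the tree to avoid memory leaks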
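
To tie the XPath approach back to the question, here is a minimal sketch that selects every <a> element carrying an href and returns its text and URL. The helper name links_via_xpath, the sample HTML, and the "nav" class are made up for illustration. Passing an expression such as //div[@class='nav']//a[@href] as the second argument restricts the search to links inside that div.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns [text, url] pairs for each element matched
# by $xpath (default: every <a> that has an href attribute).
sub links_via_xpath {
    my ($html, $xpath) = @_;
    $xpath = '//a[@href]' unless defined $xpath;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    # In list context findnodes returns the matching elements directly.
    my @pairs = map { [ $_->as_text, $_->attr('href') ] } $tree->findnodes($xpath);
    $tree->delete;    # release the parse tree once the strings are copied out
    return @pairs;
}

my $html = <<'HTML';
<div class="nav"><a href="http://www.google.com">Google</a></div>
<a href="http://www.apple.com">Apple</a>
HTML

# All links on the page, in document order.
printf "%s, %s\n", @$_ for links_via_xpath($html);

# Only the links inside the div with class "nav".
printf "%s, %s\n", @$_ for links_via_xpath($html, q{//div[@class='nav']//a[@href]});
```

The tree is deleted inside the helper, which is safe because only plain strings are returned to the caller.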
Author: Admin

Updated on April 22, 2020

Comments

  • Admin, about 4 years ago

    I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

    If the page contained these links:

    <a href="http://www.google.com">Google</a>
    
    <a href="http://www.apple.com">Apple</a>
    

    The output would be:

    Google, http://www.google.com
    Apple, http://www.apple.com
    

    What is the best way to do this in Perl?

  • cjm, over 15 years ago
    Unfortunately, HTML::LinkExtor can't give you the text inside the <a> tag, which he says he's interested in. It only tells you the tag name and its attributes.
  • cjm, over 15 years ago
    I took the liberty of changing the print statement to include the link text, as requested by melling.
  • Susheel Javadi, over 13 years ago
    Also, add a $tree->delete to avoid memory leaks.
  • run, over 11 years ago
    You are the master; you saved me a lot of time. Thanks a ton.
  • Yaakov Belch, almost 10 years ago
    @cjm: I added a link to HTML::LinkExtractor which produces the link text in addition to the URLs.