How can I extract URL and link text from HTML in Perl?

html perl parsing url cpan

32,008

Solution 1

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you and then give you easy-to-work-with lists of URLs.

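use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );    # $some_url holds the address of the page to scan
my @links = $mech->links();
for my $link ( @links ) {
    # text() may be undefined for links that have no text
    printf "%s, %s\n", $link->text // '', $link->url;
}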

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.

Solution 2

Have a look at HTML::LinkExtractor and HTML::LinkExtor. HTML::LinkExtor is part of the HTML::Parser distribution; HTML::LinkExtractor is a separate CPAN module.

HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides the URL you also get the link text.
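
As a rough illustration of getting both the URL and the link text from a parser in the same family, here is a minimal sketch using HTML::TokeParser, which ships in the HTML::Parser distribution. Note that this is a different module from the two named above, and that the sample HTML and the helper name `extract_links` are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input (an assumption for illustration); in practice this is the fetched page.
my $html = <<'HTML';
<p><a href="http://www.google.com">Google</a></p>
<p><a href="http://www.apple.com">Apple</a></p>
HTML

# Hypothetical helper: returns [text, url] pairs for every <a> tag that has an href.
sub extract_links {
    my ($source) = @_;
    my $parser = HTML::TokeParser->new(\$source)
        or die "Cannot parse HTML: $!";
    my @found;
    while (my $tag = $parser->get_tag('a')) {
        my $url = $tag->[1]{href};                     # element 1 is the attribute hash
        next unless defined $url;                      # skip anchors without an href
        my $text = $parser->get_trimmed_text('/a');    # text up to the closing </a>
        push @found, [$text, $url];
    }
    return @found;
}

print "$_->[0], $_->[1]\n" for extract_links($html);
```

Because the parser tokenizes the markup, this copes with links that span several lines or carry extra attributes, which line-based pattern matching tends to miss.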

Solution 3

If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):

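#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV < 1) {
  print "$0: Need URL argument.\n";
  exit 1;
}

# Fetch the page with wget; the list form of open keeps the URL away from the shell.
open(my $wget, '-|', 'wget', '-qO-', $ARGV[0]) or die "$0: cannot run wget: $!\n";
my $content = do { local $/; <$wget> } // '';
close($wget);

# Match each <a ... href="...">text</a>, even when it spans several lines.
while ($content =~ /<a\b[^>]*?\shref="([^"]*)"[^>]*>(.*?)<\/a>/gis) {
  my ($link, $title) = ($1, $2);
  print "$title, $link\n";
}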

There are likely a few things I did wrong here, but it works in the handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc.).

Solution 4

I like using pQuery for things like this...

use feature 'say';    # say() is not enabled by default
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);

Also check out the earlier Stack Overflow question "Emulation of lex like functionality in Perl or Python" for similar answers.

Solution 5

Another way to do this is to use XPath to query the parsed HTML. This is useful in more complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.

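  use HTML::TreeBuilder::XPath;

  # $c holds the HTML source of the page
  my $tree=HTML::TreeBuilder::XPath->new_from_content($c);
  my $nodes=$tree->findnodes(q{//map[@name='map1']/area});
  while (my $node=$nodes->shift) {
    my $t=$node->attr('title');
  }
  $tree->delete;   # free the tree when done to avoid memory leaks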
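
The "links in a div with a specific class" case mentioned above could look roughly like the sketch below. The class name `nav`, the sample HTML, and the helper name `links_in_div` are assumptions for illustration. The XPath test is an exact match on the whole class attribute, and the class name is interpolated into the query as-is, so it is assumed to contain no quotes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample input (assumed for illustration): only the first link sits inside div.nav.
my $html = <<'HTML';
<div class="nav"><a href="http://www.google.com">Google</a></div>
<div class="footer"><a href="http://www.apple.com">Apple</a></div>
HTML

# Hypothetical helper: returns [text, url] pairs for links inside <div class="$class">.
sub links_in_div {
    my ($source, $class) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($source);
    # In list context findnodes returns the matching element nodes.
    my @found = map { [ $_->as_text, $_->attr('href') ] }
        $tree->findnodes(qq{//div[\@class="$class"]//a[\@href]});
    $tree->delete;    # release the tree to avoid memory leaks
    return @found;
}

print "$_->[0], $_->[1]\n" for links_in_div($html, 'nav');
```

The text and URL are copied into plain strings before the tree is deleted, so the returned list stays valid after cleanup.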

Author by

Admin

Updated on April 22, 2020

Comments

Admin about 4 years
I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:
```
<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>
```
The output would be:
```
Google, http://www.google.com
Apple, http://www.apple.com
```
What is the best way to do this in Perl?

cjm over 15 years

Unfortunately, HTML::LinkExtor can't give you the text inside the <a> tag, which he says he's interested in. It only tells you the tag name and its attributes.

cjm over 15 years

I took the liberty of changing the print statement to include the link text, as requested by melling.

Susheel Javadi over 13 years

Also, add a $tree->delete to avoid memory leaks.

run over 11 years

you are the master, you saved lot of time for me..thanks a ton.

Yaakov Belch almost 10 years

@cjm: I added a link to HTML::LinkExtractor which produces the link text in addition to the URLs.