How can I extract URL and link text from HTML in Perl?
Solution 1
Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work with lists of URLs.
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
printf "%s, %s\n", $link->text, $link->url;
}
Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.
Mech is basically a browser in an object.
Solution 2
Have a look at HTML::LinkExtractor and HTML::LinkExtor, part of the HTML::Parser package.
HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
Solution 3
If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):
#!/usr/bin/perl
if($#ARGV < 0) {
print "$0: Need URL argument.\n";
exit 1;
}
my @content = split(/\n/,`wget -qO- $ARGV[0]`);
my @links = grep(/<a.*href=.*>/,@content);
foreach my $c (@links){
$c =~ /<a.*href="([\s\S]+?)".*>/;
$link = $1;
$c =~ /<a.*href.*>([\s\S]+?)<\/a>/;
$title = $1;
print "$title, $link\n";
}
There's likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc).
Solution 4
I like using pQuery for things like this...
use pQuery;
pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
sub {
say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
}
);
Also checkout this previous stackoverflow.com question Emulation of lex like functionality in Perl or Python for similar answers.
Solution 5
Another way to do this is to use XPath to query parsed HTML. It is needed in complex cases, like extract all links in div with specific class. Use HTML::TreeBuilder::XPath for this.
my $tree=HTML::TreeBuilder::XPath->new_from_content($c);
my $nodes=$tree->findnodes(q{//map[@name='map1']/area});
while (my $node=$nodes->shift) {
my $t=$node->attr('title');
}
Admin
Updated on April 22, 2020Comments
-
Admin about 4 years
I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.
If the page contained these links:
<a href="http://www.google.com">Google</a> <a href="http://www.apple.com">Apple</a>
The output would be:
Google, http://www.google.com Apple, http://www.apple.com
What is the best way to do this in Perl?
-
cjm over 15 yearsUnfortunately, HTML::LinkExtor can't give you the text inside the <a> tag, which he says he's interested in. It only tells you the tag name and its attributes.
-
cjm over 15 yearsI took the liberty of changing the print statement to include the link text, as requested by melling.
-
Susheel Javadi over 13 yearsAlso, add a $tree->delete to avoid memory leaks.
-
run over 11 yearsyou are the master, you saved lot of time for me..thanks a ton.
-
Yaakov Belch almost 10 years@cjm: I added a link to HTML::LinkExtractor which produces the link text in addition to the URLs.