How do you parse a web page and extract all the href links?


Solution 1

Assuming well-formed XHTML, slurp the XML, collect all the tags, find the 'a' tags, and print out the href and the text.

input = """<html><body>
<a href = "http://www.hjsoft.com/">John</a>
<a href = "http://www.google.com/">Google</a>
<a href = "http://www.stackoverflow.com/">StackOverflow</a>
</body></html>"""

doc = new XmlSlurper().parseText(input)
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}

Solution 2

A quick Google search turned up a nice-looking possibility: TagSoup.
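TagSoup is a SAX parser that tolerates messy real-world HTML, and `XmlSlurper` accepts any SAX `XMLReader` in its constructor, so the two combine naturally. A sketch, assuming the `org.ccil.cowan.tagsoup:tagsoup:1.2.1` artifact from Maven Central and some deliberately sloppy sample markup:

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser

// TagSoup tolerates real-world HTML: unquoted attributes, unclosed tags, etc.
def sloppyHtml = '''<html><body>
<a href=one.html>One<br>
<a href=two.html>Two
</body></html>'''

// Hand TagSoup's SAX parser to XmlSlurper so it can slurp non-XHTML pages
def page = new XmlSlurper(new Parser()).parseText(sloppyHtml)
page.depthFirst().findAll { it.name() == 'a' }.each {
    println "${it.text().trim()}, ${it.@href}"
}
```

The same slurper works against a live page via `parse(new URL(...).openStream())` instead of `parseText`.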

Solution 3

I don't know Java, but I think XPath is far better than classic regular expressions for getting one (or more) HTML elements.

It is also easier to write and to read.

<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>

With the HTML above, the expression "/html/body/a" selects all the a elements, and "/html/body/a/@href" selects just their href attributes.

Here's a good step-by-step tutorial: http://www.zvon.org/xxl/XPathTutorial/General/examples.html
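Groovy's GPath doesn't evaluate XPath strings directly, but the JDK's built-in `javax.xml.xpath` API does; a sketch of the expression above run from Groovy against the sample document (well-formed markup assumed, since this uses a plain DOM parser):

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

def html = '''<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>'''

// DOM-parse the document, then evaluate the XPath against it
def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(html.getBytes('UTF-8')))
def nodes = XPathFactory.newInstance().newXPath()
        .evaluate('/html/body/a', doc, XPathConstants.NODESET)

// Collect "text, href" pairs from the matched <a> elements
def links = (0..<nodes.length).collect { i ->
    def a = nodes.item(i)
    "${a.textContent}, ${a.attributes.getNamedItem('href').textContent}".toString()
}
links.each { println it }
```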

Solution 4

Use XmlSlurper to parse the HTML as an XML document. Then use the find method with an appropriate closure to select the a tags, and call the list method on the resulting GPathResult to get a plain list of nodes. The link text is then available as children of each GPathResult.
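One way to read that description as code, using the sample links from the question (the GPath shortcut `doc.body.a` stands in for an explicit find, and `list()` converts the GPathResult to a `List`):

```groovy
def input = '''<html><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>'''

def doc = new XmlSlurper().parseText(input)
// GPath navigation: every <a> child of <body>, as a plain List of nodes
def links = doc.body.a.list()
// Each node exposes its text via text() and its attributes via .@name
def pairs = links.collect { "${it.text()}, ${it.@href}".toString() }
pairs.each { println it }
```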

Updated on June 13, 2022

Comments

  • Admin (almost 2 years ago)

    I want to parse a web page in Groovy and extract all of the href links and the associated text with it.

    If the page contained these links:

    <a href="http://www.google.com">Google</a><br />
    <a href="http://www.apple.com">Apple</a>
    

    the output would be:

    Google, http://www.google.com
    Apple, http://www.apple.com
    

    I'm looking for a Groovy answer, AKA the easy way!