How to extract HTML links and text using Nokogiri (and XPATH and CSS)

This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.

Here are some common operations you might do when parsing links in HTML, shown in both CSS and XPath syntax.

Starting with this snippet:

require 'rubygems'   # only needed on Ruby 1.8; later versions load RubyGems automatically
require 'nokogiri'

html = <<HTML
<div id="block1">
    <a href="http://google.com">link1</a>
</div>
<div id="block2">
    <a href="http://stackoverflow.com">link2</a>
    <a id="tips">just a bookmark</a>
</div>
HTML

doc = Nokogiri::HTML(html)

extracting all the links

We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:

nodeset = doc.xpath('//a')      # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact  # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a')          # Get all anchors via css
nodeset.map {|element| element["href"]}.compact  # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the .compact is necessary because the search for the <a> element returns the "just a bookmark" element in addition to the others.

But we can use a more refined search to find just the elements that contain an href attribute:

attrs = doc.xpath('//a/@href')  # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value}   # => ["http://google.com", "http://stackoverflow.com"]

nodeset = doc.css('a[href]')    # Get anchors w href attribute via css
nodeset.map {|element| element["href"]}  # => ["http://google.com", "http://stackoverflow.com"]

finding a specific link

To find a link within the <div id="block2">, scope the search to that element:

nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"

nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use at_xpath or at_css instead:

attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value          # => "http://stackoverflow.com"

element = doc.at_css('div#block2 a[href]')
element['href']        # => "http://stackoverflow.com"

find a link from associated text

What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:

element = doc.at_xpath('//a[text()="link2"]')
element["href"]     # => "http://stackoverflow.com"

element = doc.at_css('a:contains("link2")')
element["href"]     # => "http://stackoverflow.com"

find text from a link

For completeness, here's how you'd get the text associated with a particular link:

element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text     # => "link2"

element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text     # => "link2"

useful references

In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:

Author by

fearless_fool

Updated on June 18, 2022

Comments

  • fearless_fool
    fearless_fool almost 2 years

    (Update: This answer is written from the point of view of Nokogiri, but it's also useful if you're looking for the XPATH or CSS syntax for specific queries.)

    I love Nokogiri -- it's a wonderful tool for extracting elements from XML and HTML documents. Although the online examples are good, they mostly show how to manipulate XML documents.

    How can you extract links and link text from HTML using Nokogiri?

  • JanuskaE
    JanuskaE over 5 years
    This is the best Nokogiri post ever.