How to extract HTML links and text using Nokogiri (and XPATH and CSS)
This is a mini-example originally written in response to "Getting attribute's value in Nokogiri to extract link URLs", extracted here in Community Wiki style for easy reference.
Here are some common operations you might do when parsing links in HTML, shown in both css and xpath syntax.
Starting with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
extracting all the links
We can use xpath or css to find all the <a> elements and then keep only the ones that have an href attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
In the above cases, the .compact is necessary because the search for <a> elements also returns the "just a bookmark" element, which has no href attribute, so element["href"] returns nil for it.
But we can use a more refined search to find just the elements that contain an href attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
finding a specific link
To find a link within the <div id="block2">:
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath or at_css instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
find a link from associated text
What if you know the text associated with a link and want to find its URL? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
find text from a link
For completeness, here's how you'd get the text associated with a particular link:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
useful references
In addition to the extensive Nokogiri documentation, I came across some useful links while writing this up:
- a handy Nokogiri cheat sheet
- a tutorial on parsing HTML with Nokogiri
- interactively test CSS selector queries
fearless_fool
Updated on June 18, 2022
Comments
-
fearless_fool almost 2 years
(Update: This answer is written from the point of view of Nokogiri, but it's also useful if you're looking for the XPATH or CSS syntax for specific queries.)
I love Nokogiri -- it's a wonderful tool for extracting elements from XML and HTML documents. Although the online examples are good, they mostly show how to manipulate XML documents.
How can you extract links and link text from HTML using Nokogiri?
-
JanuskaE over 5 years
This is the best Nokogiri post ever.