Scraping/Parsing Google search results in Ruby

14,878

Solution 1

This should be fairly simple. Have a look at the "Screen Scraping with ScrAPI" screencast by Ryan Bates. You can also do without a dedicated scraping library and just stick to something like Nokogiri.


From Nokogiri's documentation:

require 'nokogiri'
require 'open-uri'

# Get a Nokogiri::HTML::Document for the page we’re interested in...
# (URI.open is used because Kernel#open no longer fetches URLs on Ruby 3.0+.)

doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))

# Do funky things with it using Nokogiri::XML::Node methods...

####
# Search for nodes by css
doc.css('h3.r a.l').each do |link|
  puts link.content
end

####
# Search for nodes by xpath
doc.xpath('//h3/a[@class="l"]').each do |link|
  puts link.content
end

####
# Or mix and match.
doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
  puts link.content
end

Solution 2

I'm unclear as to why you want to be screen scraping in the first place. Perhaps the REST search API would be more appropriate? It will return the results in JSON format, which will be much easier to parse, and save on bandwidth.

For example, if your search was 'foo bar', you could just send a GET request to http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=foo+bar and handle the response.
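As a rough sketch of that request/response flow, the Ruby below builds the GET URL for a query and pulls titles and links out of a JSON body. The helper names (`build_search_uri`, `extract_results`) are made up for illustration, and the response layout (`responseData` → `results`, with `titleNoFormatting` and `url` keys) is an assumption about what this endpoint returned; check it against a real response, if the service still answers at all.

```ruby
require 'json'
require 'uri'

# Endpoint quoted in the answer above.
SEARCH_ENDPOINT = 'http://ajax.googleapis.com/ajax/services/search/web'

# Build the GET URL for a query (hypothetical helper).
# URI.encode_www_form encodes the space in 'foo bar' as '+'.
def build_search_uri(query)
  uri = URI(SEARCH_ENDPOINT)
  uri.query = URI.encode_www_form(v: '1.0', q: query)
  uri
end

# Extract title/url pairs from a JSON response body (hypothetical helper).
# ASSUMPTION: results sit under responseData -> results, and each entry
# carries 'titleNoFormatting' and 'url' keys.
def extract_results(body)
  data = JSON.parse(body)
  results = data.dig('responseData', 'results') || []
  results.map { |r| { title: r['titleNoFormatting'], url: r['url'] } }
end
```

To run it against the live endpoint you would `require 'net/http'` and pass `Net::HTTP.get(build_search_uri('foo bar'))` to `extract_results`. Keeping URL construction and parsing in separate methods means the parsing half can be exercised on a saved response without touching the network.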

For more information, see "Google Search REST API" or Google's developer page.

Author by

Admin

Updated on July 02, 2022

Comments

  • Admin
    Admin almost 2 years

    Assume I have the entire HTML of a Google search results page. Does anyone know of any existing code (Ruby?) to scrape/parse the first page of Google search results? Ideally it would handle the Shopping Results and Video Results sections that can spring up anywhere.

    If not, what's the best Ruby-based tool for screenscraping in general?

    To clarify: I'm aware that it's difficult or impossible to get Google search results programmatically through an API, and that simply cURLing results pages has a lot of issues. There's consensus on both of these points here on Stack Overflow. My question is different.