How do I parse an HTML table with Nokogiri?

html ruby nokogiri mechanize html-table

28,410

```ruby
#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]
```


Author by

Radek

Updated on October 18, 2020

Comments

Radek over 3 years
I installed Ruby and Mechanize. It seems to me that it is possible to do what I want with Nokogiri, but I do not know how to do it.

What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but delete some text and tag attributes. I want to get some details per thread like: Title, Author, Date, Time, Replies, and Views.

Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The name will always be the same (I hope). Can I use the tbody and the name in the code?
```
<table >
  <tbody>
    <tr>   </tr>
  </tbody>
  
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
          06 Jan 2010 <span class="time">23:35</span><br />
          by <a href="member.php?find=lastposter&amp;t=230708">shane943</a> 
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>

  </tbody>
</table>
```
kejadlen over 14 years

I think the css equivalent would be doc.css('tbody#threadbits_forum_251 tr'), but I haven't actually tested that in code...
Wayne Conrad over 14 years

@Kejadlen, I replaced the doc.xpath(...) call with your doc.css call, and it worked great.
Radek over 14 years

Could somebody explain the syntax to me? Thank you in advance.
Wayne Conrad over 14 years

What's got you stumped? Is it the Ruby syntax, the xpath syntax, or both?
Radek over 14 years

Hi Wayne, I am a Ruby beginner. First of all, I installed mechanize, and it was said that it uses nokogiri to parse, so I should be able to use nokogiri's HTML methods. I cannot make it work with a setup like that. Do I have to install nokogiri separately? It seems to me that I already have it installed. doc = Nokogiri::XML(f) gives me an error: ./nokogiri.rb:7: uninitialized constant Nokogiri (NameError). And to be honest, I did not understand the xpath either. //table/tbody[@id="threadbits_forum_251"]/tr is like magic from a different world for me. I'd say that it means search for table and tbody where id=xxx, but why /tr?
Radek over 14 years

And why does it start with //? I cannot find any good (good enough for ME) documentation on that...
Wayne Conrad over 14 years

Yes, you already have nokogiri. See stackoverflow.com/questions/2060247/… for an example using mechanize. That example doesn't directly use nokogiri, except on the commented-out line to print the fetched html. But nokogiri is there inside mechanize if you need it (just call page.parser). The xpath you quoted means "get me any table, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251."
Radek over 14 years

@Wayne, thank you so much. I updated the code following your other example and it is working very nicely now. I still have a few questions. The most important one is whether you could suggest any documentation for me. The next one is why there is /tr at the end of the xpath you nicely explained to me. I also want to extract the URL of the post. I tried [:url, 'td[3]/div[1]/a'], [:url, 'td[3]/div[1]/a href/text()'], [:url, 'td[3]/div[1]/a/href/text()'], [:url, 'td[3]/div[1]/a/href'], and nothing worked. Where can I learn how to extract href, id, alt, src, etc.? Thank you.
Radek over 14 years

@Wayne, another question: I want to add some info from the post itself, so I have to follow the link to the post and add that info to the detail object. Where in your code can I add such code? I hope I am not asking too much. Could you also explain the code after details? Thank you.
Radek over 14 years

The forum I use to learn mechanize/nokogiri/parsing is vbulletin.org/forum/forumdisplay.php?f=251
Wayne Conrad over 14 years

Radek, these are all great questions. What would you say to creating more SO questions? That way you'll get more people's answers.
Radek over 14 years

@Wayne Conrad: Wayne, can I ask why you use an array of hashes to store the data? Why not a hash of hashes or an object? Thank you.
Wayne Conrad over 14 years

Mostly, because an array of hashes was the simplest thing that could possibly work, making for a clearer example. Also, and I don't know if this matters for you, in Ruby < 1.9, hashes don't have a well-defined order so you lose the original order of the rows.
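
As a rough illustration of that trade-off, the plain-Ruby sketch below keeps rows in an array and builds a lookup hash only when one is needed. The second row and the by_title name are invented for the example.

```ruby
# Rows as an array of hashes, in the order they appeared in the table.
details = [
  { :title => 'Vb4 Gold Released', :name => 'Paul M',   :views => '1,320' },
  { :title => 'Another thread',    :name => 'shane943', :views => '12' }, # hypothetical row
]

# The array preserves row order on every Ruby version.
titles_in_order = details.map { |d| d[:title] }

# If lookup by key is needed, an index can be derived from the array.
by_title = {}
details.each { |d| by_title[d[:title]] = d }

puts titles_in_order.inspect
puts by_title['Vb4 Gold Released'][:name]
```

Keeping the array as the primary structure and deriving the hash from it gives ordered iteration and keyed lookup without depending on hash ordering.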