How do I parse an HTML table with Nokogiri?

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]
Author by

Radek

Updated on October 18, 2020

Comments

  • Radek
    Radek over 3 years

    I installed Ruby and Mechanize. It seems to me that it is possible in Nokogiri to do what I want to do but I do not know how to do it.

    What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but delete some text and tag attributes. I want to get some details per thread like: Title, Author, Date, Time, Replies, and Views.

    Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and its id in the code?

    <table >
      <tbody>
        <tr>  <!-- table header --> </tr>
      </tbody>
      <!-- show threads -->
      <tbody id="threadbits_forum_251">
        <tr>
          <td></td>
          <td></td>
          <td>
            <div>
              <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
            </div>
            <div>
              <span><a>Paul M</a></span>
            </div>
          </td>
          <td>
              06 Jan 2010 <span class="time">23:35</span><br />
              by <a href="member.php?find=lastposter&amp;t=230708">shane943</a>
          </td>
          <td><a href="#">24</a></td>
          <td>1,320</td>
        </tr>
    
      </tbody>
    </table>
    
  • kejadlen
    kejadlen over 14 years
    I think the css equivalent would be doc.css('tbody#threadbits_forum_251 tr'), but I haven't actually tested that in code...
  • Wayne Conrad
    Wayne Conrad over 14 years
    @Kejadlen, I replaced the doc.xpath(...) call with your doc.css call, and it worked great.
  • Radek
    Radek over 14 years
    is it possible that somebody would explain the syntax to me? thank you in advance.
  • Wayne Conrad
    Wayne Conrad over 14 years
    What's got you stumped? Is it the Ruby syntax, the xpath syntax, or both?
  • Radek
    Radek over 14 years
    hi Wayne, I am a Ruby baby. First of all ... I installed mechanize and it was said that it uses nokogiri to parse, so I can use nokogiri's HTML methods. I cannot make it work with a setup like that. Do I have to install nokogiri separately? But it seems to me that I have it already installed. doc = Nokogiri::XML(f) gives me an error ./nokogiri.rb:7: uninitialized constant Nokogiri (NameError). And then to be honest I did not understand xpath either. //table/tbody[@id="threadbits_forum_251"]/tr is like magic from a different world for me. I'd say that it means search for table & tbody where id=xxx, but why /tr?
  • Radek
    Radek over 14 years
    and why does it start with // ? I cannot find any good (good enough for ME) documentation on that...
  • Wayne Conrad
    Wayne Conrad over 14 years
    Yes, you already have nokogiri. See stackoverflow.com/questions/2060247/… for an example using mechanize. That example doesn't directly use nokogiri, except on the commented-out line to print the fetched html. But nokogiri is there inside mechanize if you need it (just call page.parser). The xpath you quoted means "get me any table, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251."
  • Radek
    Radek over 14 years
    @Wayne, thank you sooooo much. I updated the code following your other example and it is working very nicely now. I still have a few questions. The most important is whether you could suggest any documentation for me. The next one is why there is /tr at the end of the xpath you nicely explained to me. I want to extract the url of the post too. I tried [:url, 'td[3]/div[1]/a'], [:url, 'td[3]/div[1]/a href/text()'], [:url, 'td[3]/div[1]/a/href/text()'], [:url, 'td[3]/div[1]/a/href'], and nothing worked. Where can I learn how to extract href, id, alt, src etc.? Thank you
  • Radek
    Radek over 14 years
    @Wayne, another question is that I want to add some info from the post itself, so I have to click it and add the info to the detail object. Where in your code can I add such code? I hope I am not asking too much. Could you explain the code after details? Thank you
  • Radek
    Radek over 14 years
    the forum I use to learn mechanize/nokogiri/parsing is vbulletin.org/forum/forumdisplay.php?f=251
  • Wayne Conrad
    Wayne Conrad over 14 years
    Radek, These are all great questions. What would you say to creating more SO questions? That way you'll get more people's answers.
  • Radek
    Radek over 14 years
    @Wayne Conrad: Wayne, can I ask why you use an array of hashes to store the data? Why not a hash of hashes or an object? Thank you
  • Wayne Conrad
    Wayne Conrad over 14 years
    Mostly, because an array of hashes was the simplest thing that could possibly work, making for a clearer example. Also, and I don't know if this matters for you, in Ruby < 1.9, hashes don't have a well-defined order so you lose the original order of the rows.