How to get the raw HTML of a node

Solution 1

Use node.to_s, or just node:

nodes = doc.search("//tr[@class='tableX']")
nodes.each do |node|
  puts node.to_s
  puts '-' * 40
end

With additional sanity-check HTML (yours, doubled, with a tr of a different class in the middle) I get:

<tr class="tableX">
<td align="center">
<font size="2"><a href="javascript:open('9746')">9746</a></font> 
            </td>
            <td align="center"><font size="2">2012-06-26</font></td>
</tr>
----------------------------------------
<tr class="tableX">
<td align="center">
<font size="2"><a href="javascript:open('9746')">9746</a></font> 
            </td>
            <td align="center"><font size="2">2012-06-26</font></td>
</tr>
----------------------------------------

Solution 2

Call children.to_html on each node to get only the HTML inside it:

require 'nokogiri'

doc = Nokogiri::HTML(html)

nodes = doc.search("//tr[@class='tableX']")

nodes.each do |node|
  puts node.children.to_html # or node.inner_html
end

Solution 3

Use .children, which returns a NodeSet of the nodes inside the selected element; calling to_html on that set gives their HTML as a string.

So having this code:

<tr class="container">
  <td>value</td>
</tr>

And using this process:

data = Nokogiri::HTML(html)
data.css("tr.container").children.to_html

Will return this HTML (whitespace aside):

<td>value</td>

I guess my answer is too late, but that's the exact code you need.

Author by

Kyaw Siesein

Software Engineer, rails, ruby, django,web2py,puppet, python,linux

Updated on September 15, 2022

Comments

  • Kyaw Siesein over 1 year

    I am using Nokogiri to analyze some HTML, but I don't know how to get the raw HTML inside a node.

    For example, given:

    <tr class="tableX">
      <td align="center">
        <font size="2"><a href="javascript:open('9746')">9746</a></font>
      </td>
      <td align="center">
        <font size="2">2012-06-26</font>
      </td>
    </tr>
    

    When I use this XPath selector:

    doc = Nokogiri::HTML(html)
    
    nodes = doc.search("//tr[@class='tableX']")
    
    nodes.each do |node|
       node.text # or node.content
    end
    

    The results from node.text and node.content are:

    9746
    2012-06-26
    

    I want to get all raw HTML inside the tr block, which, in this case, is:

    <td align="center">
      <font size="2"><a href="javascript:open('9746')">9746</a></font>
    </td>
    <td align="center">
      <font size="2">2012-06-26</font>
    </td>
    

    What's the proper way to do that?

    • PJP almost 4 years
      Node's to_html will give you the original HTML.
  • PJP almost 4 years
    children doesn't return raw HTML; it only returns the NodeSet containing the children of the parent node. The OP wants the raw HTML, and Node#to_html or its aliases provide that.