Trying to get content inside CDATA tags in an XML file using Nokogiri

Solution 1

You're trying to parse XML using Nokogiri's HTML parser. If node were from the XML parser, then r would be nil, since XML is case sensitive and './/newslinetext' would not match NewsLineText. Your r is not nil, so you're using the HTML parser, which is case insensitive.

Use Nokogiri's XML parser and you will get things like this:

>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n  ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n  Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly     creme brulee.\n  ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n  \n  Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly     creme brulee.\n  \n"

and you'll be able to get at the CDATA through r.text or r.children.

Solution 2

Ah I see. What @mu said is correct. But to get at the cdata directly, maybe:

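require 'nokogiri'

xml = <<EOF
<NewsLineText>
  <![CDATA[
  Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly     creme brulee.
  ]]>
</NewsLineText>
EOF
# Parse with the XML parser so the element name keeps its case.
doc = Nokogiri::XML(xml)
# Pick the CDATA node out of the element's children (the others are whitespace text nodes).
cdata = doc.search('NewsLineText').children.find { |e| e.cdata? }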
Author by

Aaron Thomas

Improving my rails skills.

Updated on June 03, 2022

Comments

  • Aaron Thomas
    Aaron Thomas about 2 years

    I have seen several posts on this, but nothing has worked so far. I am parsing XML fetched from a URL using Nokogiri on Rails 3 with Ruby 1.9.2.

    A snippet of the xml looks like this:

    <NewsLineText>
      <![CDATA[
      Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly     creme brulee.
      ]]>
    </NewsLineText>
    

    I am trying to parse this to get the text inside NewsLineText:

    r = node.at_xpath('.//newslinetext') if node.at_xpath('.//newslinetext')
    s = node.at_xpath('.//newslinetext').text if node.at_xpath('.//newslinetext')
    t = node.at_xpath('.//newslinetext').content if node.at_xpath('.//newslinetext')
    puts r
    puts s.blank? ? 'NOTHING' : s
    puts t.blank? ? 'NOTHING' : t
    

    What I get in return is

    <newslinetext></newslinetext>
    NOTHING
    NOTHING
    

    So I know my tags are named/spelled correctly to get at the newslinetext data, but the cdata text never shows up.

    What do I need to do with nokogiri to get this text?

  • Aaron Thomas
    Aaron Thomas about 12 years
    Bah. I was using the HTML parser with case-sensitive tag names, and it wasn't giving me any results. I couldn't figure out why, so I dropped everything to lowercase, which worked. Later I tried Nokogiri's XML parser, but still with lowercase names, and it returned no results. I suppose I should have tried the XML parser with case-sensitive names, and it would have worked for what I was trying to do. I will check this out and let you know the results.
  • Aaron Thomas
    Aaron Thomas about 12 years
    You were all correct. I was unintentionally using the HTML parser, which forced me to use lowercase. Then when I tried the XML parser, I got no results (because I was still using lowercase). After seeing the answers here, I realized my mistake and switched to the XML parser with case-sensitive names. Works perfectly. Thanks.
  • Alex
    Alex about 9 years
    nokogiri_doc_object.xpath("/root/element").children[0].text