Parse XML with Nokogiri

Solution 1

Here's how I'd rewrite your code:

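require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where the bare open(url) form no longer opens URLs.
xml = Nokogiri::XML(URI.open("http://www.kongregate.com/games_for_your_site.xml"))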
xml.xpath("//game").each do |game|
  %w[id title thumbnail category flash_file width height description instructions].each do |n|
    puts game.at(n)
  end
end

The problem in your code is that all the sub-tags are prefixed with //, which in XPath means "start at the root of the document and search downward for every element with that name." So, instead of searching only inside each game node, it searched the entire document for each of the listed tags, once for every game node. A relative path such as .//id, or a bare child name such as id, keeps the search inside the current node.

I recommend using CSS selectors over XPath because they are usually simpler and, as a result, easier to read. So, instead of xpath('//game'), I use search('game') in the code further down. (search accepts either a CSS selector or an XPath expression, as does at.)

If you want the text contained in the tags, change puts game.at(n) to:

puts game.at(n).text

To make the output more useful I'd do this:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where the bare open(url) form no longer opens URLs.
xml = Nokogiri::XML(URI.open('http://www.kongregate.com/games_for_your_site.xml'))
games = xml.search('game').map do |game|
  %w[
    id title thumbnail category flash_file width height description instructions
  ].each_with_object({}) do |n, o|
    o[n] = game.at(n).text
  end
end

require 'awesome_print'
puts games.size
ap games.first
ap games.last

Which results in:

395
{
              "id" => "160342",
           "title" => "Tricky Rick",
       "thumbnail" => "http://cdn3.kongregate.com/game_icons/0042/7180/KONG_icon250x200_site.png?21656-op",
        "category" => "Puzzle",
      "flash_file" => "http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf",
           "width" => "640",
          "height" => "480",
     "description" => "Help Rick to collect all the stolen fuel to refuel his spaceship and fly away from the planet. Use hammer, bombs, jetpack and other useful stuff to solve puzzles!\n",
    "instructions" => "WASD \\ Arrow Keys – move;\nS \\ Down Arrow – take\\release an object;\nCNTRL – interaction with objects: throw, hammer strike, invisibility mode;\nSPACE – interaction with elevators and fuel stations;\nEsc \\ P – pause;\n"
}
{
              "id" => "78",
           "title" => "rotaZion",
       "thumbnail" => "http://cdn2.kongregate.com/game_icons/0000/0115/pixtiz.rotazion_icon.jpg?8217-op",
        "category" => "Action",
      "flash_file" => "http://external.kongregate-games.com/gamez/0000/0078/live/embeddable_78.swf",
           "width" => "350",
          "height" => "350",
     "description" => "In rotaZion, you play with a bubble bar that you can’t stop rotating !\nCollect the bubbles and try to avoid the mines !\nCollect the different bonus to protect your bubble bar, makes the mines go slower or destroy all the mines !\nTry to beat 100.000 points ;)\n",
    "instructions" => "Move the bubble bar with the arrow keys !\nBubble = 500 Points !\nPixtiz sign = 5000 Points !\n"
}

Solution 2

You can try something like this. I would suggest creating an array of the element names inside game that you want and then iterating over them. I'm sure Nokogiri has a way to get all of the elements inside a given node, but this works:

# result is assumed to hold the XML as a String or IO.
xml = Nokogiri::XML(result)
xml.css("game").each do |inv|
  inv.css("title").each do |f|  # title or whatever else you want
    puts f.inner_html
  end
end
Author by

thebusiness11

Updated on June 04, 2022

Comments

  • thebusiness11
    thebusiness11 almost 2 years

    I'm having some trouble getting set up with Nokogiri, and its documentation is a little rough to get started with.

    I am trying to parse the XML file: http://www.kongregate.com/games_for_your_site.xml

    It returns multiple games inside a gameset, and each game has a title, description, and so on:

    <gameset>
      <game>
        <id>160342</id>
        <title>Tricky Rick</title>
        <thumbnail>
          http://cdn3.kongregate.com/game_icons/0042/7180/KONG_icon250x200_site.png?21656-op
        </thumbnail>
        <launch_date>2012-12-12</launch_date>
        <category>Puzzle</category>
        <flash_file>
          http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf
        </flash_file>
        <width>640</width>
        <height>480</height>
        <url>
          http://www.kongregate.com/games/tAMAS_Games/tricky-rick
        </url>
        <description>
          Help Rick to collect all the stolen fuel to refuel his spaceship and fly away from the planet. Use hammer, bombs, jetpack and other useful stuff to solve puzzles!
        </description>
        <instructions>
          WASD \ Arrow Keys &#8211; move; S \ Down Arrow &#8211; take\release an object; CNTRL &#8211; interaction with objects: throw, hammer strike, invisibility mode; SPACE &#8211; interaction with elevators and fuel stations; Esc \ P &#8211; pause;
        </instructions>
        <developer_name>tAMAS_Games</developer_name>
        <gameplays>24999</gameplays>
        <rating>3.43</rating>
      </game>
      <game>
        <id>160758</id>
        <title>Flying Cookie Quest</title>
        <thumbnail>
          http://cdn2.kongregate.com/game_icons/0042/8428/icon_cookiequest_kong_250x200_site.png?16578-op
        </thumbnail>
        <launch_date>2012-12-07</launch_date>
        <category>Action</category>
        <flash_file>
          http://external.kongregate-games.com/gamez/0016/0758/live/embeddable_160758.swf
        </flash_file>
        <width>640</width>
        <height>480</height>
        <url>
          http://www.kongregate.com/games/LongAnimals/flying-cookie-quest
        </url>
        <description>
          Launch Rocket Panda into the land of Cookies. With the help of low-flying sharks, hang-gliding sheep and Rocket Badger, can you defeat the all powerful Biscuit Head? Defeat All enemies of cookies in this launcher game.
        </description>
        <instructions>Use the mouse button!</instructions>
        <developer_name>LongAnimals</developer_name>
        <gameplays>168672</gameplays>
        <rating>3.67</rating>
      </game>
      <!-- ... remaining game elements omitted ... -->
    </gameset>
    

    From the documentation, I am using something like:

    require 'nokogiri'
    require 'open-uri'
    
    url = "http://www.kongregate.com/games_for_your_site.xml"
    xml = Nokogiri::XML(open(url))
    xml.xpath("//game").each do |node|
        puts node.xpath("//id")
        puts node.xpath("//title")
        puts node.xpath("//thumbnail")
        puts node.xpath("//category")
        puts node.xpath("//flash_file")
        puts node.xpath("//width")
        puts node.xpath("//height")
        puts node.xpath("//description")
        puts node.xpath("//instructions")
    end
    

    But it just returns endless data, and not grouped per game. Any help would be appreciated.

  • nikhil
    nikhil over 11 years
    Excellent answer. +1 for all the explanation along with the code.
  • PJP
    PJP over 11 years
    The XPath // fools everyone when they start working with it.
  • pguardiario
    pguardiario over 11 years
    inner_html is rarely useful. In this case you really want f.text, and since there's only one title per game, there's not much need for an each.
  • thebusiness11
    thebusiness11 over 11 years
    This is great, but the end goal is to store it into the database, one row for each game inside the game set. Can this happen out of this array?
  • PJP
    PJP over 11 years
    Easily. We do it all the time, but how is left for you to figure out. A hint is that each embedded hash is a separate row. If the keys don't map directly to the field names, you can create an array with the appropriate field names and zip that with the values of each hash, then cast that into a Hash using something like Hash[['foo','bar'].zip(hash.values)]. Also, some DBMSs can directly import XML, so parsing it might not be necessary. Import into a temporary table, drop the fields you don't need, then copy the resulting table into your production table.
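
The zip hint from the last comment can be sketched as follows. The column names (external_id, name, width_px) are hypothetical placeholders for whatever your table uses, and the game hash reuses values from the output shown earlier:

```ruby
# One parsed game, using values shown in the answer's output.
game = {
  'id'    => '160342',
  'title' => 'Tricky Rick',
  'width' => '640'
}

# Hypothetical column names; substitute the ones in your own schema.
columns = %w[external_id name width_px]

# Positional mapping: relies on the hash keeping insertion order,
# which Ruby guarantees.
row = Hash[columns.zip(game.values)]

# Explicit mapping: slightly longer, but independent of key order.
mapping = { 'id' => 'external_id', 'title' => 'name', 'width' => 'width_px' }
explicit_row = game.transform_keys { |key| mapping.fetch(key) }

p row
p row == explicit_row # => true
```

Each resulting hash corresponds to one database row. The explicit mapping is probably the safer choice if the set of parsed keys ever changes.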