Parse XML with Nokogiri
Solution 1
Here's how I'd rewrite your code:
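require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; on Ruby 3.0+ a bare open no longer fetches URLs.
xml = Nokogiri::XML(URI.open("http://www.kongregate.com/games_for_your_site.xml"))
xml.xpath("//game").each do |game|
  %w[id title thumbnail category flash_file width height description instructions].each do |n|
    puts game.at(n)
  end
end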
The problem in your code is that all the sub-tags are prefixed with //, which, in XPath-ese, means "start at the root node and search downwards for all tags with that name." So, instead of only searching inside each of the //game nodes, it searched the entire document for each of the listed tags, once for every //game node.
I recommend using CSS accessors over XPath, because they are (usually) simpler and easier to read as a result. So, instead of xpath('//game') I use search('game'). (search will take a CSS or XPath accessor, as will at.)
If you want the text contained in the tags, change puts game.at(n) to:
puts game.at(n).text
To make the output more useful I'd do this:
require 'nokogiri'
require 'open-uri'

# URI.open (from open-uri) is needed on Ruby 3.0+, where a bare open no longer fetches URLs.
xml = Nokogiri::XML(URI.open('http://www.kongregate.com/games_for_your_site.xml'))

# Build one hash per game, keyed by tag name.
games = xml.search('game').map do |game|
  %w[
    id title thumbnail category flash_file width height description instructions
  ].each_with_object({}) do |n, o|
    o[n] = game.at(n).text
  end
end

require 'awesome_print'
puts games.size
ap games.first
ap games.last
Which results in:
395
{
"id" => "160342",
"title" => "Tricky Rick",
"thumbnail" => "http://cdn3.kongregate.com/game_icons/0042/7180/KONG_icon250x200_site.png?21656-op",
"category" => "Puzzle",
"flash_file" => "http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf",
"width" => "640",
"height" => "480",
"description" => "Help Rick to collect all the stolen fuel to refuel his spaceship and fly away from the planet. Use hammer, bombs, jetpack and other useful stuff to solve puzzles!\n",
"instructions" => "WASD \\ Arrow Keys – move;\nS \\ Down Arrow – take\\release an object;\nCNTRL – interaction with objects: throw, hammer strike, invisibility mode;\nSPACE – interaction with elevators and fuel stations;\nEsc \\ P – pause;\n"
}
{
"id" => "78",
"title" => "rotaZion",
"thumbnail" => "http://cdn2.kongregate.com/game_icons/0000/0115/pixtiz.rotazion_icon.jpg?8217-op",
"category" => "Action",
"flash_file" => "http://external.kongregate-games.com/gamez/0000/0078/live/embeddable_78.swf",
"width" => "350",
"height" => "350",
"description" => "In rotaZion, you play with a bubble bar that you can’t stop rotating !\nCollect the bubbles and try to avoid the mines !\nCollect the different bonus to protect your bubble bar, makes the mines go slower or destroy all the mines !\nTry to beat 100.000 points ;)\n",
"instructions" => "Move the bubble bar with the arrow keys !\nBubble = 500 Points !\nPixtiz sign = 5000 Points !\n"
}
Solution 2
You can try something like this. I would suggest creating an array of the elements inside of game that you want and then iterating over them. I'm sure there's a way to get all of the elements inside the specified one in Nokogiri, but this works:
xml = Nokogiri::XML(result)
xml.css("game").each do |inv|
  inv.css("title").each do |f| # title or whatever else you want
    puts f.inner_html
  end
end
thebusiness11
Updated on June 04, 2022
Comments
-
thebusiness11 almost 2 years
Having some issues getting the proper setup for Nokogiri and their documentation is a little rough to get started with.
I am trying to parse the XML file: http://www.kongregate.com/games_for_your_site.xml
Which returns multiple games inside the gameset, and for each game it has a title, desc, etc....
<gameset>
  <game>
    <id>160342</id>
    <title>Tricky Rick</title>
    <thumbnail>
      http://cdn3.kongregate.com/game_icons/0042/7180/KONG_icon250x200_site.png?21656-op
    </thumbnail>
    <launch_date>2012-12-12</launch_date>
    <category>Puzzle</category>
    <flash_file>
      http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf
    </flash_file>
    <width>640</width>
    <height>480</height>
    <url>
      http://www.kongregate.com/games/tAMAS_Games/tricky-rick
    </url>
    <description>
      Help Rick to collect all the stolen fuel to refuel his spaceship and fly away from the planet. Use hammer, bombs, jetpack and other useful stuff to solve puzzles!
    </description>
    <instructions>
      WASD \ Arrow Keys – move;
      S \ Down Arrow – take\release an object;
      CNTRL – interaction with objects: throw, hammer strike, invisibility mode;
      SPACE – interaction with elevators and fuel stations;
      Esc \ P – pause;
    </instructions>
    <developer_name>tAMAS_Games</developer_name>
    <gameplays>24999</gameplays>
    <rating>3.43</rating>
  </game>
  <game>
    <id>160758</id>
    <title>Flying Cookie Quest</title>
    <thumbnail>
      http://cdn2.kongregate.com/game_icons/0042/8428/icon_cookiequest_kong_250x200_site.png?16578-op
    </thumbnail>
    <launch_date>2012-12-07</launch_date>
    <category>Action</category>
    <flash_file>
      http://external.kongregate-games.com/gamez/0016/0758/live/embeddable_160758.swf
    </flash_file>
    <width>640</width>
    <height>480</height>
    <url>
      http://www.kongregate.com/games/LongAnimals/flying-cookie-quest
    </url>
    <description>
      Launch Rocket Panda into the land of Cookies. With the help of low-flying sharks, hang-gliding sheep and Rocket Badger, can you defeat the all powerful Biscuit Head? Defeat All enemies of cookies in this launcher game.
    </description>
    <instructions>Use the mouse button!</instructions>
    <developer_name>LongAnimals</developer_name>
    <gameplays>168672</gameplays>
    <rating>3.67</rating>
  </game>
From the documentation, I am using something like:
require 'nokogiri'
require 'open-uri'

url = "http://www.kongregate.com/games_for_your_site.xml"
xml = Nokogiri::XML(open(url))
xml.xpath("//game").each do |node|
  puts node.xpath("//id")
  puts node.xpath("//title")
  puts node.xpath("//thumbnail")
  puts node.xpath("//category")
  puts node.xpath("//flash_file")
  puts node.xpath("//width")
  puts node.xpath("//height")
  puts node.xpath("//description")
  puts node.xpath("//instructions")
end
But it just returns endless data, and not grouped into a set per game. Any help would be appreciated.
-
nikhil over 11 years
Excellent answer. +1 for all the explanation along with the code.
-
PJP over 11 years
The XPath // fools everyone when they start working with it.
-
pguardiario over 11 years
inner_html is rarely useful. In this case you really want f.text, and since there's only one title per game, there's not much need for an each.
-
thebusiness11 over 11 years
This is great, but the end goal is to store it into the database, one row for each game inside the game set. Can this happen out of this array?
-
PJP over 11 years
Easily. We do it all the time, but how is left for you to figure out. A hint is that each embedded hash is a separate row. If the keys don't map directly to the field names, you can create an array with the appropriate field names and zip that with the values of each hash, then cast that into a Hash using something like Hash[['foo','bar'].zip(hash.values)]. Also, some DBMSs can directly import XML, so parsing it might not be necessary. Import into a temporary table, drop the fields you don't need, then copy the resulting table into your production table.
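The zip hint can be sketched as follows. The column names here (game_id, name, genre) are invented for illustration; they stand in for whatever the database table actually uses, and they must be listed in the same order as the hash's values. The helper name to_row is also hypothetical:

```ruby
# Hypothetical column names, in the same order as the keys of each game hash.
COLUMNS = %w[game_id name genre].freeze

# Pairs each column name with the corresponding value, giving one row hash.
def to_row(game, columns = COLUMNS)
  Hash[columns.zip(game.values)]
end

# One parsed game, shaped like the hashes built in Solution 1
# (trimmed to three keys for brevity).
game = { 'id' => '160342', 'title' => 'Tricky Rick', 'category' => 'Puzzle' }

p to_row(game)
```

Because Ruby hashes preserve insertion order, game.values comes back in the order the keys were added, which is what makes the positional pairing safe. Each resulting row hash can then be handed to whatever database layer is in use.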