Extract Links from a sitemap(xml)
Solution 1
You can use a Python script for this.
This script prints every link that starts with http:
import re

# Print every URL that appears as element text, i.e. between ">" and "<".
with open('sitemap.xml') as f:
    for line in f:
        for url in re.findall(r'>(https?://[^<]+)<', line):
            print(url)
In your case, the following script finds all the data wrapped in <loc> tags:
import re

# Print only the URLs wrapped in <loc>...</loc>.
with open('sitemap.xml') as f:
    for line in f:
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)
Here is a nice tool for experimenting with regular expressions if you are not familiar with them.
If you need to load a remote file, you can use the following code:
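import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

# Download the sitemap and print the URLs wrapped in <loc>...</loc>.
with urlopen('http://server.com/sitemap.xml') as f:
    for line in f:
        # The response yields bytes, so decode each line before matching.
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line.decode('utf-8')):
            print(url)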
Solution 2
If you're on a Linux box or something with the grep tool, you can just run:
grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
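If grep is not available (for example on Windows), the same pattern can be approximated in Python. This is only a sketch: the function name find_urls and the sample string are made up for illustration, and the character class is copied from the grep command above.

```python
import re

# Same idea as the grep pattern: a URL runs until the first space,
# double quote, parenthesis or angle bracket.
URL_PATTERN = re.compile(r'https?://[^ "()<>]*')

def find_urls(text):
    """Return every http/https URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)

# Hypothetical sample line, including "?" and "+" in the URL.
sample = '<url> <loc>http://domain.com/pag1?a=1+2</loc> </url>'
print(find_urls(sample))
```

To run it over a file, pass the file contents instead of the sample, for example find_urls(open('sitemap.xml').read()).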
Solution 3
This could be accomplished by a single sed command, which seems to be more solid than the grep solution:
sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile
(found at: linuxquestions.org)
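To see what the sed command does, here is a rough Python equivalent. It is a sketch under the same assumption the sed command makes, namely that each <loc> element sits on its own line; the function name extract_loc_lines and the sample data are hypothetical.

```python
import re

# Optional leading whitespace, then <loc>...</loc>; group 1 is the URL.
LOC_LINE = re.compile(r'\s*<loc>(.*)</loc>')

def extract_loc_lines(lines):
    """Drop lines without <loc>, then strip the tags from the rest."""
    result = []
    for line in lines:
        if '<loc>' not in line:
            continue  # corresponds to sed's  /<loc>/!d
        # corresponds to sed's  s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/
        result.append(LOC_LINE.sub(r'\1', line.rstrip('\n'), count=1))
    return result

# Hypothetical sample with one element per line.
sample = [
    '<url>\n',
    '  <loc>http://domain.com/pag1</loc>\n',
    '  <lastmod>2012-08-25</lastmod>\n',
    '</url>\n',
]
print(extract_loc_lines(sample))
```

Because .* is greedy, both this sketch and the sed command return too much when several <loc> elements share a single line.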
Solution 4
Using XSLT
, you can render it out with XPath
/url/loc
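The same element selection can be done without XSLT by using an XML parser. The sketch below uses Python's xml.etree.ElementTree rather than XSLT, and it assumes the usual sitemap layout in which the <url> entries sit inside a namespaced <urlset> root, so it matches on the local tag name instead of the literal path /url/loc. The function name loc_values and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

def loc_values(xml_text):
    """Return the text of every <loc> element, with or without a namespace."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter():
        # Namespaced tags look like '{namespace}loc'; keep only the local name.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'loc':
            values.append((elem.text or '').strip())
    return values

# Hypothetical sample in the standard sitemap namespace.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://domain.com/pag1</loc></url>
  <url><loc>http://domain.com/pag2</loc></url>
</urlset>"""
print(loc_values(sample))
```

Unlike the regex-based solutions, a parser does not depend on how the file is split into lines.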
Akshat Mittal
Updated on September 18, 2022
Comments
-
Akshat Mittal almost 2 years
Let's say I have a sitemap.xml file with this data:
<url>
  <loc>http://domain.com/pag1</loc>
  <lastmod>2012-08-25</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
</url>
<url>
  <loc>http://domain.com/pag2</loc>
  <lastmod>2012-08-25</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
</url>
<url>
  <loc>http://domain.com/pag3</loc>
  <lastmod>2012-08-25</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
</url>
I want to extract all the locations from it (the data between <loc> and </loc>). Sample output would look like:
http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3
How to do this?
-
Admin almost 12 years
What OS are you using?
-
Admin almost 12 years
Windows 7 Ultimate X64 / Windows 8 Pro X64 or Ubuntu 12.04 Linux.
-
Admin almost 12 years
Nice setup. Using Terminal on the Ubuntu box, my answer below will get you what you need.
-
Admin almost 12 years
You can also use any text editor that supports regular expressions, such as SublimeText2, to get all the data, or you can use Python; see my answer below.
-
slhck almost 12 years
Could you maybe expand your answer and show the XSLT instructions and the XPath queries needed?
-
Akshat Mittal almost 12 years
@slhck Exactly what I wanted to say. The answer should be more explanatory.
-
Akshat Mittal almost 12 years
This worked, but with a lot of mistakes (incomplete URLs).
-
Akshat Mittal almost 12 years
I read a bit more about this and finally got it working. Upvoting, but this is not really a good enough answer to be chosen.
-
Akshat Mittal almost 12 years
How do I load a remote file like http://server.com/sitemap.xml? I am not very familiar with Python.
-
Ishikawa Yoshi almost 12 years
You mean load it with Python?
-
Akshat Mittal almost 12 years
Yup. You used f = open('sitemap.xml','r') to open the file; how do I open a remote file on an HTTP server?
-
Ishikawa Yoshi almost 12 years
Did you import the re module?
-
Ishikawa Yoshi almost 12 years
-
bobmagoo almost 12 years
Weird, I just ran this over Google's sitemap.xml file and didn't see any issues. Which ones did it miss?
-
Akshat Mittal almost 12 years
This missed many URLs that contained "?" and "+".
-
trante almost 10 years
Thank you. For anybody who wants to save the output to a file:
grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml > links.txt
-
SmallChess about 9 years
+1 This is actually a very simple but powerful solution.
-
Baptiste Donaux about 8 years
Your solution works perfectly.
-
Łukasz Rysiak almost 8 years
For years I've been using regex etc. for this, but XSLT is so cool in this case :) For complete noobs in XSLT (like me), it'd be nice to add that the only thing you have to do is save this code as stylesheet.xsl and add a line to your XML document linking to the stylesheet:
<?xml-stylesheet type="text/xsl" version="1.0" href="stylesheet.xsl"?>
Then open your XML in a browser (it won't work when opened as a local file; you have to get it via HTTP).
-
Mike about 7 years
I tried it as
sed '/<loc>/!d; s/[[:space:]]*<loc>(.*)<\/loc>/\1/' sitemap.xml > links.txt
but it outputs the same XML content. It worked with the grep command above, but I am trying to figure out why this did not work.
-
LarS about 7 years
I think it's because you did not escape the () as \( and \).
-
My Name about 6 years
Very good answer! A reminder that if your links use HTTPS, change http to https in the code.