Extract Links from a sitemap(xml)

url xml extract sitemap

13,802

Solution 1

You can use a Python script for this.

This script prints any link that starts with http:

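import re

# Print every URL that sits between '>' and '<' in sitemap.xml
with open('sitemap.xml', 'r') as f:
    for line in f:
        # [^<]+ stops at the closing tag, so '?' and '+' in URLs are kept
        for url in re.findall(r'>(https?://[^<]+)<', line):
            print(url)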

In your case, the next script finds all data wrapped in <loc> tags:

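import re

# Print every URL wrapped in <loc>...</loc> in sitemap.xml
with open('sitemap.xml', 'r') as f:
    for line in f:
        # [^<]+ stops at the closing tag, so '?' and '+' in URLs are kept
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)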

Here is a nice tool to play with regexps if you are not familiar with them.

If you need to load a remote file, you can use the following code:

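import re
from urllib.request import urlopen

# Fetch the sitemap over HTTP and print each <loc> URL
with urlopen('http://server.com/sitemap.xml') as f:
    for raw in f:
        # The response yields bytes, so decode each line before matching
        line = raw.decode('utf-8')
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)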
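
The scripts above repeat the same matching loop, so it can be pulled into one function that works on text from a local or a remote file. This is only a minimal sketch, assuming Python 3; the names `LOC_RE` and `extract_locs` and the sample string are hypothetical and not part of the original answer.

```python
import re

# Hypothetical helper pattern: a URL wrapped in <loc>...</loc>,
# allowing optional whitespace inside the tags.
LOC_RE = re.compile(r'<loc>\s*(https?://[^<\s]+)\s*</loc>')


def extract_locs(xml_text):
    """Return the list of <loc> URLs found anywhere in xml_text."""
    return LOC_RE.findall(xml_text)


if __name__ == '__main__':
    # Made-up sample data, including a URL with '?' and '+'
    sample = (
        '<url><loc>http://domain.com/pag1</loc></url>'
        '<url><loc>https://domain.com/pag2?a=1+2</loc></url>'
    )
    for url in extract_locs(sample):
        print(url)
```

Because the function searches the whole text at once, it does not depend on each `<loc>` sitting on its own line. Note that a regex returns XML-escaped characters as they appear in the file, so an `&` in a URL comes back as `&amp;`.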

Solution 2

If you're on a Linux box or something with the grep tool, you can just run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml

Solution 3

This could be accomplished by a single sed command, which seems to be more solid than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(found at: linuxquestions.org)

Solution 4

Using XSLT, you can render it out with XPath

/url/loc
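
The same `loc` selection can also be made with a real XML parser instead of XSLT. The sketch below is a swapped-in technique, not the XSLT stylesheet this answer refers to: it assumes Python 3 and uses the standard library's `xml.etree.ElementTree`. The function name `locs_from_sitemap` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def locs_from_sitemap(xml_text):
    """Return the text of every <loc> element, with or without a namespace."""
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        # Namespaced tags look like '{uri}loc'; strip the '{uri}' part.
        if elem.tag.rsplit('}', 1)[-1] == 'loc' and elem.text:
            urls.append(elem.text.strip())
    return urls


if __name__ == '__main__':
    # Made-up sample using the namespace that sitemaps.org files declare
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>http://domain.com/pag1</loc></url>'
        '<url><loc>http://domain.com/pag2</loc></url>'
        '</urlset>'
    )
    print('\n'.join(locs_from_sitemap(sample)))
```

The parser needs a well-formed document with a single root element, so it would reject a bare run of `<url>` blocks that has no enclosing `<urlset>`. Unlike a regex, it returns entities such as `&amp;` already unescaped.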



Akshat Mittal

TL;DR I make things. I am primarily a Web & App Developer by profession, but I blog when I feel like. I'm strongly passionate about how things work, and I love to code. Currently building Metacrypt & CreateMyToken.

Updated on September 18, 2022

Comments

Akshat Mittal almost 2 years
Let's say I have a sitemap.xml file with this data:
```
<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag3</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
```
I want to extract all the locations from it (data between <loc> and </loc>).

Sample output would be:
```
http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3
```
How can I do this?
- Admin almost 12 years
  
  What OS are you using?
- Admin almost 12 years
  
  Windows 7 Ultimate X64 / Windows 8 Pro X64 or Ubuntu 12.04 Linux.
- Admin almost 12 years
  
  Nice setup. Using Terminal on the Ubuntu box, my answer below will get you what you need.
- Admin almost 12 years
  
  You can also use any text editor that supports regexps, such as SublimeText2, to get all the data, or you can use Python; see my answer below.
slhck almost 12 years

Could you maybe expand your answer and show the XSLT instructions and the XPath queries needed?
Akshat Mittal almost 12 years

@slhck Exactly what I wanted to say. The answer should be more explanatory.
Akshat Mittal almost 12 years

This worked, but with a lot of mistakes (incomplete URLs).
Akshat Mittal almost 12 years

I read a bit more about this and got it working at last. Upvoting, but it is not really a good enough answer to be chosen.
Akshat Mittal almost 12 years

How do I load a remote file like http://server.com/sitemap.xml? I am not very familiar with Python.
Ishikawa Yoshi almost 12 years

You mean load it with Python?
Akshat Mittal almost 12 years

Yup. You used f = open('sitemap.xml','r') to open the file; how do I open a remote file on an HTTP server?
Ishikawa Yoshi almost 12 years

Did you import the re module?
Ishikawa Yoshi almost 12 years

let us continue this discussion in chat
bobmagoo almost 12 years

Weird, I just ran this over Google's sitemap.xml file and didn't see any issues. Which ones did it miss?
Akshat Mittal almost 12 years

This missed many URLs that contained "?" and "+".
trante almost 10 years

Thank you. For anybody who wants to save the output to a file: grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml > links.txt
SmallChess about 9 years

+1 This is actually a very simple but powerful solution.
Baptiste Donaux about 8 years

Your solution works perfectly.
Łukasz Rysiak almost 8 years

For years I've been using regex etc. for this, but XSLT is so cool in this case :) For complete noobs in XSLT (like me), it'd be nice to add that the only thing you have to do is save this code as stylesheet.xsl and add a row to your XML document with a link to the stylesheet: <?xml-stylesheet type="text/xsl" version="1.0" href="stylesheet.xsl"?> Then open your XML in a browser (it won't work when opened as a local file; you have to get it via HTTP).
Mike about 7 years

I tried it as sed '/<loc>/!d; s/[[:space:]]*<loc>(.*)<\/loc>/\1/' sitemap.xml > links.txt but it outputs the same XML content. It worked with the grep command above, but I am trying to figure out why this did not work.
LarS about 7 years

I think it's because you did not escape the ( and ) as \( and \).
My Name about 6 years

Very good answer! A reminder that if your links are in HTTPS, change http to https in the code.