Extract links from a sitemap (XML)

13,802

Solution 1

You can use a Python script for this.

This script prints any link that starts with http:

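import re

# Print every URL that sits between '>' and '<' on a line (Python 3)
with open('sitemap.xml', 'r') as f:
    for line in f:
        # https? accepts http and https; [^<]+ stops at the next tag
        for url in re.findall(r'>(https?://[^<]+)<', line):
            print(url)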

In your case, the following script finds all URLs wrapped in <loc> tags:

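import re

# Print every URL wrapped in <loc>...</loc> (Python 3)
with open('sitemap.xml', 'r') as f:
    for line in f:
        # https? accepts http and https; [^<]+ stops at the closing tag
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)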

Here is a nice tool for experimenting with regular expressions if you are not familiar with them.
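
If no such tool is at hand, the pattern can also be checked directly in Python. The following is a minimal sketch, not part of the original answer: it applies a <loc> pattern in the spirit of the scripts above (accepting both http and https) to the sample data from the question. The names LOC_PATTERN and find_loc_urls are made up for illustration.

```python
import re

# Hypothetical helper pattern: capture an http/https URL inside <loc>...</loc>.
# [^<\s]+ stops at the closing tag or at whitespace.
LOC_PATTERN = re.compile(r'<loc>\s*(https?://[^<\s]+)\s*</loc>')


def find_loc_urls(text):
    """Return every URL wrapped in <loc> tags, in document order."""
    return LOC_PATTERN.findall(text)


# Sample data taken from the question, shortened to two entries.
sample = """<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
</url>"""

for url in find_loc_urls(sample):
    print(url)
```

Because the function works on a string rather than a file, it can be fed the contents of a local file or a downloaded sitemap alike.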

If you need to load a remote file, you can use the following code:

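import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

with urlopen('http://server.com/sitemap.xml') as f:
    for line in f:
        # The response yields bytes; decode first (assuming UTF-8)
        text = line.decode('utf-8')
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', text):
            print(url)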

Solution 2

If you're on a Linux box or something with the grep tool, you can just run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
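
For readers without grep, roughly the same match can be sketched in Python. This is an illustrative equivalent rather than the answer's own code; the names URL_PATTERN and grep_like_urls are assumptions. The character class is the same one the grep command uses: a match stops at a space, a double quote, a parenthesis, or an angle bracket.

```python
import re

# Same idea as the grep pattern, without the capture group so that
# findall returns whole matches (like grep -o does).
URL_PATTERN = re.compile(r'https?://[^ "()<>]*')


def grep_like_urls(text):
    """Return every http/https URL found anywhere in the text."""
    return URL_PATTERN.findall(text)


# Hypothetical sitemap snippet with a namespace declaration on the root.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '<url><loc>http://domain.com/pag1?x=1+2</loc></url>\n'
    '</urlset>'
)

for url in grep_like_urls(sample):
    print(url)
```

Note that a pattern like this is not tied to <loc> tags, so it also picks up URLs in attributes such as the xmlns declaration. The character class itself does not exclude "?" or "+".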

Solution 3

This could be accomplished by a single sed command, which seems to be more solid than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(found at: linuxquestions.org)

Solution 4

Using XSLT, you can render the locations by selecting them with this XPath expression:

/url/loc
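
The answer does not show the stylesheet itself. As a rough illustration of the same XPath idea, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. It assumes the sitemap has a urlset root in the sitemaps.org namespace (the sample in the question omits both); the names NS and locs_via_xpath are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol. Assumption: the file
# declares it on its root element; the question's sample does not.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs_via_xpath(xml_text):
    """Return the text of every <loc> element found with an XPath query."""
    root = ET.fromstring(xml_text)
    # './/sm:loc' selects all loc elements at any depth below the root.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://domain.com/pag1</loc>
    <lastmod>2012-08-25</lastmod>
  </url>
  <url>
    <loc>http://domain.com/pag2</loc>
    <lastmod>2012-08-25</lastmod>
  </url>
</urlset>"""

for url in locs_via_xpath(sample):
    print(url)
```

Unlike the regex approaches, a parser-based query does not depend on how the file is broken into lines, but it does require well-formed XML with a single root element.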

Author: Akshat Mittal

TL;DR I make things. I am primarily a Web & App Developer by profession, but I blog when I feel like it. I'm strongly passionate about how things work, and I love to code. Currently building Metacrypt & CreateMyToken.

Updated on September 18, 2022

Comments

  • Akshat Mittal
    Akshat Mittal almost 2 years

    Let's say I have a sitemap.xml file with this data:

    <url>
    <loc>http://domain.com/pag1</loc>
    <lastmod>2012-08-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
    </url>
    <url>
    <loc>http://domain.com/pag2</loc>
    <lastmod>2012-08-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
    </url>
    <url>
    <loc>http://domain.com/pag3</loc>
    <lastmod>2012-08-25</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.9</priority>
    </url>
    

    I want to extract all the locations from it (data between <loc> and </loc>).

    The sample output would look like:

    http://domain.com/pag1
    http://domain.com/pag2
    http://domain.com/pag3
    

    How to do this?

    • Admin
      Admin almost 12 years
      What OS are you using?
    • Admin
      Admin almost 12 years
      Windows 7 Ultimate X64 / Windows 8 Pro X64 or Ubuntu 12.04 Linux.
    • Admin
      Admin almost 12 years
      Nice setup. Using Terminal on the Ubuntu box, my answer below will get you what you need.
    • Admin
      Admin almost 12 years
      You can also use any text editor that supports regular expressions, such as SublimeText2, to get all the data, or you can use Python; see my answer below.
  • slhck
    slhck almost 12 years
    Could you maybe expand your answer and show the XSLT instructions and the XPath queries needed?
  • Akshat Mittal
    Akshat Mittal almost 12 years
    @slhck Exactly what I wanted to say. The answer should be more explanatory.
  • Akshat Mittal
    Akshat Mittal almost 12 years
    This worked, but with a lot of mistakes (incomplete URLs).
  • Akshat Mittal
    Akshat Mittal almost 12 years
    I read a bit more about this and got it working at last. Upvoting, but it is not really a good enough answer to be chosen.
  • Akshat Mittal
    Akshat Mittal almost 12 years
    How do I load a remote file such as http://server.com/sitemap.xml? I am not very familiar with Python.
  • Ishikawa Yoshi
    Ishikawa Yoshi almost 12 years
    You mean load it with Python?
  • Akshat Mittal
    Akshat Mittal almost 12 years
    Yup. You used f = open('sitemap.xml','r') to open the file; how do I open a remote file on an HTTP server?
  • Ishikawa Yoshi
    Ishikawa Yoshi almost 12 years
    Did you import the re module?
  • bobmagoo
    bobmagoo almost 12 years
    Weird, I just ran this over Google's sitemap.xml file and didn't see any issues. Which ones did it miss?
  • Akshat Mittal
    Akshat Mittal almost 12 years
    This missed many URLs that contained "?" and "+".
  • trante
    trante almost 10 years
    Thank you. For anybody who wants to save the output to a file: grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml > links.txt
  • SmallChess
    SmallChess about 9 years
    +1 This is actually a very simple but powerful solution.
  • Baptiste Donaux
    Baptiste Donaux about 8 years
    Your solution works perfectly.
  • Łukasz Rysiak
    Łukasz Rysiak almost 8 years
    For years I've been using regex etc. for this, but XSLT is so cool in this case :) For complete noobs in XSLT (like me), it'd be nice to add that the only thing you have to do is save this code as stylesheet.xsl and add a line to your XML document that links to the stylesheet: <?xml-stylesheet type="text/xsl" version="1.0" href="stylesheet.xsl"?> Then open your XML in a browser (it won't work when opened as a local file; you have to fetch it via HTTP).
  • Mike
    Mike about 7 years
    I tried it as sed '/<loc>/!d; s/[[:space:]]*<loc>(.*)<\/loc>/\1/' sitemap.xml > links.txt but it outputs the same XML content. It worked with the grep command above, but I am trying to figure out why this did not work.
  • LarS
    LarS about 7 years
    I think it's because you did not escape the ( and ) as \( and \).
  • My Name
    My Name about 6 years
    Very good answer! A reminder that if your links are in HTTPS, change http to https in the code.