Function in Python to clean up and normalize a URL


Solution 1

Take a look at urlparse.urlparse(). I've had good success with it.


note: This answer is from 2011 and is specific to Python 2. In Python 3 the urlparse module has been renamed to urllib.parse. The corresponding Python 3 documentation for urllib.parse can be found here:

https://docs.python.org/3/library/urllib.parse.html
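
For illustration, here is a minimal Python 3 sketch (my own, not from the original answer) that uses urllib.parse to split a URL and reassemble a lightly normalized form, lowercasing the host and making an empty path explicit:

from urllib.parse import urlparse, urlunparse

def normalize(url):
    # Split into the six components: scheme, netloc, path, params, query, fragment
    parts = urlparse(url)
    # Lowercase the host and make an empty path explicit ("/")
    return urlunparse((
        parts.scheme,
        parts.netloc.lower(),
        parts.path or "/",
        parts.params,
        parts.query,
        parts.fragment,
    ))

print(normalize("http://Example.COM"))        # http://example.com/
print(normalize("http://Example.COM/a?x=1"))  # http://example.com/a?x=1

Note that urlparse() alone will not make scheme-less inputs like 'example.com' equal to 'http://example.com/'; see the comments below.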

Solution 2

It's done in Scrapy:

http://nullege.com/codes/search/scrapy.utils.url.canonicalize_url

Canonicalize the given URL by applying the following procedures:

  • sort query arguments, first by key, then by value
  • percent-encode paths and query arguments; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
  • normalize all spaces (in query arguments) to '+' (plus symbol)
  • normalize the case of percent-encodings (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)
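
As a quick sketch of how you might call it (per the comments below, Scrapy now imports this function from the w3lib package, so this assumes you have w3lib installed):

from w3lib.url import canonicalize_url

# Sorts the query arguments, normalizes percent-encoding and,
# by default, drops the fragment
print(canonicalize_url("http://example.com/do?b=2&a=1#frag"))
# -> http://example.com/do?a=1&b=2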

Solution 3

url-normalize might be what you're looking for.

Depending on your preferences, you may also want to:

  1. remove UTM parameters
  2. remove http(s)://
  3. remove www.
  4. remove trailing /

Here is an example that does this:

from w3lib.url import url_query_cleaner
from url_normalize import url_normalize

urls = [
    'example.com',
    'example.com/',
    'http://example.com/',
    'http://example.com',
    'http://example.com?',
    'http://example.com/?',
    'http://example.com//',
    'http://example.com?utm_source=Google',
]


def canonical_url(u):
    # RFC 3986 normalization (adds a scheme, lowercases the host, etc.)
    u = url_normalize(u)
    # Strip common UTM tracking parameters from the query string
    u = url_query_cleaner(
        u,
        parameterlist=['utm_source', 'utm_medium', 'utm_campaign',
                       'utm_term', 'utm_content'],
        remove=True,
    )
    # Drop the scheme, a leading "www." and any trailing slash
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u


list(map(canonical_url, urls))

Which gives this result:

['example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com']

There are still issues with shortened links and redirects of various sorts, but you'd need to make a request to the URL to sort through those.
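
For example, a rough sketch using the third-party requests package (an assumption on my part, not part of the original answer) to follow redirects before canonicalizing:

import requests

def resolve(url):
    # Follow redirects (shortened links, http -> https, etc.) and return
    # the final URL the server settles on; requires network access
    resp = requests.head(url, allow_redirects=True, timeout=10)
    return resp.url

canonical_url(resolve("http://example.com"))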

Comments

  • nikcub, almost 2 years ago

    I am using URLs as a key, so I need them to be consistent and clean. I need a Python function that will take a URL and clean it up so that I can do a GET from the DB. For example, it will take the following:

    example.com
    example.com/
    http://example.com/
    http://example.com
    http://example.com?
    http://example.com/?
    http://example.com//
    

    and output a clean consistent version:

    http://example.com/
    

    I looked through the standard library and GitHub and couldn't find anything like this.

    Update

    I couldn't find a Python library that implements everything discussed here and in the RFC:

    http://en.wikipedia.org/wiki/URL_normalization

    So I am writing one now. There is a lot more to this than I initially imagined.

  • jd., about 13 years ago
    Along with urlparse.urlunparse().
  • nikcub, about 13 years ago
    Thanks for that - for some reason I missed the normalization aspect of that function when I was reading the docs early this morning. Took me a few minutes to implement.
  • nikcub, about 13 years ago
    Scratch that - the normalization fails on 70%+ of my test cases (I have 50 tests now). For some reason the Python community was against implementing normalization per the RFC and per how browsers handle it: en.wikipedia.org/wiki/URL_normalization. I found this Python bug: bugs.python.org/issue4191
  • nikcub, about 13 years ago
    To add, the urlparse normalization will not find the above URLs in the question to all be equal to each other, which is what is important.
  • Dawn Drescher, over 7 years ago
    At least these days Scrapy imports this function from the w3lib package.
  • Pedro Lobito, almost 7 years ago
    The link is dead.
  • Alexandre V., over 5 years ago
    It does not answer the question.
  • Xonshiz, over 4 years ago
    It doesn't clean certain cases like: print(urlparse.urlparse('http://example.com//episode/').geturl()). At least, it didn't clean out the // in the URL in Python 2.
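
One way to handle that // case yourself (a sketch of my own, not from the thread) is to collapse repeated slashes in the path component before comparing:

import re
from urllib.parse import urlparse, urlunparse

def collapse_slashes(url):
    parts = urlparse(url)
    # '//episode/' -> '/episode/'
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunparse(parts._replace(path=path))

print(collapse_slashes("http://example.com//episode/"))
# -> http://example.com/episode/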