Function in Python to clean up and normalize a URL
Solution 1
Take a look at urlparse.urlparse(). I've had good success with it.
Note: this answer is from 2011 and is specific to Python 2. In Python 3, the urlparse module has been renamed to urllib.parse. The corresponding Python 3 documentation for urllib.parse can be found here:
https://docs.python.org/3/library/urllib.parse.html
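To illustrate what the module does (and does not do) on its own, here is a minimal sketch using the Python 3 names. Note that parsing alone is not normalization: only the scheme is lowercased, while the host, path, and query pass through untouched.

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse('HTTP://Example.COM:80/a/../b?x=1#frag')

# urlparse splits the URL into components; the scheme is
# lowercased, but the rest is left as-is.
print(parts.scheme)    # 'http'
print(parts.hostname)  # 'example.com' (the hostname accessor lowercases)
print(parts.path)      # '/a/../b' (dot-segments are NOT resolved)

# urlunparse reassembles the pieces, so parse/unparse round-trips
# the original URL rather than normalizing it.
print(urlunparse(parts))  # 'http://Example.COM:80/a/../b?x=1#frag'
```

So urlparse/urlunparse give you the building blocks, but any canonicalization logic has to be layered on top.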
Solution 2
It's done in Scrapy (canonicalize_url, which these days Scrapy imports from the w3lib package):
http://nullege.com/codes/search/scrapy.utils.url.canonicalize_url
Canonicalize the given URL by applying the following procedures:
- sort query arguments, first by key, then by value
- percent-encode paths and query arguments; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
- normalize all spaces (in query arguments) to '+' (plus symbol)
- normalize percent-encoding case (%2f -> %2F)
- remove query arguments with blank values (unless keep_blank_values is True)
- remove fragments (unless keep_fragments is True)
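If you don't want the dependency, most of those steps can be approximated with the standard library alone. The sketch below is a rough imitation of the list above, not the actual scrapy/w3lib implementation:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote

def canonicalize(url, keep_blank_values=False, keep_fragments=False):
    """Rough stdlib approximation of the canonicalization steps above."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Sort query arguments, first by key, then by value; parse_qsl
    # drops blank values unless asked to keep them, and urlencode
    # re-encodes spaces as '+'.
    pairs = sorted(parse_qsl(query, keep_blank_values=keep_blank_values))
    query = urlencode(pairs)
    # Normalize percent-encoding case in the path (%2f -> %2F) ...
    path = re.sub(r'%[0-9a-fA-F]{2}', lambda m: m.group(0).upper(), path)
    # ... and percent-encode non-ASCII characters as UTF-8 (RFC 3986),
    # leaving '/' and existing escapes alone.
    path = quote(path, safe='/%') or '/'
    if not keep_fragments:
        fragment = ''
    return urlunsplit((scheme, netloc, path, query, fragment))

print(canonicalize('http://example.com/a%2fb?b=2&a=1&c=#frag'))
# -> http://example.com/a%2Fb?a=1&b=2
```

The real w3lib function handles more edge cases (e.g. non-UTF-8 input encodings), so prefer it when you can take the dependency.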
Solution 3
url-normalize might be what you're looking for.
Depending on your preference, you may also want to:
- remove UTM parameters
- remove http(s)://
- remove www.
- remove trailing /
Here is an example which does all of this:
from w3lib.url import url_query_cleaner
from url_normalize import url_normalize

urls = ['example.com',
        'example.com/',
        'http://example.com/',
        'http://example.com',
        'http://example.com?',
        'http://example.com/?',
        'http://example.com//',
        'http://example.com?utm_source=Google']

def canonical_url(u):
    u = url_normalize(u)
    u = url_query_cleaner(u,
                          parameterlist=['utm_source', 'utm_medium',
                                         'utm_campaign', 'utm_term',
                                         'utm_content'],
                          remove=True)
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

list(map(canonical_url, urls))
Which gives this result:
['example.com',
'example.com',
'example.com',
'example.com',
'example.com',
'example.com',
'example.com',
'example.com']
There are still issues with shortened links and redirects of various sorts, but you'd need to make a request to the URL to sort through those.
nikcub
Updated on June 11, 2022
Comments
-
nikcub, almost 2 years ago:
I am using URLs as keys, so I need them to be consistent and clean. I need a Python function that will take a URL and clean it up so that I can do a get from the DB. For example, it will take the following:
example.com example.com/ http://example.com/ http://example.com http://example.com? http://example.com/? http://example.com//
and output a clean consistent version:
http://example.com/
I looked through the standard library and GitHub and couldn't find anything like this.
Update
I couldn't find a Python library that implements everything discussed here and in the RFC:
http://en.wikipedia.org/wiki/URL_normalization
So I am writing one now. There is a lot more to this than I initially imagined.
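A few of the RFC 3986 normalizations described in that article (lowercase scheme and host, drop default ports, give an empty path a '/') can be sketched with the standard library. The normalize helper below is hypothetical, not the library the author is writing, and it ignores userinfo, dot-segments, and schemeless input:

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {'http': 80, 'https': 443}

def normalize(url):
    """Hypothetical sketch of a few RFC 3986 normalizations."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # The hostname accessor already lowercases the host.
    netloc = parts.hostname or ''
    # Keep the port only when it differs from the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f'{netloc}:{parts.port}'
    # An empty path is equivalent to '/'.
    path = parts.path or '/'
    # Fragments are dropped here; keep them if your use case needs them.
    return urlunsplit((scheme, netloc, path, parts.query, ''))

print(normalize('HTTP://Example.com:80'))  # -> http://example.com/
print(normalize('http://example.com:8080/x'))  # -> http://example.com:8080/x
```

This already makes several of the question's examples compare equal, but as the update says, a full implementation per the RFC involves considerably more than this.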
-
jd., about 13 years ago: Along with urlparse.urlunparse().
-
nikcub, about 13 years ago: Thanks for that - for some reason I missed the normalization aspect of that function when I was reading the docs early this morning. Took me a few minutes to implement.
-
nikcub, about 13 years ago: Scratch that - the normalization fails on 70%+ of my test cases (I have 50 tests now). For some reason the Python community was against implementing normalization per the RFC and per how browsers handle it: en.wikipedia.org/wiki/URL_normalization. I found this Python bug: bugs.python.org/issue4191
-
nikcub, about 13 years ago: To add, the urlparse normalization will not find the above URLs in the question to all be equal to each other, which is what is important.
-
Dawn Drescher, over 7 years ago: At least these days Scrapy imports this function from the w3lib package.
-
Pedro Lobito, almost 7 years ago: The link is dead.
-
Alexandre V., over 5 years ago: It does not answer the question.
-
Xonshiz, over 4 years ago: It doesn't clean certain cases, like print(urlparse.urlparse('http://example.com//episode/').geturl()). At least, it didn't clean out the // in the URL in Python 2.