Parsing hostname and port from string or URL
Solution 1
I'm not that familiar with urlparse, but using regex you'd do something like:
import re
p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match 'www.abc.com 123'
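A minimal sketch of how the pattern above could be wrapped so that a missing port falls back to 80, as the question asks. The function name `host_and_port` and the `default_port` parameter are hypothetical, chosen only for illustration; the pattern itself is taken unchanged from this answer.

```python
import re

# Pattern copied unchanged from the answer above.
PATTERN = re.compile(r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*')

def host_and_port(value, default_port=80):
    """Return (host, port); use default_port when no port is matched.

    Hypothetical helper, not part of any library.
    """
    m = PATTERN.search(value)
    if m is None:
        raise ValueError('no host found in %r' % value)
    port = m.group('port')
    # An empty port group means the input carried no port.
    return (m.group('host'), int(port) if port else default_port)

print(host_and_port('http://www.acme.com:456'))
print(host_and_port('www.acme.com 456'))
print(host_and_port('www.acme.com'))
```

As a comment on this page points out, the pattern does not cope with literal IPv6 addresses, so this wrapper inherits that limitation.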
Solution 2
You can use urlparse to get the hostname from a URL string:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
print(urlparse("http://www.website.com/abc/xyz.html").hostname)  # prints www.website.com
Solution 3
>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
Solution 4
The reason it fails for:
www.acme.com 456
is that it is not a valid URI. Why don't you just:
- Replace the space with a ':'
- Parse the resulting string using the standard urlparse method
Try to use the default functionality as much as possible, especially when parsing well-known formats like URIs.
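The normalize-then-parse idea could be sketched roughly as follows. This assumes Python 3 (`urllib.parse`), and it also assumes a `//` prefix is added when the input has none, since the urlparse documentation quoted in the comments says a netloc is only recognized after `//`. The name `parse_host_port` and the `default_port` parameter are hypothetical.

```python
from urllib.parse import urlparse

def parse_host_port(value, default_port=80):
    """Normalize the input, then let urlparse do the parsing.

    Hypothetical helper; assumes the only space in the input is the
    one separating host and port.
    """
    value = value.strip().replace(' ', ':')
    # Without a leading '//' urlparse does not treat the text as a netloc.
    if '//' not in value:
        value = '//' + value
    parts = urlparse(value)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port

print(parse_host_port('http://www.acme.com:456'))
print(parse_host_port('www.acme.com:456'))
print(parse_host_port('www.acme.com 456'))
print(parse_host_port('www.acme.com'))
```

The design choice here is to keep all actual parsing in the standard library and limit the custom code to two small string fixes.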
Solution 5
Method using urllib -
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output -
ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')
Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
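To get at the host and port specifically, the parsed result exposes `hostname` and `port` attributes. A short sketch, assuming Python 3; the variable names and the fallback to 80 are illustrative and follow the default the question asks for.

```python
from urllib.parse import urlparse

result = urlparse('https://stackoverflow.com/questions')
host = result.hostname  # host part of the netloc
port = result.port      # None when the URL carries no explicit port

# Fall back to 80 when no port is given (the default the question asks for).
effective_port = port if port is not None else 80

# A bracketed IPv6 literal is handled as well; hostname drops the brackets.
v6 = urlparse('http://[2001:db8:85a3::8a2e:370:7334]:80/test')
print(host, effective_port)
print(v6.hostname, v6.port)
```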
TonyM
Updated on July 09, 2022
Comments
-
TonyM almost 2 years
I can be given a string in any of these formats:
URL: e.g. http://www.acme.com:456
string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and, if present, the port. If the port is not present, I would like it to default to 80.
I have tried urlparse, which works fine for the URL, but not for the other formats. When I use urlparse on hostname:port, for example, it puts the hostname in the scheme rather than netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
-
dejjub-AIS about 12 years
What regex have you tried? If not a regex, what code have you written?
-
TonyM about 12 years
When I use urlparse on host:port it puts the hostname in the scheme rather than netloc.
-
TonyM about 12 years
Thanks, this looks really useful.
-
ntziolis about 12 years
From the manual: "Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component."
-
TonyM about 12 years
I'm not saying it's wrong, but it doesn't seem the best way to process the hostname:port format. And adding prefixes doesn't seem very elegant.
-
ntziolis about 12 years
Basically it boils down to this: 1. do you normalize before parsing (using a standard function), or 2. do you try to use a regex or something like it to handle the different formats while parsing? In my experience it's better to normalize, since regex solutions are easy to get wrong and you are replicating existing functionality.
-
TonyM about 12 years
At the moment, I'm thinking I'll use urlparse on the URL and the regex by @claesv on the hostname:port format.
-
claesv about 10 years
I'm assuming the downvotes are because this solution is overly complicated. I accept that, and agree with @ntziolis that you should try to use standard functionality when possible.
-
James over 7 years
Standard urlparse won't work for a string that does not start with http(s) or //, so this solution seems helpful. Why downvote without explaining?
-
Rodrigo Laguna about 6 years
When you run it as aaa = urlparse('www.acme.com:456'), then aaa.hostname is None. Do you know why? By the way, that's exactly what the question asks.
-
ymbirtt over 5 years
@RodrigoLaguna Real late to the party here, but this sits as an unresolved question. There's a difference between urlparse('www.acme.com:456') and urlparse('http://www.acme.com:456'). From the docs, urlparse assumes an RFC1808-compliant URL, and won't recognise the network location correctly unless it's introduced with a // - docs.python.org/2/library/urlparse.html#urlparse.urlparse.
-
Anders Kaseorg over 5 years
This fails for URLs with literal IPv6 addresses like http://[2001:db8:85a3::8a2e:370:7334]:80/test.
-
user1156544 almost 5 years
In Python3 use: import urllib.parse and urllib.parse.urlparse('http://....')
-
VoteCoffee about 3 years
Per @user1156544: In Python3 use: import urllib.parse and urllib.parse.urlparse('http://....')