Parsing hostname and port from string or URL

python regex parsing


Solution 1

I'm not that familiar with urlparse, but using a regex you'd do something like:

import re

p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'

m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'

Or, without port:

m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'

EDIT: fixed regex to also match 'www.abc.com 123'
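
To make the "treat an empty port as 80" step concrete, here is a minimal sketch that wraps the pattern above in a helper. The name host_and_port and the default_port parameter are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, written as a raw string and compiled once.
PATTERN = re.compile(r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*')

def host_and_port(text, default_port=80):
    """Return (host, port); port falls back to default_port when none is found.

    Hypothetical helper, not part of the original answer.
    """
    m = PATTERN.search(text)
    if m is None:
        raise ValueError("no host found in %r" % (text,))
    port = m.group('port')
    # The port group matches an empty string when no digits follow the host.
    return m.group('host'), (int(port) if port else default_port)
```

With this, 'http://www.abc.com:123/test' and 'www.abc.com 123' both give ('www.abc.com', 123), while 'www.abc.com' gives ('www.abc.com', 80).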

Solution 2

You can use urlparse to get the hostname from a URL string:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
print(urlparse("http://www.website.com/abc/xyz.html").hostname)  # prints www.website.com

Solution 3

>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')

>>> aaa.hostname
'www.acme.com'

>>> aaa.port
456

Solution 4

The reason it fails for:

www.acme.com 456

is that it is not a valid URI. Why don't you just:

1. Replace the space with a :
2. Parse the resulting string using the standard urlparse method

Try to make use of default functionality as much as possible, especially when it comes to parsing well-known formats like URIs.
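
The two steps could be sketched roughly as follows. This is only an illustration: the helper name split_host_port is hypothetical, and the "//" prefix is an extra assumption on top of the steps above, added because urlparse only recognises a network location when it is introduced by "//".

```python
from urllib.parse import urlparse

def split_host_port(text, default_port=80):
    """Normalize the input, then let urlparse do the parsing (hypothetical helper)."""
    # Step 1: replace the whitespace between host and port with a colon.
    normalized = ":".join(text.split())
    # Assumption: without a scheme, prefix "//" so urlparse sees a netloc.
    if "://" not in normalized:
        normalized = "//" + normalized
    # Step 2: parse with the standard library.
    parts = urlparse(normalized)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port
```

Under these assumptions, 'http://www.acme.com:456', 'www.acme.com:456' and 'www.acme.com 456' all give ('www.acme.com', 456), and 'www.acme.com' gives ('www.acme.com', 80).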

Solution 5

Method using urllib -

    from urllib.parse import urlparse
    url = 'https://stackoverflow.com/questions'
    print(urlparse(url))

Output -

ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')

Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
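
Since the question asks for the host and port specifically, here is a short sketch that reads them from the same result object. The fallback to 80 follows the question's requirement; it is not something urlparse does on its own.

```python
from urllib.parse import urlparse

result = urlparse('https://stackoverflow.com/questions')

host = result.hostname  # the host part of netloc
port = result.port      # None here, because the URL carries no explicit port

if port is None:
    # Default requested in the question; note that HTTPS normally uses 443.
    port = 80
```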


TonyM

Updated on July 09, 2022

Comments

TonyM almost 2 years
I can be given a string in any of these formats:
- url: e.g. http://www.acme.com:456
- string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and, if present, the port. If the port value is not present, I would like it to default to 80.

I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.

I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
dejjub-AIS about 12 years

What regex have you tried? If not a regex, what code have you written?
TonyM about 12 years

When I use urlparse on host:port it puts the hostname in the scheme rather than netloc.
TonyM about 12 years

Thanks this looks really useful.
ntziolis about 12 years

From the manual: "Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component."
TonyM about 12 years

I'm not saying it's wrong, but it doesn't seem the best way for processing the hostname:port format. And adding prefixes doesn't seem very elegant.
ntziolis about 12 years

Basically it boils down to this: 1. Do you normalize before parsing (using a standard function) or 2. do you try and use regex or something like it to handle the different formats while parsing. In my experience it's better to normalize since the regex solutions are easy to get wrong + you are replicating existing functionality.
TonyM about 12 years

At the moment, I'm thinking I'll use urlparse on the URL and the regex by @claesv on the hostname:port format.
claesv about 10 years

I'm assuming the downvotes are because this solution is overly complicated. I accept that, and agree with @ntziolis that you should try to use standard functionality when possible.
James over 7 years

Standard urlparse won't work for a string that does not start with http(s) or //, so this solution seems helpful. Why downvote without explaining?
Rodrigo Laguna about 6 years

I don't know why, but when you run it as aaa = urlparse('www.acme.com:456'), aaa.hostname is None. Do you know why? By the way, that's exactly what the question asks.
ymbirtt over 5 years

@RodrigoLaguna Real late to the party here, but this sits as an unresolved question. There's a difference between urlparse('www.acme.com:456') and urlparse('http://www.acme.com:456'). From the docs, urlparse assumes an RFC1808-compliant URL, and won't recognise the network location correctly unless it's introduced with a // - docs.python.org/2/library/urlparse.html#urlparse.urlparse.
Anders Kaseorg over 5 years

This fails for URLs with literal IPv6 addresses like http://[2001:db8:85a3::8a2e:370:7334]:80/test.
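
A quick check of that URL, assuming the comment refers to the regex from Solution 1 (the variable names below are just for illustration):

```python
import re
from urllib.parse import urlparse

url = 'http://[2001:db8:85a3::8a2e:370:7334]:80/test'

# Pattern copied from Solution 1: the host group stops at the first colon,
# which here sits inside the bracketed IPv6 literal.
p = '(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p, url)
regex_host = m.group('host')  # '[2001'
regex_port = m.group('port')  # ''

# urlparse understands the bracketed literal and strips the brackets.
parsed = urlparse(url)
std_host = parsed.hostname    # '2001:db8:85a3::8a2e:370:7334'
std_port = parsed.port        # 80
```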
user1156544 almost 5 years

In Python 3 use: import urllib.parse and urllib.parse.urlparse('http://....')
VoteCoffee about 3 years

Per @user1156544: In Python 3 use: import urllib.parse and urllib.parse.urlparse('http://....')