Parsing hostname and port from string or URL

Solution 1

I'm not that familiar with urlparse, but using regex you'd do something like:

import re

p = r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'

m = re.search(p,'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'

Or, without port:

m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'

EDIT: fixed regex to also match 'www.abc.com 123'
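
As a rough sketch of how the pattern above might be wrapped up, the helper below applies it and falls back to port 80 when the port group is empty. The name `host_and_port` and the `default_port` parameter are hypothetical, introduced only for illustration.

```python
import re

# Pattern taken from the answer above; compiled once for reuse.
PATTERN = re.compile(r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*')


def host_and_port(text, default_port=80):
    """Hypothetical helper: return (host, port) extracted by the regex."""
    m = PATTERN.search(text)
    if m is None:
        raise ValueError('no host found in %r' % (text,))
    port = m.group('port')
    # An empty port group means no port was given, so use the default.
    return m.group('host'), int(port) if port else default_port


print(host_and_port('http://www.abc.com:123/test'))  # ('www.abc.com', 123)
print(host_and_port('www.abc.com 123'))              # ('www.abc.com', 123)
print(host_and_port('www.abc.com'))                  # ('www.abc.com', 80)
```

One caveat: because `.?` accepts any single separator, a path that starts with digits, as in 'http://www.abc.com/123', is also read as a port.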

Solution 2

You can use urlparse to get the hostname from a URL string:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
print(urlparse("http://www.website.com/abc/xyz.html").hostname)  # prints www.website.com

Solution 3

>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456

Solution 4

The reason it fails for:

www.acme.com 456

is because it is not a valid URI. Why don't you just:

  1. Replace the space with a :
  2. Parse the resulting string by using the standard urlparse method

Try to make use of default functionality as much as possible, especially when parsing well-known formats like URIs.
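
The two steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `parse_host_port` and its `default_port` parameter are hypothetical, and the `//` prefix is an extra step assumed here because urlparse only recognizes a netloc that is introduced by `//`. It uses the Python 3 import path.

```python
from urllib.parse import urlparse


def parse_host_port(text, default_port=80):
    """Hypothetical helper: normalize the input, then let urlparse do the work."""
    # Step 1: replace the whitespace in "host port" with a colon.
    text = ':'.join(text.split())
    # Assumed extra step: without a scheme or a leading "//", urlparse
    # does not treat the host as a network location.
    if '://' not in text and not text.startswith('//'):
        text = '//' + text
    # Step 2: parse with the standard function.
    parsed = urlparse(text)
    port = parsed.port if parsed.port is not None else default_port
    return parsed.hostname, port


for candidate in ('http://www.acme.com:456', 'www.acme.com:456',
                  'www.acme.com 456', 'www.acme.com'):
    print(candidate, '->', parse_host_port(candidate))
```

The first three inputs should give ('www.acme.com', 456) and the last one ('www.acme.com', 80). A URL whose path contains spaces would be mangled by step 1, so this only suits the formats named in the question.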

Solution 5

Method using urllib -

    from urllib.parse import urlparse
    url = 'https://stackoverflow.com/questions'
    print(urlparse(url))

Output -

ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')

Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
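
To get from the ParseResult to the values the question asks for, the `hostname` and `port` attributes can be read off the result. A small sketch, assuming the default of 80 requested in the question; the variable names are only illustrative.

```python
from urllib.parse import urlparse

url = 'https://stackoverflow.com/questions'
parsed = urlparse(url)

host = parsed.hostname  # 'stackoverflow.com'
# port is None when the URL carries no explicit port, so fall back
# to 80, the default requested in the question.
port = parsed.port if parsed.port is not None else 80

print(host, port)
```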

Author: TonyM

Updated on July 09, 2022

Comments

  • TonyM
    TonyM almost 2 years

    I can be given a string in any of these formats:

    I would like to extract the host and if present a port. If the port value is not present I would like it to default to 80.

    I have tried urlparse, which works fine for the url, but not for the other format. When I use urlparse on hostname:port for example, it puts the hostname in the scheme rather than netloc.

    I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.

    • dejjub-AIS
      dejjub-AIS about 12 years
      What regex have you tried? If not a regex, what code have you written?
  • TonyM
    TonyM about 12 years
    When I use urlparse on host:port it puts the hostname in the scheme rather than netloc.
  • TonyM
    TonyM about 12 years
    Thanks this looks really useful.
  • ntziolis
    ntziolis about 12 years
    From the manual: "Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component."
  • TonyM
    TonyM about 12 years
    I'm not saying it's wrong, but it doesn't seem the best way for processing the hostname:port format. And adding prefixes doesn't seem very elegant.
  • ntziolis
    ntziolis about 12 years
    Basically it boils down to this: 1. Do you normalize before parsing (using a standard function) or 2. do you try and use regex or something like it to handle the different formats while parsing. In my experience it's better to normalize since the regex solutions are easy to get wrong + you are replicating existing functionality.
  • TonyM
    TonyM about 12 years
    At the moment, I'm thinking I'll use urlparse on the URL and the regex by @claesv on the hostname:port format.
  • claesv
    claesv about 10 years
    I'm assuming the downvotes are because this solution is overly complicated. I accept that, and agree with @ntziolis in saying that you should try to use standard functionality when possible.
  • James
    James over 7 years
    Standard urlparse won't work for a string that does not start with http(s) or //, so this solution seems helpful. Why downvote without explaining?
  • Rodrigo Laguna
    Rodrigo Laguna about 6 years
    When you run it as aaa = urlparse('www.acme.com:456'), aaa.hostname is None. Do you know why? By the way, that's exactly what the question asks.
  • ymbirtt
    ymbirtt over 5 years
    @RodrigoLaguna Real late to the party here, but this sits as an unresolved question. There's a difference between urlparse('www.acme.com:456') and urlparse('http://www.acme.com:456'). From the docs, urlparse assumes an RFC1808-compliant URL, and won't recognise the network location correctly unless it's introduced with a // - docs.python.org/2/library/urlparse.html#urlparse.urlparse.
  • Anders Kaseorg
    Anders Kaseorg over 5 years
    This fails for URLs with literal IPv6 addresses like http://[2001:db8:85a3::8a2e:370:7334]:80/test.
  • user1156544
    user1156544 almost 5 years
    In Python3 use: import urllib and urllib.parse.urlparse('http://....')
  • VoteCoffee
    VoteCoffee about 3 years
    Per @user1156544: In Python3 use: import urllib and urllib.parse.urlparse('http://....')
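
The behaviour discussed in the comments above can be checked directly. This is a small sketch using the Python 3 import path; the variable names are only illustrative.

```python
from urllib.parse import urlparse

# No scheme and no leading "//": the host is not parsed as a network location.
bare = urlparse('www.acme.com:456')
print(repr(bare.netloc), bare.hostname)  # netloc is '' and hostname is None

# Introduced by "//": netloc, hostname and port are all filled in.
prefixed = urlparse('//www.acme.com:456')
print(prefixed.hostname, prefixed.port)

# Bracketed IPv6 literal: hostname comes back without the brackets.
v6 = urlparse('http://[2001:db8:85a3::8a2e:370:7334]:80/test')
print(v6.hostname, v6.port)
```

How the unprefixed string is split between scheme and path has varied across Python versions, but in each case netloc stays empty, which is why hostname is None.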