Get URL from BeautifulSoup object


If the url variable is a string holding an actual URL, then forget BeautifulSoup here and just use that same url variable. BeautifulSoup is meant for parsing HTML, not a bare URL. In fact, if you pass it a URL like this, you get a warning:

>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup

Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML:

>>> soup
<html><body><p>https://foo</p></body></html>

If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:

>>> print(soup.text)
https://foo
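
What the warning recommends can be sketched roughly like this: fetch the document with an HTTP client, feed the response body to BeautifulSoup, and keep the URL in its own variable, because the soup does not remember where its markup came from. This is only a sketch under some assumptions: it uses the standard library's urllib.request in place of requests to avoid an extra dependency, and load_page and do_something_useful are hypothetical names made up for illustration.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def load_page(url):
    """Fetch url and return (url, soup) so the URL travels with the soup.

    Hypothetical helper: urlopen stands in for an HTTP client like requests.
    """
    with urlopen(url) as response:
        html = response.read()
    return url, BeautifulSoup(html, "html.parser")


def do_something_useful(url, soup):
    """Hypothetical consumer that receives the URL as an explicit argument."""
    title = soup.title.get_text() if soup.title else None
    return {"url": url, "title": title}
```

A caller would then do something like url, soup = load_page("https://foo") and pass both values on, rather than trying to recover the URL from the soup afterwards.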

If, on the other hand, url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link depends on where that link sits in the HTML. One way is to call find to get the first a tag and then read its href value.

>>> actual_html = '<html><body><a href="http://moo">My link text</a></body></html>'
>>> newsoup = BeautifulSoup(actual_html, 'html.parser')
>>> newsoup.find('a')['href']
'http://moo'
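
One caveat with the find approach: find returns None when there is no a tag, and indexing ['href'] raises KeyError when the tag has no href attribute. A slightly more defensive version might look like the sketch below. The function names first_href and all_hrefs are made up for illustration, and html.parser is just one possible parser choice.

```python
from bs4 import BeautifulSoup


def first_href(html):
    """Return the href of the first <a> tag that has one, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True only matches <a> tags that actually carry an href attribute
    link = soup.find("a", href=True)
    return link["href"] if link is not None else None


def all_hrefs(html):
    """Return the href values of every <a> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

With the actual_html string from above, first_href(actual_html) gives 'http://moo', and markup without any links gives None instead of raising an exception.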
Author by

QED

Updated on June 04, 2022

Comments

  • QED
    QED almost 2 years

    Somebody is handing my function a BeautifulSoup object (BS4) that he has gotten using the typical call:

    soup = BeautifulSoup(url)
    

    my code:

    def doSomethingUseful(soup):
        url = soup.???
    

    How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.

    • Reedinationer
      Reedinationer about 5 years
      I'm not sure what your project entails, but you could look into Selenium at selenium-python.readthedocs.io. Selenium lets you read driver.current_url, which might solve your problem. You can also run it headless, and many of its methods for finding elements on the page are basically the same as BeautifulSoup's.
    • ᴀʀᴍᴀɴ
      ᴀʀᴍᴀɴ about 5 years
      Beautiful Soup doesn't take a URL; it takes HTML and parses it.
    • QED
      QED about 5 years
      @ᴀʀᴍᴀɴ if you post that I'll accept it
    • ᴀʀᴍᴀɴ
      ᴀʀᴍᴀɴ about 5 years
      @QED Actually, it is not an answer, just a hint that points you the right way. I think it is better left as a comment.
    • QED
      QED about 5 years
      It led to the answer for me, but ok. Thanks.
  • QED
    QED about 5 years
    Oops! You are right and I should have known better. No problem, I'll just pass the original URL into my function. I'll accept your answer in a few minutes when SO allows it.