Get URL from BeautifulSoup object
If the url variable is a string holding an actual URL, then you should just forget BeautifulSoup here and use that same url variable. BeautifulSoup is meant for parsing HTML code, not a bare URL. In fact, if you try to use it like this, you get a warning:
>>> from bs4 import BeautifulSoup
>>> url = "https://foo"
>>> soup = BeautifulSoup(url)
C:\Python27\lib\site-packages\bs4\__init__.py:336: UserWarning: "https://foo" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
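Since the URL is just a string, BeautifulSoup doesn't really know what to do with it when you "soupify" it, except for wrapping it up in basic HTML (the exact wrapping depends on the parser; the output here is what lxml produces, while html.parser leaves the string unwrapped):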
>>> soup
<html><body><p>https://foo</p></body></html>
If you still wanted to extract the URL from this, you could just use .text on the object, since it's the only thing in there:
>>> print(soup.text)
https://foo
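Following the advice in that warning, here is a minimal sketch of fetching the document first and keeping the URL next to the soup, so the function receiving it never has to guess. The names Page, fetch_page and do_something_useful are hypothetical, and the html.parser choice and 10-second timeout are assumptions, not anything from the question:

```python
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup


@dataclass
class Page:
    """Hypothetical container pairing a soup with the URL it was fetched from."""
    url: str
    soup: BeautifulSoup


def fetch_page(url, timeout=10):
    """Download the document with requests, then feed the HTML (not the URL) to Beautiful Soup."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    # response.url is the final address after any redirects
    return Page(url=response.url, soup=BeautifulSoup(response.text, "html.parser"))


def do_something_useful(page):
    """The callee gets the URL explicitly instead of trying to dig it out of the soup."""
    title = page.soup.title.string if page.soup.title else None
    return page.url, title
```

A caller would then write something like do_something_useful(fetch_page("https://foo")). Passing url and soup as two separate arguments works just as well; the small wrapper only keeps them from drifting apart.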
If, on the other hand, url is not really a URL at all but rather a bunch of HTML code (in which case the variable name would be very misleading), then how you'd extract a specific link depends on where it sits in that HTML. Doing a find to get the first a tag, then extracting its href value, would be one way.
>>> actual_html = '<html><body><a href="http://moo">My link text</a></body></html>'
>>> newsoup = BeautifulSoup(actual_html)
>>> newsoup.find('a')['href']
'http://moo'
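If you want every link rather than just the first, a small variation on the same idea might look like the sketch below. The extract_links helper and its base_url parameter are hypothetical names, and html.parser is an assumed parser choice:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url=None):
    """Return the href of every <a> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors without an href attribute (e.g. <a name="top">)
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Resolve relative links only when the caller supplies the page's URL
        links.append(urljoin(base_url, href) if base_url else href)
    return links
```

Note that resolving relative links needs the page's own URL, which again has to come from the caller, because the soup does not record where its HTML was downloaded from.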
QED
Updated on June 04, 2022
Comments
-
QED almost 2 years
Somebody is handing my function a BeautifulSoup object (BS4) that he has gotten using the typical call:
soup = BeautifulSoup(url)
my code:
def doSomethingUseful(soup):
    url = soup.???
How do I get the original URL from the soup object? I tried reading the docs AND the BeautifulSoup source code... I'm still not sure.
-
Reedinationer about 5 years
I'm not sure what your project entails, but you could look into Selenium at selenium-python.readthedocs.io. Selenium allows you to call driver.current_url, which might solve your problem. You can also run it headless, and many of the methods for finding elements on the page are basically the same as in BeautifulSoup.
-
ᴀʀᴍᴀɴ about 5 years
Beautiful Soup doesn't fetch a URL; it takes HTML and parses it.
-
QED about 5 years
@ᴀʀᴍᴀɴ If you post that, I'll accept it.
-
ᴀʀᴍᴀɴ about 5 years
@QED Actually, it is not an answer; it is just a hint that points the way. I think it is better left as a comment.
-
QED about 5 years
It led to the answer for me, but OK. Thanks.
-
QED about 5 years
Oops! You are right, and I should have known better. No problem, I'll just pass the original URL into my function. I'll accept your answer in a few minutes when SO allows it.