Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?
Solution 1
The 'a' tag in your html does not contain any text directly; it contains an 'h3' tag that holds the text, plus surrounding whitespace. Because the 'a' tag has more than one child, its .string is None, so .find_all() with text=True fails to select it. As a general rule, do not use the text parameter if a tag contains anything other than plain text content.
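To make the failure concrete, here is a minimal sketch using a trimmed copy of the question's markup (with straight quotes). It uses the string keyword, which is the newer name for text in Beautiful Soup; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

# The anchor has several children (whitespace, the <h3>, more whitespace),
# so .string, which the text/string filter is compared against, is None.
string_value = a.string

# string=True is the newer spelling of text=True.
matches_with_text = soup.find_all("a", href=True, string=True)
matches_without_text = soup.find_all("a", href=True)

print(string_value)
print(matches_with_text)
print([m["href"] for m in matches_without_text])
```

The nested text is still reachable through a.text or a.get_text(), which is why the loop-based check works.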
You can resolve this issue by selecting elements using only the tag's name (and the href keyword argument), then adding a condition in the loop to check whether they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
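As a quick comparison of the approaches above, the following sketch runs them against a small made-up snippet (the markup and the /empty link are assumptions for illustration). It also strips whitespace before the text check, since an anchor that holds only whitespace would otherwise count as having text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor without href, one with nested text,
# and one with an href but no text at all.
html = """
<a name="top"></a>
<a href="/file-one/additional"><h3>File One</h3></a>
<a href="/empty"></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Both of these skip the anchor that has no href attribute.
via_find_all = [a["href"] for a in soup.find_all("a", href=True)]
via_select = [a["href"] for a in soup.select("a[href]")]

# Keep only anchors whose text is non-empty after stripping whitespace.
with_text = [a["href"] for a in soup.find_all("a", href=True) if a.text.strip()]

print(via_find_all)
print(via_select)
print(with_text)
```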
Solution 2
You can also use attrs to get the href attribute, with a regex search (this needs import re):
import re
soup.find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
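A self-contained sketch of the same idea follows. The sample markup is an assumption, and the pattern is a shorter equivalent of the one above (a slash, a letter, then word characters). Since .find() returns None when nothing matches, the sketch guards against that before reading the attribute.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the first anchor's href should not match the pattern.
html = '<a href="#">skip</a><a href="/file-one/additional"><h3>File One</h3></a>'
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern passed as an attribute filter is applied with search(),
# so it may match anywhere inside the href value.
pattern = re.compile(r"/[A-Za-z]\w+")

tag = soup.find("a", href=pattern)
href = tag["href"] if tag is not None else None
print(href)
```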
Solution 3
First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all call.
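Putting both suggestions together, the question's code might look like the sketch below. It assumes Python 3 (hence print as a function), uses straight quotes throughout, and drops the text filter.

```python
from bs4 import BeautifulSoup

# The question's markup, retyped with straight quotes.
html = '''<div class="file-one"> <a href="/file-one/additional" class="file-link"> <h3 class="file-name">File One</h3> </a> <div class="location"> Down </div> </div>'''

soup = BeautifulSoup(html, "html.parser")

link_text = ""
# No text=True here: the anchor's text lives inside the nested <h3>.
for a in soup.find_all("a", href=True):
    link_text = a["href"]

print("Link: " + link_text)
```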
Solution 4
You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'
Admin
Updated on October 10, 2020
Comments
-
Admin over 3 years
I have the following:
html = '''<div class=“file-one”> <a href=“/file-one/additional” class=“file-link"> <h3 class=“file-name”>File One</h3> </a> <div class=“location”> Down </div> </div>'''
And would like to get just the text of href, which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]
print “Link: “ + link_text
But it just prints a blank, nothing. Just Link: . So I tested it out on another site with different HTML, and it worked.
What could I be doing wrong? Or is there a possibility that the site is intentionally programmed to not return the href?
Thank you in advance and will be sure to upvote/accept answer!
-
MITHU over 4 years
Thought to inform you about a question that I find trouble figuring out myself. I'll be very glad if you give this post a go. Thanks.
-
Jean Monet over 3 years
Do you know why calling .href directly does not work, but .attrs['href'] works fine? I just spent 15 min to debug this :(
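For BeautifulSoup at least, a likely explanation is that dotted access on a tag looks for a child tag with that name rather than an attribute, so .href searches for a nested <href> element and comes back empty. A minimal sketch, with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/file-one/additional"><h3>File One</h3></a>', "html.parser"
)
a = soup.find("a")

# Dotted access is shorthand for a.find("<name>"): it looks for a child tag.
child_lookup = a.href      # no <href> child tag exists, so this is None
nested_tag = a.h3          # the <h3> child tag

# Attributes are read with dictionary-style access instead.
by_index = a["href"]
by_attrs = a.attrs["href"]
by_get = a.get("href")     # returns None instead of raising if missing

print(child_lookup, nested_tag.name, by_index)
```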