Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?

python html web-scraping beautifulsoup

134,017

Solution 1

The 'a' tag in your html does not have any text directly, but it contains a 'h3' tag that has text. This means that text is None, and .find_all() fails to select the tag. Generally do not use the text parameter if a tag contains any other html elements except text content.

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)

If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]

Solution 2

You can also use attrs to get the href tag with regex search

soup.find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']

Solution 3

First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all

Solution 4

You could solve this with just a couple lines of gazpacho:


from gazpacho import Soup

html = """\
<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>
"""

soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']

Which would output:

'/file-one/additional'

View more solutions

134,017

Author by

Admin

Updated on October 10, 2020

Comments

Admin over 3 years

I have the following:

  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

And would like to get just the text of href which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]

print “Link: “ + link_text

But it just prints a blank, nothing. Just Link:. So I tested it out on another site but with a different HTML, and it worked.

What could I be doing wrong? Or is there a possibility that the site intentionally programmed to not return the href?

Thank you in advance and will be sure to upvote/accept answer!

MITHU over 4 years

Thought to inform you about a question that I find trouble figuring out myself. I'll be very glad if you give this post a go. Thanks.
Jean Monet over 3 years

Do you know why calling directly .href does not work, but .attrs['href'] works fine? I just spent 15 min to debug this :(