Python 3, beautiful soup, get next tag


I want to get all the href links in this document that come directly after the div tag with the class "product-list-item".

To find the first <a href> element in the <div>:

links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True)  # first <a href> anywhere inside <div>
    if a is not None:
        links.append(a['href'])

This assumes that the link is inside the <div>. Any elements in the <div> that come before the first <a href> are skipped.
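
As a self-contained illustration of the loop above, here is a minimal sketch. The sample markup is hypothetical (modelled on the fragment quoted in the question, with closing tags and two extra blocks added), and the built-in "html.parser" backend is an assumed choice:

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the fragment quoted in the question.
html = """
<div class="product-list-item  margin-bottom">
  <a title="titleexample" href="http://www.urlexample.com/example_1">one</a>
</div>
<div class="product-list-item  margin-bottom">
  <span>no link here</span>
</div>
<div class="product-list-item  margin-bottom">
  <p>intro</p>
  <a href="http://www.urlexample.com/example_2">two</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all("div", class_="product-list-item"):
    a = div.find("a", href=True)  # first <a href> among the div's descendants
    if a is not None:             # the second div has no link and is skipped
        links.append(a["href"])

print(links)
```

The second block contributes nothing, and in the third block the <p> before the link is simply passed over.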

If you'd like, you can be stricter about it, e.g., by taking the link only if it is the first child of the <div>:

a = div.contents[0] if div.contents else None  # very first child, even if it is not a Tag
if getattr(a, 'name', None) == 'a' and a.has_attr('href'):
    links.append(a['href'])
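
One thing to watch with the strict check: in the markup from the question there is a line break between the <div> tag and the <a> tag, so the very first child is a text node, not the link. The sketch below contrasts the strict check with a relaxed variant that looks at the first child tag instead. Both helper names (first_child_link, first_tag_link) and the sample HTML are made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: the first div has a newline before <a>, the second does not.
html = (
    '<div class="product-list-item">\n'
    '<a href="http://www.urlexample.com/example_1">one</a></div>'
    '<div class="product-list-item">'
    '<a href="http://www.urlexample.com/example_2">two</a></div>'
)
soup = BeautifulSoup(html, "html.parser")


def first_child_link(div):
    """Return the href if the very first child node of div is <a href>, else None."""
    first = div.contents[0] if div.contents else None
    if isinstance(first, Tag) and first.name == "a" and first.has_attr("href"):
        return first["href"]
    return None


def first_tag_link(div):
    """Same check, but text nodes (such as whitespace) before the first tag are skipped."""
    first = div.find(True, recursive=False)  # first direct child that is a Tag
    if first is not None and first.name == "a" and first.has_attr("href"):
        return first["href"]
    return None


divs = soup.find_all("div", class_="product-list-item")
strict = [first_child_link(d) for d in divs]
relaxed = [first_tag_link(d) for d in divs]
print(strict)
print(relaxed)
```

With this input the strict version rejects the first block because of the leading newline, while the relaxed version accepts both.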

Or if <a> is not inside <div>:

a = div.find_next('a', href=True)  # first <a href> that appears after the <div> tag
if a is not None:
    links.append(a['href'])
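
To make the difference between find and find_next concrete, here is a small sketch on hypothetical markup where the link is a sibling of the <div> and not a child:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the link sits after the div, not inside it.
html = (
    '<div class="product-list-item"></div>'
    '<a href="http://www.urlexample.com/example_1">outside</a>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="product-list-item")

inside = div.find("a", href=True)      # searches only the div's descendants
after = div.find_next("a", href=True)  # searches everything that follows the div's opening tag

print(inside)
print(after["href"])
```

Note that find_next walks the rest of the document in order, so it also matches a link inside the <div>, and if a block has no link of its own it will return the first link of a later block.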

There are many ways to search and navigate in BeautifulSoup.

If you parse with lxml.html instead, you can also use XPath and CSS selector expressions, if you are familiar with them.
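
For the lxml.html route, a minimal XPath sketch might look like this. It assumes lxml is installed, and the sample markup is again hypothetical. The class test uses the usual concat/normalize-space idiom because the class attribute holds several space-separated names:

```python
import lxml.html

# Hypothetical sample modelled on the fragment quoted in the question.
html = """
<div>
<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1">one</a>
</div>
<div class="product-list-item  margin-bottom">
<a href="http://www.urlexample.com/example_2">two</a>
</div>
</div>
"""

doc = lxml.html.fromstring(html)

# For every div whose class list contains "product-list-item",
# take the href of its first direct <a href> child.
xpath = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " product-list-item ")]'
    '/a[@href][1]/@href'
)
links = doc.xpath(xpath)
print(links)
```

Unlike div.find in BeautifulSoup, the /a step here only matches direct children of the <div>; use //a instead if nested links should count as well.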

Author: user136036

Updated on June 04, 2022

Comments

  • user136036, almost 2 years ago

    I have the following HTML fragment, which repeats several times with different href links:

    <div class="product-list-item  margin-bottom">
    <a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
    

    Now I want to get all the href links in this document that come directly after the div tag with the class "product-list-item". I'm pretty new to BeautifulSoup, and nothing I came up with worked.

    Thanks for your ideas.

    EDIT: It does not really have to be BeautifulSoup; if it can be done with regex and the Python HTML parser, that is also OK.

    EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):

    soup = bs4.BeautifulSoup(htmlsource)
    x = soup.find_all("div")
    for i in range(len(x)):
        if x[i].get("class") and "product-list-item" in x[i].get("class"):
            print(x[i].get("class"))
    

    This will give me a list of all the "product-list-item" but then I tried something like

    print(x[i].get("class").next_element)
    

    Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:

    print(x[i][0].get("class").next_element)
    

    Which led to this error: return self.attrs[key] KeyError: 0. Also tried with .find_all("href") and .get("href") but this all leads to the same errors.

    EDIT3: OK, it seems I found out how to solve it. Now I did:

    x = soup.find_all("div")
    
    for i in range(len(x)):    
        if x[i].get("class") and "product-list-item" in x[i].get("class"):
            print(x[i].next_element.next_element.get("href"))
    

    This can also be shortened by passing another argument to the find_all function:

    x = soup.find_all("div", "product-list-item")
    for i in x:
        print(i.next_element.next_element.get("href"))
    

    greetings