Iterate through elements in an HTML tree using BeautifulSoup in Python, and produce output that maintains the relative position of each element?


To find all <div> elements whose class attribute matches any value from a given list:

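#!/usr/bin/env python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

with open('input.xml', 'rb') as file:
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(file, 'html.parser')

# Passing a list to class_ matches any one of the classes; results come back in document order.
elements = soup.find_all("div", class_="header name quantity".split())
print("\n".join("{} {}".format(el['class'], el.get_text()) for el in elements))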

Output

['header']  content 
['name']  content 
['quantity']  content 
['name']  content 
['quantity']  content 
['header']  content2 
['name']  content2 
['quantity']  content2 
['name']  content2 
['quantity']  content2 

There are also other methods that allow you to search and traverse HTML elements.
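As a rough illustration of two such alternatives, here is a minimal sketch that mirrors the Jsoup loop from the question: it walks every tag in document order and keeps those carrying one of the wanted classes, then expresses the same filter as a CSS selector. The inline `html` string and the names `wanted`, `menu_elements`, and `selected` are assumptions made for this example; the `html.parser` backend is assumed as well.

```python
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

# Sample markup shortened from the question's example tree (assumed input).
html = """
<div class="header"> content </div>
<div class="name"> content </div>
<div class="quantity"> content </div>
<div class="header"> content2 </div>
<div class="name"> content2 </div>
<div class="quantity"> content2 </div>
"""

wanted = {"header", "name", "quantity"}
soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag in document order, much like select("*") in Jsoup.
# el.get("class", []) avoids a KeyError for tags that have no class attribute.
menu_elements = [
    el for el in soup.find_all(True)
    if wanted.intersection(el.get("class", []))
]

# The same filter written as a CSS selector (restricted to <div> here).
selected = soup.select("div.header, div.name, div.quantity")

for el in menu_elements:
    print(el["class"], el.get_text(strip=True))
```

Both approaches keep the elements in the order they appear in the document, so each name and quantity stays after the header it belongs to.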

Author by

Christian

Updated on July 05, 2022

Comments

  • Christian
    Christian almost 2 years

    I have this code that does what I need, using Jsoup in Java:

    Elements htmlTree = doc.body().select("*");
    Elements menuElements = new Elements();

    for (Element element : htmlTree) {
        if (element.hasClass("header"))
            menuElements.add(element);
        if (element.hasClass("name"))
            menuElements.add(element);
        if (element.hasClass("quantity"))
            menuElements.add(element);
    }
    

    I want to do the same thing but in Python using BeautifulSoup. An example tree of the HTML I'm trying to scrape follows:

    <div class="header"> content </div>
         <div class="name"> content </div>
         <div class="quantity"> content </div>
         <div class="name"> content </div>
         <div class="quantity"> content </div>
    <div class="header"> content2 </div>
         <div class="name"> content2 </div>
         <div class="quantity"> content2 </div>
         <div class="name"> content2 </div>
         <div class="quantity"> content2 </div>
    

    etc.

    Basically, I want the output to preserve the relative position of each element. How would I go about doing that using Python and BeautifulSoup?

    EDIT:

    This is the Python code I have (it's very naive), but maybe it can help:

    output = []
    
    for e in soup :
      if e["class"] == "pickmenucolmenucat" :
        output.append(e)
      if e["class"] == "pickmenucoldispname" :
        output.append(e)
      if e["class"] == "pickmenucolportions" :
        output.append(e)