BeautifulSoup webscraping find_all( ): finding exact match

Solution 1

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.

As such, a plain class filter cannot limit the search to tags that have just that one class.

You'll have to use a custom function here to match against the class instead:

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. have just the one value.

Demo:

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
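
The equality test above is order-sensitive and tied to a single class. As a sketch of a more general variant, the hypothetical helper exact_classes below (not part of BeautifulSoup) builds a predicate that compares the class list as a set, so it also covers tags that must carry exactly two classes in either order. It assumes the built-in html.parser.

```python
from bs4 import BeautifulSoup

def exact_classes(name, *classes):
    """Build a find_all() predicate: the tag name must match and the tag's
    class set must equal `classes` exactly (order-insensitive).
    Hypothetical helper for illustration, not a BeautifulSoup API."""
    wanted = set(classes)

    def predicate(tag):
        # tag.get('class') is a list of class names, or None if absent
        return tag.name == name and set(tag.get('class') or []) == wanted

    return predicate

html = """
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
<div class="special product">Product 5</div>
"""
soup = BeautifulSoup(html, 'html.parser')

only_product = soup.find_all(exact_classes('div', 'product'))
both = soup.find_all(exact_classes('div', 'product', 'special'))
print([t.get_text() for t in only_product])
print([t.get_text() for t in both])
```

Comparing sets rather than lists is a design choice: class order carries no meaning in HTML, so "product special" and "special product" are treated as the same.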

For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
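
To see the effect of this table, the short sketch below (assuming the built-in html.parser and default parser settings) reads two listed attributes and one unlisted attribute from the same tag: class and rel come back as lists, while id stays a plain string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="main link" class="product special" rel="nofollow noopener" href="#">x</a>',
    'html.parser',
)
link = soup.a

class_value = link['class']  # listed under "*": split into a list
rel_value = link['rel']      # listed under "a": split into a list
id_value = link['id']        # not listed: kept as one string

print(class_value, rel_value, id_value)
```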

Solution 2

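You can use CSS selectors like so (note that this particular selector matches divs carrying both classes, i.e. the 'special' products):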

result = soup.select('div.product.special')
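
For the question as asked, two selector variants seem closer to the goal. This is a sketch assuming a BeautifulSoup version whose select() is backed by the soupsieve package (4.7 or later) and the built-in html.parser; the first selector excludes one named class, the second compares the whole class attribute string.

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, 'html.parser')

# Has class "product" but not class "special" (other extra classes still pass)
not_special = soup.select('div.product:not(.special)')

# The class attribute is exactly the string "product"
exact = soup.select('div[class="product"]')

print([d.get_text() for d in not_special])
print([d.get_text() for d in exact])
```

The :not() form only rules out the classes you name, so the attribute form is the stricter of the two when unknown extra classes may appear.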

Author: user2436815

Updated on October 10, 2020

Comments

  • user2436815 over 3 years

    I'm using Python and BeautifulSoup for web scraping.

    Let's say I have the following HTML code to scrape:

    <body>
        <div class="product">Product 1</div>
        <div class="product">Product 2</div>
        <div class="product special">Product 3</div>
        <div class="product special">Product 4</div>
    </body>
    

    Using BeautifulSoup, I want to find ONLY the products with the attribute class="product" (only Product 1 and 2), not the 'special' products.

    If I do the following:

    result = soup.find_all('div', {'class': 'product'})
    

    the result includes ALL the products (1,2,3, and 4).

    What should I do to find products whose class EXACTLY matches 'product'?


    The code I ran:

    from bs4 import BeautifulSoup
    import re
    
    text = """
    <body>
        <div class="product">Product 1</div>
        <div class="product">Product 2</div>
        <div class="product special">Product 3</div>
        <div class="product special">Product 4</div>
    </body>"""
    
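    soup = BeautifulSoup(text, 'html.parser')
    # The regex is tested against each class value separately, so "product" matches in every div
    result = soup.find_all(attrs={'class': re.compile(r"^product$")})
    print(result)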
    

    Output:

    [<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]
    
  • user2436815 about 10 years
    Thanks for the reply, but I'm trying to find the "product" divs, not the "product special" divs. Using soup.select('div.product.special') would return the 'special' products.
  • crunch about 10 years
    Oops, misread your question. Well, an alternative would be to remove the divs matching ".product.special"; then you can safely search for ".product" without encountering the others.
  • J0ANMM over 7 years
    Finally a solution that works!! I had two classes to match and was using soup.find_all('div', {'class': ['class1','class2']}), but it was also taking divs that had only class2. With this approach it does what I would expect. No idea why the one I was using was not working, though...
  • mike rodent over 3 years
    Can't you nonetheless use this approach with the :not pseudo-selector: div.product:not(.special)?
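
Two of the comments above can be checked with a short sketch (assuming the built-in html.parser; class1 and class2 are the placeholder names from the comment). A list passed as a class filter matches a tag that has any of the listed classes, which would explain the extra divs. Removing the unwanted divs first with decompose() leaves a tree in which a plain class search is safe, at the cost of modifying the soup in place.

```python
from bs4 import BeautifulSoup

html = """
<div class="class1 class2">both</div>
<div class="class2">only class2</div>
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# A list filter is an OR: a tag matches if ANY listed class is present
either = soup.find_all('div', {'class': ['class1', 'class2']})
either_text = [d.get_text() for d in either]

# Remove-then-search: drop divs carrying both classes, then search normally
for div in soup.select('div.product.special'):
    div.decompose()  # destructive: removes the tag from the tree
remaining = [d.get_text() for d in soup.find_all('div', class_='product')]

print(either_text)
print(remaining)
```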