BeautifulSoup webscraping find_all( ): finding exact match

python html regex web-scraping beautifulsoup


Solution 1

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as multi-valued: its value is parsed into a list, and a search matches against the individual values listed in the attribute. This follows the HTML standard.

As such, a plain class filter cannot limit the search to tags that have only that one class; any tag that lists the class among its values will match.
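To make this concrete, here is a minimal sketch (the two-div snippet and the variable names are just for illustration, and it assumes the built-in 'html.parser') showing how the class attribute is stored and why a plain class filter also picks up the 'special' div:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each div's class attribute is parsed into a list of values.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists)  # [['product'], ['product', 'special']]

# A plain class filter matches a tag if any one of its class values matches,
# so both divs are returned here.
matches = soup.find_all("div", class_="product")
print(len(matches))  # 2
```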

You'll have to use a custom function here to match against the class instead:

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. have just the one value.

Demo:

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
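If you need an exact match on more than one class, comparing lists becomes order-sensitive. A possible variation, sketched here with a hypothetical helper named has_exact_classes (not part of BeautifulSoup), compares the classes as sets instead:

```python
from bs4 import BeautifulSoup


def has_exact_classes(tag, name, classes):
    """Return True if tag is a <name> element whose set of classes equals `classes`."""
    # tag.get('class', []) falls back to an empty list for tags without a class.
    return tag.name == name and set(tag.get("class", [])) == set(classes)


text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(text, "html.parser")

# Only divs whose class is exactly 'product'.
only_product = soup.find_all(lambda tag: has_exact_classes(tag, "div", ["product"]))
# Only divs with exactly these two classes, in any order.
both = soup.find_all(lambda tag: has_exact_classes(tag, "div", ["special", "product"]))

print([t.get_text() for t in only_product])  # ['Product 1', 'Product 2']
print([t.get_text() for t in both])          # ['Product 3', 'Product 4']
```

The set comparison ignores the order in which the classes appear in the markup; use the list comparison from the answer if the order matters to you.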

For completeness' sake, here are all such multi-valued attributes, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }

Solution 2

You can use CSS selectors like so:

result = soup.select('div.product.special')
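Note that 'div.product.special' selects the divs carrying both classes. A small sketch of the difference follows, together with the :not() variant that excludes the 'special' divs. It assumes a BeautifulSoup version whose select() supports :not() (4.7 or later, where selectors are handled by the soupsieve package); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(text, "html.parser")

# Divs that have BOTH the 'product' and 'special' classes.
special = soup.select("div.product.special")

# Divs that have 'product' but NOT 'special'.
plain = soup.select("div.product:not(.special)")

print([t.get_text() for t in special])  # ['Product 3', 'Product 4']
print([t.get_text() for t in plain])    # ['Product 1', 'Product 2']
```

Keep in mind that :not(.special) only excludes that one class; a div with some other extra class, such as class="product featured", would still be selected.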


Author by

user2436815

Updated on October 10, 2020

Comments

user2436815 over 3 years

I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find ONLY the products with the attribute class="product" (only Product 1 and 2), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

the result includes ALL the products (1, 2, 3, and 4).

What should I do to find products whose class EXACTLY matches 'product'?

The code I ran:

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

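soup = BeautifulSoup(text, 'html.parser')
# The regex is tested against each class value separately, so '^product$'
# still matches the 'product' value of the "product special" divs.
result = soup.find_all(attrs={'class': re.compile(r"^product$")})
print(result)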

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

user2436815 about 10 years

Thanks for the reply, but I'm trying to find the "product" divs, not the "product special" divs. Using soup.select('div.product.special') would return the 'special' products.

crunch about 10 years

Oops, misread your question. Well, an alternative would be to remove the divs matching ".product.special"; then you can safely search for ".product" without encountering the others.

J0ANMM over 7 years

Finally a solution that works!! I had two classes to match and was using soup.find_all('div', {'class': ['class1','class2']}) but it was also taking divs that had only class2. With the lambda it is doing what I would expect. No idea why the one I was using was not working though...

mike rodent over 3 years

Can't you nonetheless use this approach with the :not pseudo-class selector: div.product:not(.special)?