HTML parsing with BeautifulSoup 4 and Python

15,178

It should be something like:

for link in soup.findAll('div', {'class': 'listEntry '}):
    print link

Attributes are specified in a dictionary: findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
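To make the two call styles concrete, here is a minimal sketch in Python 3 with the bs4 spelling find_all. The HTML snippet is invented for illustration and is not the real mobile.de markup, and the built-in "html.parser" backend is an assumption.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the real result page (assumption).
SAMPLE_HTML = """
<div class="listEntry"><a href="/car/1">Car 1</a></div>
<div class="listEntry"><a href="/car/2">Car 2</a></div>
<div class="btn">Search</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Style 1: attributes given as a dictionary.
by_dict = soup.find_all("div", {"class": "listEntry"})

# Style 2: bs4 treats a plain string in the second position as a CSS class.
by_string = soup.find_all("div", "listEntry")

print(len(by_dict), len(by_string))
```

Both calls should select the same two divs here, so either form can be used in bs4.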

==========

Update: Sorry, it seems that in bs4 you can do it that way as well.

As for the fault: the form you're looking for is not among the results because, as far as I can see, it encloses the listEntry divs rather than being one of them.

What's wrong with this?

form = soup.find('form', id='parkAndCompareVehicle')
print len(form.find_all('div', 'listEntry '))
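A hedged sketch of the same idea with a guard added: find returns None when nothing matches, which is what produces an AttributeError like the one quoted in the comments. The sample markup and the helper name count_entries are invented for illustration, and the code is written for Python 3.

```python
from bs4 import BeautifulSoup

# Invented sample markup; only the form id is taken from the question.
SAMPLE_HTML = """
<form id="parkAndCompareVehicle" name="parkAndCompareVehicle" action="">
  <div class="listEntry">Car 1</div>
  <div class="listEntry">Car 2</div>
</form>
"""


def count_entries(html, form_id="parkAndCompareVehicle"):
    """Count listEntry divs inside the form, or return None if the form is absent."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", id=form_id)
    if form is None:
        # The parser did not produce the form, so there is nothing to search in.
        return None
    return len(form.find_all("div", "listEntry"))


found = count_entries(SAMPLE_HTML)
missing = count_entries("<p>no form here</p>")
print(found, missing)
```

Checking for None first separates "the form was never parsed" from "the form is there but contains no entries".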
Author by

user1010775

Updated on June 27, 2022

Comments

  • user1010775
    user1010775 almost 2 years

    I am trying to parse the result list of http://mobile.de.

    First I tried it with the HTMLParser class, but I got an error: HTMLParser.HTMLParseError: EOF in middle of construct.

    So I tried it with BeautifulSoup 4, which copes better with invalid markup, but the <div> I’m searching for isn’t accessible, and I can’t tell if it’s my fault or the website’s.

    from bs4 import BeautifulSoup
    import urllib
    import socket

    searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
    f = urllib.urlopen(searchurl)
    html = f.read()
    soup = BeautifulSoup(html)

    for link in soup.find_all("div","listEntry "):
        print link
    

    listEntry is the <div> that holds each car in the results. But it seems that the parser isn’t picking up <form id="parkAndCompareVehicle" name="parkAndCompareVehicle" action="">; I can’t find the form in the soup object.

    Where is the fault?

  • Joe Dargie
    Joe Dargie about 12 years
    That’s the stuff — see crummy.com/software/BeautifulSoup/bs3/…**kwargs)
  • user1010775
    user1010775 about 12 years
    It looks (findAll) like you mean BeautifulSoup 3, but I am using version 4. If I use find_all('div','btn') I get the content of the btn div, so that isn't the fault. See the bs 4.0 docs.
  • user1010775
    user1010775 about 12 years
    print len(form.findAll('div', 'listEntry ')) gives AttributeError: 'NoneType' object has no attribute 'findAll'. But bs3 seems to work; now I only need to get the href out of it. Any idea how? In bs4 it's easy with .get('href')...
  • gorlum0
    gorlum0 about 12 years
    @user1010775: Same get, or even simpler - soup.a['href']. But this code with find_all does work for me (correctly prints 20)! Maybe they banned you or something?)
  • user1010775
    user1010775 about 12 years
    If they had banned me, I couldn't get it with bs3 either. I got it working now. Thanks to all!
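For the href question raised in the comments, a minimal bs4 sketch in Python 3 of the two access styles mentioned there. The markup is invented for illustration; the real result page may nest its links differently.

```python
from bs4 import BeautifulSoup

# Invented sample markup (assumption); the second entry has no href on purpose.
SAMPLE_HTML = """
<div class="listEntry"><a href="/car/1">Car 1</a></div>
<div class="listEntry"><a>No link target</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

hrefs = []
for entry in soup.find_all("div", "listEntry"):
    anchor = entry.find("a")
    if anchor is not None:
        # .get returns None instead of raising when the attribute is missing.
        hrefs.append(anchor.get("href"))

# Dictionary-style access works too, but raises KeyError if href is absent.
first_href = soup.a["href"]

print(hrefs, first_href)
```

Using .get is the more forgiving choice when some links may lack an href; the subscript form is shorter when the attribute is known to exist.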