Python using Beautiful Soup for HTML processing on specific content

10,881

Solution 1

# Python 2 / BeautifulSoup 3 code: urllib2 and the standalone BeautifulSoup
# module were renamed in Python 3 (urllib.request) and bs4 respectively.
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    # Grab the ingredients <div>, then the stripped text of each <li> inside it.
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]

    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()

results in

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
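For readers on Python 3, where urllib2 and the standalone BeautifulSoup module no longer exist, the same extraction can be sketched with nothing but the standard library's html.parser. This is a minimal, hypothetical sketch (it assumes the ingredients <div> contains no nested <div>), not a replacement for a real HTML library:

```python
from html.parser import HTMLParser

class IngredientParser(HTMLParser):
    """Collects the text of every <li> inside <div class="ingredients">."""

    def __init__(self):
        super().__init__()
        self.in_div = False   # currently inside the ingredients <div>?
        self.in_li = False    # currently inside an <li> within that <div>?
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('class') == 'ingredients':
            self.in_div = True
        elif tag == 'li' and self.in_div:
            self.in_li = True
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_div = False   # simplification: assumes no nested <div>
        elif tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data

# A tiny stand-in for the recipe page.
html = '''<div class="ingredients">
<ul>
<li> 1/4 cup olive oil </li>
<li> 1 cup chicken broth </li>
</ul>
</div>'''

p = IngredientParser()
p.feed(html)
ingreds = [s.strip() for s in p.items]
print('\n'.join(ingreds))
```

Event-driven parsers like this avoid building a full tree, but the bookkeeping of flags is exactly the kind of per-site adaptation the rest of this thread argues about.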



Follow-up response to @eyquem:

from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html

start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"

# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"

# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)

# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)

gives

Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s  - same = True
lxml parse took 0.0100940499505 s  - same = True

Regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, the BeautifulSoup parse is still only about 20% of the total runtime. If you are terribly concerned about speed, I recommend lxml instead.
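If lxml isn't installed, the standard library's xml.etree.ElementTree supports a limited XPath subset and can stand in for quick experiments, with the caveat that it only accepts well-formed markup (real pages usually need lxml.html or BeautifulSoup). A hypothetical sketch on a hand-made snippet:

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page's ingredients section.
snippet = """<body>
  <div class="ingredients">
    <ul>
      <li> 1/4 cup olive oil </li>
      <li> 1 cup chicken broth </li>
    </ul>
  </div>
</body>"""

root = ET.fromstring(snippet)
# Locate the div via ElementTree's XPath subset, then walk its <li> elements.
div = root.find('.//div[@class="ingredients"]')
ingreds = [li.text.strip() for li in div.iter('li')]
print('\n'.join(ingreds))
```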

Solution 2

Yes, a special regex pattern must be written for every site.

But I think that:

1- the processing done with Beautiful Soup must be adapted to every site, too;

2- regexes are not so complicated to write, and with a little practice it can be done quickly.

I am curious to see what processing would be needed with Beautiful Soup to obtain the same result that I got in a few minutes. I once tried to learn Beautiful Soup, but I couldn't make sense of it. I should try again, now that I am a little more skilled in Python. But regexes have been OK and sufficient for me until now.

Here's the code for this new site:

import urllib
import re

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'

sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

x = ch.find('Ingredients</h3>')

patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')

print '\n'.join(patingr.findall(ch,x))


EDIT

I downloaded and installed BeautifulSoup and ran a comparison with regex.

I don't think I made any error in my comparison code:

import urllib
import re
from time import clock
import BeautifulSoup

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()


te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te

te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te

print res1
print
print res2
print
print 'res1==res2 is ',res1==res2

print '\nRegex :',t1
print '\nBeautifulSoup :',t2
print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1

result

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste

1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste

res1==res2 is  True

Regex : 0.00210892725193

BeautifulSoup : 2.32453566026

BeautifulSoup execution time / Regex execution time == 1102.23605776

No comment!


EDIT 2

I realized that my code doesn't use a regex alone: it employs a method that combines a regex with find().

It's the method I use when I resort to regexes, because in some cases it raises the speed of processing: the function find() runs extremely fast and can locate the region of interest before the regex is applied.
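In current Python terms, the trick rests on the fact that a compiled pattern's findall() accepts a pos argument, so str.find() can cheaply locate the section marker and the regex only scans from there. A self-contained sketch on a fabricated page fragment:

```python
import re

# Fabricated fragment: one stray <li> before the ingredients marker, two after.
data = ('<li class="plaincharacterwrap">\r\n  not an ingredient</li>\r\n'
        'Ingredients</h3>\r\n'
        '<li class="plaincharacterwrap">\r\n  1/4 cup olive oil</li>\r\n'
        '<li class="plaincharacterwrap">\r\n  1 cup chicken broth</li>\r\n')

pat = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')

# str.find() locates the section marker very cheaply ...
x = data.find('Ingredients</h3>')
# ... and the pos argument makes the regex start scanning at that offset,
# skipping everything before the marker.
print(pat.findall(data, x))   # only the two real ingredients
print(pat.findall(data))      # the stray <li> leaks in
```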

To know what we are comparing, here are the code snippets.

In snippets 3 and 4, I took into account Achim's remarks from another thread: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".

These snippets are kept separate because they must be executed in different files to obtain reliable results: I don't know why, but if they are all executed in the same file, some of the resulting times differ strongly (0.00075 instead of 0.0022, for example).

import urllib
import re
import BeautifulSoup
from time import clock

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()

# Simple regex , without x
te = clock()
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res0 = '\n'.join(patingr.findall(data))
t0 = clock()-te

print '\nSimple regex , without x :',t0

and

# Simple regex , with x
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te

print '\nSimple regex , with x :',t1

and

# Regex with flags , without x and y
te = clock()
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res10 = '\n'.join(patingr.findall(data))
t10 = clock()-te

print '\nRegex with flags , without x and y :',t10

and

# Regex with flags , with x and y 
te = clock()
x = data.find('Ingredients</h3>')
y = data.find('h3>\r\n                    Footnotes</h3>\r\n')
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res11 = '\n'.join(patingr.findall(data,x,y))
t11 = clock()-te

print '\nRegex with flags , with x and y    :',t11

and

# BeautifulSoup
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te

print '\nBeautifulSoup                      :',t2

result

Simple regex , without x           : 0.00230488284125

Simple regex , with x              : 0.00229121279385

Regex with flags , without x and y : 0.00758719458758

Regex with flags , with x and y    : 0.00183724493364

BeautifulSoup                      : 2.58728860791

The use of x has no influence on the speed of the simple regex.

The regex with flags, without x and y, takes longer to execute, and its result isn't the same as the others, because it catches an extra chunk of text. That's why, in a real application, the regex with flags and x/y is the one that should be used.

The more complicated regex with flags, with x and y, takes 20% less time.

Overall, the results don't change much with or without x/y.
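The effect of x and y can be reproduced on a fabricated fragment: a compiled pattern's findall() also accepts an endpos argument, so y cuts the scan off before the Footnotes section that the flags-only regex mistakenly captures. A small illustrative sketch:

```python
import re

data = ('Ingredients</h3>\r\n'
        '<li class="plaincharacterwrap">\r\n  1 teaspoon dried basil</li>\r\n'
        'Footnotes</h3>\r\n'
        '<li class="plaincharacterwrap">\r\n  a footnote, not an ingredient</li>\r\n')

pat = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                 re.DOTALL | re.IGNORECASE)

x = data.find('Ingredients</h3>')
y = data.find('Footnotes</h3>')

# Bounded scan: stops before the footnotes section.
print(pat.findall(data, x, y))
# Unbounded scan: the footnote <li> is captured as well.
print(pat.findall(data, x))
```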

So my conclusion is the same:

the use of a regex, whether resorting to find() or not, remains roughly 1000 times faster than BeautifulSoup, and I estimate about 100 times faster than lxml (I haven't installed lxml).


To what you wrote, Hugh, I would say:

When a regex is wrong, it is neither faster nor slower. It simply doesn't run.

When a regex is wrong, the coder makes it right, that's all.

I don't understand why 95% of the people on stackoverflow.com want to persuade the other 5% that regexes must not be employed to analyse HTML or XML or anything else. I say "analyse", not "parse". As far as I understand it, a parser first analyses the WHOLE text and then displays the content of the elements we want. A regex, on the contrary, goes straight to what is searched for; it doesn't build the tree of the HTML/XML text, or do whatever else a parser does that I don't know very well.

So, I am very satisfied with regexes. I have no problem writing even very long REs, and regexes allow me to build programs that must react rapidly after analysing a text. BS or lxml would work, but they would be a hassle.

I would have other comments to make, but I have no time for a subject in which, in fact, I let others do as they prefer.

Author: Eric

Updated on June 05, 2022

Comments

  • Eric
    Eric almost 2 years

    So I decided to parse content from a website. For example, http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

    I want to parse the ingredients into a text file. The ingredients are located in:

    <div class="ingredients" style="margin-top: 10px;">

    and within this, each ingredient is stored between

    <li class="plaincharacterwrap">

    Someone was nice enough to provide code using regex, but it gets confusing when you are modifying it from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features; I'm just confused about how to actually do it.

    Code:

    import re
    import urllib2, sys
    from BeautifulSoup import BeautifulSoup, NavigableString

    html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
    soup = BeautifulSoup(html)

    try:
        ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
    except IOError:
        print 'IO error'


    Is this kind of how you get started? I want to find the actual div class and then parse out all those ingredients located within the li class.

    Any help would be appreciated! Thanks!

  • Eric
    Eric about 13 years
    Thanks again, eyquem! I am very new to Python and only started programming this year (at least with parsing); before that I had just done small programs and such. But it seems Python is very good at handling this kind of stuff.
  • BasedRebel
    BasedRebel about 13 years
    In general, parsing HTML with regex is evil (see the canonical response at stackoverflow.com/questions/1732348/…) - HTML parses as an annotated tree, and regex cannot properly handle this. Yes, in limited cases you can hammer a nail in with a screwdriver - but why would you want to?
  • eyquem
    eyquem about 13 years
    Beautiful Soup seems very simple to use, indeed. For this case, regex and BS are equivalently easy, but I think BS is probably easier to manage in more complex cases. I will get around to learning BS one day.
  • eyquem
    eyquem about 13 years
    @Hugh Bothwell I am tired of constantly seeing references to that 4352-times-upvoted post. That post is 98% amazing literature. Here is the other 2%: "HTML is not a regular language and hence cannot be parsed by regular expressions." That's very little explanation. Hugh, I find your code more convincing than the cited post. No, it is not only in limited cases that one may choose to use a hammer on a screw instead of a screwdriver: it is every time one wants the program to run faster, as you'll see in the edit to my post.