Using bs4 to extract text in html files

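You shouldn't call open first; just pass the file name to urlopen. It expects a path or URL string, not a file object, which is why it fails when it calls .strip() on what you gave it: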

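import bs4, sys
from urllib import urlopen  # Python 2; in Python 3 this lives in urllib.request

# urlopen takes a path or URL string, not an open file object
webpage = urlopen(sys.argv[1]).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage, 'html.parser')  # name the parser explicitly
for node in soup.findAll('html'):
    # join every text node under <html>, then encode for printing
    print u''.join(node.findAll(text=True)).encode('utf-8')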
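
If you are on Python 3 instead, the same idea needs two changes: urlopen lives in urllib.request, and it will not take a bare file path, only a URL. Here is a minimal sketch of the reading step under those assumptions. The helper name read_local_html is made up for illustration, and it assumes the file is UTF-8 encoded:

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_html(path, encoding="utf-8"):
    """Return the decoded contents of a local HTML file, read through urlopen."""
    # urlopen needs a URL string, so turn the path into a file:// URI first.
    url = Path(path).resolve().as_uri()
    with urlopen(url) as response:
        # read() returns bytes; decode them to get text.
        return response.read().decode(encoding)
```

You would call it as read_local_html(sys.argv[1]) and hand the result to BeautifulSoup as before.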

FYI, you don't need urllib for opening local files:

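import bs4, sys

# open() already gives you the file; read it and decode the bytes (Python 2)
with open(sys.argv[1], 'r') as f:
    webpage = f.read().decode('utf-8')

soup = bs4.BeautifulSoup(webpage, 'html.parser')  # name the parser explicitly
for node in soup.findAll('html'):
    # join every text node under <html>, then encode for printing
    print u''.join(node.findAll(text=True)).encode('utf-8')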
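
For what it's worth, here is roughly how the open-based version might look on Python 3. This is a sketch, not a drop-in replacement: the function names extract_text and extract_text_from_file are made up for illustration, it assumes a UTF-8 file, and it uses bs4's get_text() in place of joining findAll(text=True) by hand:

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the text content of an HTML document as one string."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning;
    # html.parser ships with Python, so no extra install is assumed.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()


def extract_text_from_file(path, encoding="utf-8"):
    """Read a local HTML file and return its text content."""
    # In Python 3, open() decodes while reading, so there is no .decode() call.
    with open(path, "r", encoding=encoding) as f:
        return extract_text(f.read())
```

Calling print(extract_text_from_file(sys.argv[1])) would then cover the command-line case. In Python 3, print takes a str directly, so the trailing .encode('utf-8') goes away too.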

Hope that helps.

Author by

Iykeln

Always like upgrading my knowledge-base in the programming world.

Updated on June 04, 2022

Comments

  • Iykeln almost 2 years

    I want to extract text from my HTML files. If I use the code below for a specific file:

    import bs4, sys
    from urllib import urlopen
    #filin = open(sys.argv[1], 'r')
    filin = '/home/iykeln/Desktop/R_work/file1.html' 
    webpage = urlopen(filin).read().decode('utf-8')
    soup = bs4.BeautifulSoup(webpage)
    for node in soup.findAll('html'):
        print u''.join(node.findAll(text=True)).encode('utf-8')
    

    it works. But if I try either version below, which opens the file named on the command line with open(sys.argv[1], 'r'):

    import bs4, sys
    from urllib import urlopen
    filin = open(sys.argv[1], 'r')
    #filin = '/home/iykeln/Desktop/R_work/file1.html' 
    webpage = urlopen(filin).read().decode('utf-8')
    soup = bs4.BeautifulSoup(webpage)
    for node in soup.findAll('html'):
        print u''.join(node.findAll(text=True)).encode('utf-8')
    

    OR

    import bs4, sys
    from urllib import urlopen
    with open(sys.argv[1], 'r') as filin:
        webpage = urlopen(filin).read().decode('utf-8')
        soup = bs4.BeautifulSoup(webpage)
        for node in soup.findAll('html'):
            print u''.join(node.findAll(text=True)).encode('utf-8')
    

    I get the error below:

    Traceback (most recent call last):
      File "/home/iykeln/Desktop/py/clean.py", line 5, in <module>
        webpage = urlopen(filin).read().decode('utf-8')
      File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
        return opener.open(url)
      File "/usr/lib/python2.7/urllib.py", line 180, in open
        fullurl = unwrap(toBytes(fullurl))
      File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap
        url = url.strip()
    AttributeError: 'file' object has no attribute 'strip'