Using bs4 to extract text in html files
You shouldn't call open; just pass the file name to urlopen:
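import bs4, sys
from urllib import urlopen

# Python 2: urllib.urlopen accepts a local file path as well as a URL.
webpage = urlopen(sys.argv[1]).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    # Join every text node under <html> and print it as UTF-8 bytes.
    print u''.join(node.findAll(text=True)).encode('utf-8')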
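The snippet above targets Python 2. On Python 3, urllib.request.urlopen expects a full URL rather than a bare path, so the same idea needs the path turned into a file:// URI first. A minimal sketch, assuming Python 3 and a UTF-8 encoded file; the helper name read_local_html is made up for illustration and is not part of urllib:

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_html(path):
    """Read a local HTML file through urlopen and return it as text.

    read_local_html is a hypothetical helper name, not a urllib function.
    """
    # urlopen wants a URL, so convert the (absolute) path to a file:// URI.
    uri = Path(path).resolve().as_uri()
    with urlopen(uri) as response:
        # Assumption: the file is UTF-8 encoded, as in the answer's code.
        return response.read().decode('utf-8')
```

It could be called as read_local_html(sys.argv[1]), and the returned string passed on to bs4.BeautifulSoup.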
FYI, you don't need urllib for opening local files:
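import bs4, sys

# Read the local file directly; no urllib needed.
with open(sys.argv[1], 'r') as f:
    webpage = f.read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    # Join every text node under <html> and print it as UTF-8 bytes.
    print u''.join(node.findAll(text=True)).encode('utf-8')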
Hope that helps.
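For readers on Python 3, the direct-open approach might look roughly like the sketch below. It assumes the bs4 package is installed and that the file is UTF-8 encoded; html_file_to_text is a hypothetical helper name, and get_text() is used here in place of joining findAll(text=True) by hand:

```python
import bs4


def html_file_to_text(path):
    """Return all text found in a local HTML file.

    html_file_to_text is a hypothetical helper name used for illustration.
    """
    # Open the file directly; assumption: it is UTF-8 encoded.
    with open(path, encoding='utf-8') as f:
        # BeautifulSoup accepts an open file handle. Naming the parser
        # explicitly avoids the warning bs4 emits when it has to guess one.
        soup = bs4.BeautifulSoup(f, 'html.parser')
    # get_text() concatenates every text node in the document.
    return soup.get_text()
```

A script could then end with print(html_file_to_text(sys.argv[1])).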
Author: Iykeln
Updated on June 04, 2022
Comments
Iykeln almost 2 years
I want to extract text from my HTML files. If I use the code below with a hard-coded file path:
import bs4, sys
from urllib import urlopen
#filin = open(sys.argv[1], 'r')
filin = '/home/iykeln/Desktop/R_work/file1.html'
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
it works. But when I try the code below to handle arbitrary files, using open(sys.argv[1], 'r'):
import bs4, sys
from urllib import urlopen
filin = open(sys.argv[1], 'r')
#filin = '/home/iykeln/Desktop/R_work/file1.html'
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
OR
import bs4, sys
from urllib import urlopen
with open(sys.argv[1], 'r') as filin:
    webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
I get the error below:
Traceback (most recent call last):
  File "/home/iykeln/Desktop/py/clean.py", line 5, in <module>
    webpage = urlopen(filin).read().decode('utf-8')
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 180, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap
    url = url.strip()
AttributeError: 'file' object has no attribute 'strip'