Search for a string inside HTML source with Python (3.3.1)
Solution 1
I'd recommend using a library such as Beautiful Soup if it's HTML you want to parse. No need for regex.
EDIT
Using the URL you just added, this is the sample code to get the HTML object out:
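import re
import urllib.request
from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)
# Download the page; read() returns bytes, which Beautiful Soup accepts directly
data = urllib.request.urlopen('http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07').read()
soup = BeautifulSoup(data, 'html.parser')
# Find the first <span> whose class matches txt_resultad_busca_casamento
element = soup.find('span', attrs={'class': re.compile(r".*\btxt_resultad_busca_casamento\b.*")})
print(element.text)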
This will find the HTML span element on the page that has the class txt_resultad_busca_casamento, which I believe is the data you're trying to extract. From there you can just parse the .text attribute to get the exact data you're interested in.
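As a rough sketch of that last step, assuming element.text holds a string shaped like the sample in the question, the count can be pulled out with plain string methods and no regex. The helper name first_integer and the variable text are made up for illustration, and the real page text may differ:

```python
def first_integer(text):
    """Return the first whitespace-separated token made only of digits, or None."""
    for token in text.split():
        if token.isdigit():
            return int(token)
    return None


# Stand-in for element.text (assumed shape, copied from the question).
text = 'Resultado de Busca: Foram encontrados 264 casais'
print(first_integer(text))
```

This takes the first all-digit token, so it would need adjusting if the sentence ever contained another number before the count.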
EDIT 2
Oops, just realised that uses regular expressions... it seems class matching in BeautifulSoup isn't perfect! This line should work instead, at least until the site changes their HTML:
element = soup.find('div', attrs={'id': 'ctl00_body_uppBusca'}).find('span')
Solution 2
Given that you can't reliably parse HTML with regular expressions, if you treat the page as a bag of text you have to use either a regex or plain string operations, something like:
a = 'Resultado de Busca: Foram encontrados 264 casais' #your page text
num = int(a[a.index("encontrados")+len("encontrados"):a.index("casais")])
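One caveat with index: it raises ValueError when either marker is missing, for instance if the site rewords the message. A slightly more defensive sketch of the same idea could use str.partition instead. The function name number_between is hypothetical, and returning None for a missing marker is just one possible design choice:

```python
def number_between(text, before, after):
    """Return the integer found between two marker strings, or None if absent."""
    _, found_before, rest = text.partition(before)
    middle, found_after, _ = rest.partition(after)
    if not (found_before and found_after):
        return None  # at least one marker is missing
    middle = middle.strip()
    return int(middle) if middle.isdigit() else None


a = 'Resultado de Busca: Foram encontrados 264 casais'  # your page text
print(number_between(a, 'encontrados', 'casais'))
```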
Ale M.
Updated on June 22, 2022
Comments
Ale M. almost 2 years
I am working on a project to get information from a web page. In the HTML source I have the following:
Resultado de Busca: Foram encontrados 264 casais
I need to get the number between "encontrados" and "casais".
Is there any way in Python to do that? What string function should I use? I want to avoid using regular expressions in this case.
import urllib.request
f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
s = f.read()
print(s.split())
I got this so far, but now I am having trouble finding the number I need.
import urllib.request
f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
s = f.read()
num = int(s[s.index("encontrados")+len("encontrados"):s.index("casais")])
This gives me the error below:
TypeError: Type str doesn't support the buffer API
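The error most likely comes from the fact that f.read() returns bytes in Python 3, while "encontrados" is a str, and bytes.index does not accept a str argument. A minimal sketch of one way around it is to decode the bytes first. The UTF-8 encoding and the sample bytes below are assumptions, since the page's real encoding is not stated here:

```python
# Stand-in for f.read(): urlopen(...).read() returns bytes, not str.
raw = 'Resultado de Busca: Foram encontrados 264 casais'.encode('utf-8')

# Decode to str before searching with str markers (the encoding is an assumption).
s = raw.decode('utf-8', errors='replace')

start = s.index('encontrados') + len('encontrados')
end = s.index('casais')
num = int(s[start:end])  # int() ignores the surrounding spaces
print(num)
```

With the real response, raw would be f.read(), and the charset declared by the server, if any, can usually be read from f.headers.get_content_charset() rather than hard-coded.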