Search for a string inside HTML source with Python (3.3.1)

Solution 1

I'd recommend using a library such as Beautiful Soup if it's HTML you want to parse. No need for regex.

EDIT

Using the URL you just added, this is the sample code to get the HTML object out:

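import re
import urllib.request

from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

# Download the page; in Python 3, read() returns bytes, which Beautiful Soup accepts
data = urllib.request.urlopen('http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07').read()
soup = BeautifulSoup(data, 'html.parser')
# Find the first span whose class matches txt_resultad_busca_casamento
element = soup.find('span', attrs={'class': re.compile(r".*\btxt_resultad_busca_casamento\b.*")})
print(element.text)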

This will find the HTML span element on the page that has the class txt_resultad_busca_casamento, which I believe is the data you're trying to extract. From there you can just parse the .text attribute to get the exact data you're interested in.

EDIT 2

Oops, just realised that uses regular expressions... it seems class matching in BeautifulSoup isn't perfect! This line should work instead, at least until the site changes their HTML:

element = soup.find('div', attrs={'id': 'ctl00_body_uppBusca'}).find('span')
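
Once you have the span's text, you still need to pull the number out of it. Here is a minimal sketch that does this without regular expressions, assuming the text looks like the sample in the question. The helper name extract_count is hypothetical, not part of Beautiful Soup:

```python
def extract_count(text):
    """Return the first whitespace-separated token made only of digits, as an int."""
    for token in text.split():
        if token.isdigit():
            return int(token)
    raise ValueError("no number found in: %r" % text)


# Sample text taken from the question; in practice pass element.text instead.
sample = "Resultado de Busca: Foram encontrados 264 casais"
print(extract_count(sample))
```

This picks the first all-digit token, so it would need adjusting if the span ever contained another number before the count.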

Solution 2

You can't reliably parse HTML with regular expressions. If you treat the page as a plain bag of text instead, you can use a regex or simple string slicing, something like:

a = 'Resultado de Busca: Foram encontrados 264 casais' #your page text
num = int(a[a.index("encontrados")+len("encontrados"):a.index("casais")])
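
One thing to watch for on Python 3: urlopen(...).read() returns bytes, not str, and calling .index("encontrados") on bytes raises a TypeError. Decoding first avoids that. The sketch below is one possible way to wrap the same idea. The function name number_between is hypothetical, and the UTF-8 encoding is an assumption about the page (check the response headers for the real charset):

```python
def number_between(page, before="encontrados", after="casais"):
    """Return the integer found between two marker words in page."""
    # urlopen(...).read() gives bytes in Python 3; decode before using str markers.
    if isinstance(page, bytes):
        page = page.decode("utf-8", errors="replace")  # assumption: UTF-8 page
    # partition() splits on the first occurrence and never raises.
    _, found_before, rest = page.partition(before)
    middle, found_after, _ = rest.partition(after)
    if not found_before or not found_after:
        raise ValueError("marker text not found")
    return int(middle)  # int() ignores the surrounding whitespace


# Works on both bytes (as returned by read()) and str.
print(number_between(b"Resultado de Busca: Foram encontrados 264 casais"))
```

Using partition instead of index means a missing marker is reported with a clear message rather than a bare "substring not found" error.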
Author by

Ale M.

Updated on June 22, 2022

Comments

  • Ale M.
    Ale M. almost 2 years

    I am working on a project to get information from a web page. In the HTML source I have the following:

    Resultado de Busca: Foram encontrados 264 casais

    I need to get the number between "encontrados" and "casais".

    Is there any way in Python to do that? What string function should I use? I want to avoid using regular expressions in this case.

    import urllib.request
    f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
    s = f.read()
    
    print(s.split())
    

    I have this so far, but now I am having trouble finding the number I need.

    import urllib.request
    f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
    s = f.read()
    
    num = int(s[s.index("encontrados")+len("encontrados"):s.index("casais")])
    

    This gives me the error below:

    TypeError: Type str doesn't support the buffer API