How do I get rid of characters like &#x27; that appear instead of apostrophes?


The following BeautifulSoup documentation on entity conversion should be what you're looking for:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion
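That section covers the `convertEntities` argument of the BeautifulSoup 3 constructor. If you would rather not depend on that, here is a minimal sketch that assumes Python 3 (the question uses Python 2's urllib2) and uses the standard library's `html.unescape` after the tags have been stripped. The helper name `strip_tags_and_decode` and the sample string are made up for illustration; only the tag-stripping pattern comes from the question.

```python
import html
import re

# Same tag-stripping pattern as fineExpression in the question.
TAG_PATTERN = re.compile(r'<[^>]*>')


def strip_tags_and_decode(fragment):
    """Remove markup from an HTML fragment, then decode its entities."""
    text = TAG_PATTERN.sub('', fragment)
    # html.unescape turns references such as &#x27; or &amp; into characters.
    return html.unescape(text)


# Hypothetical sample input, not taken from the scraped site.
sample = '<div class="sodatext">It&#x27;s a test &amp; nothing more</div>'
print(strip_tags_and_decode(sample))
```

Decoding after stripping the tags matters: if the order were reversed, an escaped `&lt;b&gt;` in the text would turn into a real-looking tag and get removed by the regex.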

Author by

nindalf

Hobby programmer who enjoys playing with C++ and Python.

Updated on July 09, 2022

Comments

  • nindalf
    nindalf almost 2 years

    Possible Duplicate:
    Convert XML/HTML Entities into Unicode String in Python

    I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules.

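    import re
    import urllib2  # Python 2 only

    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

    # Fetch the page and re-serialize the parsed document as a string.
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response)
    responseString = str(soup)

    # First pass: grab each <div class="sodatext"> block, tags included.
    coarseExpression = re.compile('<div class="sodatext">[\n]*.*[\n]*</div>')
    coarseResult = coarseExpression.findall(responseString)

    # Second pass: strip every tag, leaving only the text.
    fineExpression = re.compile('<[^>]*>')
    fineResult = []

    for coarse in coarseResult:
        fine = fineExpression.sub('', coarse)
        #print(fine)
        fineResult.append(fine)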
    

    Unfortunately, characters such as apostrophes come out as HTML entities, like so: &#x27;. Is there a way to avoid this, or an easy way to replace them?