How to return plain text from Beautiful Soup instead of unicode

13,120

Solution 1

There is no such thing as plain text. What you see are bytes interpreted as text using the wrong character encoding: either the strings are encoded differently from what your terminal expects, or the error was introduced earlier by decoding the web page with the wrong encoding.

print x calls str(x), which returns a UTF-8 encoded bytestring for BeautifulSoup objects.

Try:

print unicode(x)

Or:

print x.encode('ascii', 'xmlcharrefreplace')
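As a minimal illustration of the text/bytes distinction (Python 3 shown, where str is already unicode; the sample sentence is adapted from the question's output):

```python
# A unicode string containing an en dash, like the Wikipedia list items.
text = u"153 BC \u2013 Roman consuls begin their year in office."

# Encoding turns abstract characters into bytes; UTF-8 can represent them all.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes.decode("utf-8") == text  # round-trips losslessly

# ASCII cannot represent the en dash, so a strict encode fails...
try:
    text.encode("ascii")
except UnicodeEncodeError:
    pass

# ...unless you pick an error strategy, e.g. XML character references:
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
# ascii_bytes now contains b"&#8211;" in place of the en dash
```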

Solution 2

fromEncoding (renamed to from_encoding for PEP 8 compliance) tells the parser how to interpret the data in the input. What you (your browser or urllib) receive from the server is just a stream of bytes. To make sense of it, i.e. to build a sequence of abstract characters from this stream of bytes (a process called decoding), one has to know how the information was encoded, and you must provide that information for your code to behave properly. Wikipedia tells you how they encode the data; it's stated right at the top of the source of each of their web pages, e.g.

<meta charset="UTF-8" />

Hence, the bytestream received from Wikipedia's web servers should be interpreted with the UTF-8 codec. You should invoke

soup = BeautifulSoup(html, from_encoding='utf-8')

instead of BeautifulSoup(html, fromEncoding='gbk'), which tries to decode the bytestream with some Chinese character codec (I guess you blindly copied that piece of code from here).
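The effect of picking the wrong codec can be demonstrated in a few lines (a sketch in Python 3; the place name is just an example string taken from the question's output):

```python
# Accented characters are two-byte sequences in UTF-8.
raw = u"F\u00e9lix Houphou\u00ebt-Boigny Stadium in Abidjan".encode("utf-8")

# Decoding with the codec the data was actually encoded with recovers the text.
assert raw.decode("utf-8") == u"F\u00e9lix Houphou\u00ebt-Boigny Stadium in Abidjan"

# Decoding the same bytes as GBK produces mojibake instead of an error,
# because many UTF-8 byte pairs happen to be valid GBK characters.
garbled = raw.decode("gbk", "replace")
```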

You really need to make sure that you understand the basic concept of text encodings. Actually, you want unicode in the output, which is an abstract representation of a sequence of characters/symbols. In this context, there is no such thing as "plain English".
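Rather than cutting the markup up with a regex (which is what leaves entities such as &#8211; behind in the question's output), the text can be pulled out by an HTML parser; BeautifulSoup itself offers x.get_text() for this. A dependency-free sketch using the standard library's html.parser (Python 3 shown; the class name is mine):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, decoding
    character references such as &#8211; along the way."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

p = TextExtractor()
p.feed('<li><a href="/wiki/153_BC" title="153 BC">153 BC</a> &#8211; '
       '<a href="/wiki/Roman_consul">Roman consuls</a> begin their '
       'year in office.</li>')
p.close()
# p.text() == '153 BC \u2013 Roman consuls begin their year in office.'
```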

Author

Alex Chumbley

Hey everyone, I'm a student studying computer science. I love using Stack, so I thought I'd join up and try to help a fraction of the number of people who have helped me. I'm trained, I guess you could say, to develop in Java and Python, but I'm getting really passionate about web and especially mobile development.

Updated on June 04, 2022

Comments

  • Alex Chumbley
    Alex Chumbley over 1 year

    I am using BeautifulSoup4 to scrape this web page, however I'm getting the weird unicode text that BeautifulSoup returns.

    Here is my code:

        site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
        hdr = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(site,headers=hdr)  
        req.add_header('Accept-Encoding', 'gzip')  # ask the server for gzip
        page = urllib2.urlopen(req)
        if page.info().get('Content-Encoding') == 'gzip':  # response is gzip-compressed
            data = page.read()
            data = StringIO.StringIO(data)
            gzipper = gzip.GzipFile(fileobj=data)
            html = gzipper.read()
            soup = BeautifulSoup(html, fromEncoding='gbk')
        else:
            soup = BeautifulSoup(page)
    
        section = soup.find('span', id='Events').parent
        events = section.find_next('ul').find_all('li')
        print soup.originalEncoding
        for x in events:
            print x
    

    Basically I want x to be in plain English. I get, instead, things that look like this:

    <li><a href="/wiki/153_BC" title="153 BC">153 BC</a> – <a href="/wiki/Roman_consul" title="Roman consul">Roman consuls</a> begin their year in office.</li>
    

    There's only one example in this particular string, but you get the idea.

    Related: I go on to cut up this string with some regex and other string cutting methods, should I switch this to plain text before or after I cut it up? I'm assuming it doesn't matter but seeing as I'm defering to SO anyways, I thought I'd ask.

    If anyone knows how to fix this, I'd appreciate it. Thanks

    EDIT: Thanks, J.F., for the tip. I now use this as my for loop:

        for x in events:
            x = x.encode('ascii')
            x = str(x)
            #Find Content
            regex2 = re.compile(">[^>]*<")
            textList = re.findall(regex2, x)
            text = "".join(textList)
            text = text.replace(">", "")
            text = text.replace("<", "")
            contents.append(text)
    

    However, I still get things like this:

    2013 &#8211; At least 60 people are killed and 200 injured in a stampede after celebrations at F&#233;lix Houphou&#235;t-Boigny Stadium in Abidjan, Ivory Coast.
    

    EDIT: Here is how I make my excel spreadsheet (csv) and send in my list

        rows = zip(days, contents)
        with open("events.csv", "wb") as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow(row)
    

    So the csv file is created during the program and everything is imported after the lists are generated. I just need it to be readable text at that point.
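    For reference, the same CSV step without the mojibake, sketched in Python 3 (where csv writes text directly once the file's encoding is set explicitly; the days/contents values are placeholders):

```python
import csv

# Placeholder data standing in for the lists built earlier.
days = [u"June 4"]
contents = [u"2013 \u2013 At least 60 people are killed in a stampede."]

# Open in text mode with an explicit encoding; csv then accepts unicode rows.
with open("events.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in zip(days, contents):
        writer.writerow(row)
```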